TinyLlama.cpp 1.0
A lightweight C++ implementation of the TinyLlama language model
Public Member Functions | Public Attributes
KVCache Struct Reference

Complete Key-Value cache for all transformer layers.

#include <model.h>

Collaboration diagram for KVCache (graph omitted)

Public Member Functions

void initialize (const ModelConfig &config, int total_num_model_layers, int num_gpu_layers_to_allocate, int max_seq_len_arg, int num_kv_heads, int head_dim, int max_batch_size_arg=1)
 Initializes the KV cache with given dimensions.
 
void clear_data ()
 
void initialize_batch (int batch_size)
 Initialize batch mode with specified number of sequences.
 
void destroy_gpu_resources ()
 
 ~KVCache ()
 

Public Attributes

std::vector< KVCacheLayer > layers
 
int seq_len = 0
 
std::vector< int > batch_seq_lens
 
int max_batch_size = 1
 
int current_batch_size = 0
 
int total_model_layers_ = 0
 
int max_seq_len_config_ = 0
 

Detailed Description

Complete Key-Value cache for all transformer layers.

Manages the KV cache across all layers of the transformer model, including memory management for both CPU and GPU implementations. Supports both single-sequence and multi-sequence batch processing.

Definition at line 151 of file model.h.

Constructor & Destructor Documentation

◆ ~KVCache()

KVCache::~KVCache ( )
inline

Definition at line 224 of file model.h.

224 {
225 destroy_gpu_resources();
226 }

References destroy_gpu_resources().

Member Function Documentation

◆ clear_data()

void KVCache::clear_data ( )
inline

Definition at line 180 of file model.h.

180 {
181 // Single-sequence mode (legacy compatibility)
182 seq_len = 0;
183
184 // Multi-sequence mode
185 current_batch_size = 0;
186 batch_seq_lens.clear();
187
188 // For batch processing, we MUST clear the actual KV data to prevent cross-sequence contamination
189 for (auto& layer : layers) {
190 std::fill(layer.k.begin(), layer.k.end(), 0.0f);
191 std::fill(layer.v.begin(), layer.v.end(), 0.0f);
192 }
193
194 // Logger::debug("[KVCache] clear_data() called. seq_len reset to 0. K/V vectors cleared for batch processing.");
195 }

References batch_seq_lens, current_batch_size, layers, and seq_len.

Referenced by tinyllama::TinyLlamaSession::generate(), and tinyllama::TinyLlamaSession::generate_batch().

◆ destroy_gpu_resources()

void KVCache::destroy_gpu_resources ( )

Definition at line 217 of file kv_cache.cpp.

217 {
218 // No-op for CPU-only builds
219}

Referenced by ~KVCache().

◆ initialize()

void KVCache::initialize ( const ModelConfig &  config,
int  total_num_model_layers,
int  num_gpu_layers_to_allocate,
int  max_seq_len_arg,
int  num_kv_heads,
int  head_dim,
int  max_batch_size_arg = 1 
)

Initializes the KV cache with given dimensions.

Parameters
config	The model configuration, used to determine if KVCache quantization is enabled.
total_num_model_layers	Total number of layers in the model (for sizing CPU cache vectors)
num_gpu_layers_to_allocate	Number of layers for which to allocate GPU device memory. Can be 0.
max_seq_len_arg	Maximum sequence length to cache
num_kv_heads	Number of key/value heads
head_dim	Dimension of each attention head
max_batch_size_arg	Maximum number of sequences for batch processing (default: 1 for single-sequence)

Definition at line 10 of file kv_cache.cpp.

10 void KVCache::initialize(const ModelConfig& config, int total_num_model_layers,
11                          int num_gpu_layers_to_allocate, int max_seq_len_arg,
12                          int num_kv_heads, int head_dim, int max_batch_size_arg)
13 {
14 this->total_model_layers_ = total_num_model_layers;
15 this->max_seq_len_config_ = max_seq_len_arg;
16 this->max_batch_size = max_batch_size_arg;
17 this->current_batch_size = 0;
18 this->batch_seq_lens.clear();
19 this->batch_seq_lens.resize(max_batch_size_arg, 0);
20 layers.resize(total_num_model_layers);
21 seq_len = 0;
22 Logger::info("Allocating KVCache host vectors...");
23 size_t cache_size_per_layer = static_cast<size_t>(max_seq_len_arg) *
24 static_cast<size_t>(max_batch_size_arg) *
25 static_cast<size_t>(num_kv_heads) *
26 static_cast<size_t>(head_dim);
27 if (cache_size_per_layer == 0 && max_seq_len_arg > 0 && total_num_model_layers > 0) {
28 throw std::runtime_error(
29 "KVCache (CPU): Calculated cache size is zero for non-empty model. Check parameters.");
30 }
31
32 for (int l = 0; l < total_num_model_layers; ++l) {
33 try {
34 layers[l].k.assign(cache_size_per_layer, 0.0f);
35 layers[l].v.assign(cache_size_per_layer, 0.0f);
36 } catch (const std::bad_alloc& e) {
37 Logger::error("Failed to allocate CPU KVCache for layer " +
38 std::to_string(l) + ": " + e.what());
39 throw;
40 }
41 }
42 Logger::info("KVCache (CPU) vectors allocated for " +
43 std::to_string(total_num_model_layers) + " layers.");
44
45#ifdef HAS_CUDA
46 this->allocated_num_layers = num_gpu_layers_to_allocate;
47 this->allocated_max_seq_len = max_seq_len_arg;
48 this->allocated_num_kv_heads = num_kv_heads;
49 this->allocated_head_dim = head_dim;
50
51 if (num_gpu_layers_to_allocate > 0) {
52 if (num_gpu_layers_to_allocate > total_num_model_layers) {
53 Logger::warning("KVCache::initialize: num_gpu_layers_to_allocate (" + std::to_string(num_gpu_layers_to_allocate) +
54 ") > total_num_model_layers (" + std::to_string(total_num_model_layers) +
55 "). Clamping to total_num_model_layers.");
56 this->allocated_num_layers = total_num_model_layers;
57 num_gpu_layers_to_allocate = total_num_model_layers;
58 }
59
60 size_t cache_elems_per_layer_gpu = static_cast<size_t>(max_seq_len_arg) *
61 static_cast<size_t>(num_kv_heads) *
62 static_cast<size_t>(head_dim);
63
64 size_t fp32_cache_bytes_per_layer_gpu = cache_elems_per_layer_gpu * sizeof(float);
65 size_t int8_cache_bytes_per_layer_gpu = cache_elems_per_layer_gpu * sizeof(int8_t);
66 size_t num_scales_per_layer_gpu = static_cast<size_t>(max_seq_len_arg) * static_cast<size_t>(num_kv_heads);
67 size_t scales_bytes_per_layer_gpu = num_scales_per_layer_gpu * sizeof(float);
68
69 if (cache_elems_per_layer_gpu == 0 && config.use_kvcache_quantization) {
70 throw std::runtime_error(
71 "KVCache (CUDA INT8): Calculated cache elements per layer is zero. Check parameters.");
72 } else if (cache_elems_per_layer_gpu == 0) {
73 throw std::runtime_error(
74 "KVCache (CUDA FP32): Calculated cache elements per layer is zero. Check parameters.");
75 }
76
77 if (config.use_kvcache_quantization) {
78 Logger::info("Allocating INT8 KVCache + FP32 Scales on GPU for " + std::to_string(num_gpu_layers_to_allocate) +
79 " layers. Data size per layer: " +
80 std::to_string(int8_cache_bytes_per_layer_gpu / (1024.0 * 1024.0)) +
81 " MB. Scales size per layer: " +
82 std::to_string(scales_bytes_per_layer_gpu / (1024.0 * 1024.0)) + " MB");
83 } else {
84 Logger::info("Allocating FP32 KVCache on GPU for " + std::to_string(num_gpu_layers_to_allocate) +
85 " layers, size per layer: " +
86 std::to_string(fp32_cache_bytes_per_layer_gpu / (1024.0 * 1024.0)) +
87 " MB");
88 }
89
90 int gpu_layer_start_model_idx = this->total_model_layers_ - num_gpu_layers_to_allocate;
91 Logger::info("KVCache GPU allocation will target model layers from index " + std::to_string(gpu_layer_start_model_idx) +
92 " to " + std::to_string(gpu_layer_start_model_idx + num_gpu_layers_to_allocate - 1));
93
94 for (int i = 0; i < num_gpu_layers_to_allocate; ++i) {
95 int current_model_idx_for_gpu = gpu_layer_start_model_idx + i;
96
97 if (current_model_idx_for_gpu < 0 || static_cast<size_t>(current_model_idx_for_gpu) >= layers.size()) {
98 Logger::error("KVCache::initialize: Calculated current_model_idx_for_gpu (" + std::to_string(current_model_idx_for_gpu) + ") is out of bounds for layers vector (size " + std::to_string(layers.size()) + "). Skipping this layer.");
99 continue;
100 }
101
102 if (layers[current_model_idx_for_gpu].k_dev_fp32) {
103 Logger::warning(
104 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " K dev fp32 pointer without proper destruction?");
105 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].k_dev_fp32));
106 layers[current_model_idx_for_gpu].k_dev_fp32 = nullptr;
107 }
108 if (layers[current_model_idx_for_gpu].v_dev_fp32) {
109 Logger::warning(
110 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " V dev fp32 pointer without proper destruction?");
111 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].v_dev_fp32));
112 layers[current_model_idx_for_gpu].v_dev_fp32 = nullptr;
113 }
114 if (layers[current_model_idx_for_gpu].k_dev_quantized) {
115 Logger::warning(
116 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " K dev quantized pointer without proper destruction?");
117 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].k_dev_quantized));
118 layers[current_model_idx_for_gpu].k_dev_quantized = nullptr;
119 }
120 if (layers[current_model_idx_for_gpu].v_dev_quantized) {
121 Logger::warning(
122 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " V dev quantized pointer without proper destruction?");
123 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].v_dev_quantized));
124 layers[current_model_idx_for_gpu].v_dev_quantized = nullptr;
125 }
126 if (layers[current_model_idx_for_gpu].k_dev_scales) {
127 Logger::warning(
128 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " K dev scales pointer without proper destruction?");
129 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].k_dev_scales));
130 layers[current_model_idx_for_gpu].k_dev_scales = nullptr;
131 }
132 if (layers[current_model_idx_for_gpu].v_dev_scales) {
133 Logger::warning(
134 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " V dev scales pointer without proper destruction?");
135 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].v_dev_scales));
136 layers[current_model_idx_for_gpu].v_dev_scales = nullptr;
137 }
138
139 if (config.use_kvcache_quantization) {
140 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].k_dev_quantized, int8_cache_bytes_per_layer_gpu));
141 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].v_dev_quantized, int8_cache_bytes_per_layer_gpu));
142 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].k_dev_scales, scales_bytes_per_layer_gpu));
143 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].v_dev_scales, scales_bytes_per_layer_gpu));
144
145 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].k_dev_quantized, 0, int8_cache_bytes_per_layer_gpu));
146 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].v_dev_quantized, 0, int8_cache_bytes_per_layer_gpu));
147 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].k_dev_scales, 0, scales_bytes_per_layer_gpu));
148 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].v_dev_scales, 0, scales_bytes_per_layer_gpu));
149 } else {
150 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].k_dev_fp32, fp32_cache_bytes_per_layer_gpu));
151 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].v_dev_fp32, fp32_cache_bytes_per_layer_gpu));
152 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].k_dev_fp32, 0, fp32_cache_bytes_per_layer_gpu));
153 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].v_dev_fp32, 0, fp32_cache_bytes_per_layer_gpu));
154 }
155 }
156 Logger::info("KVCache GPU allocation and zeroing complete for " + std::to_string(num_gpu_layers_to_allocate) + " layers.");
157 } else {
158 Logger::info("KVCache: No GPU layers requested for allocation (num_gpu_layers_to_allocate is 0). Skipping GPU KVCache allocation.");
159 this->allocated_num_layers = 0;
160 }
161
162#else
163 Logger::info("KVCache (CPU-only build) initialized with dimensions for " +
164 std::to_string(total_num_model_layers) + " layers, " +
165 std::to_string(max_seq_len_arg) + " seq len, " +
166 std::to_string(num_kv_heads) + " KV heads, " +
167 std::to_string(head_dim) + " head dim");
168#endif
169}

References batch_seq_lens, current_batch_size, Logger::error(), Logger::info(), layers, max_batch_size, max_seq_len_config_, seq_len, total_model_layers_, ModelConfig::use_kvcache_quantization, and Logger::warning().

Referenced by tinyllama::TinyLlamaSession::TinyLlamaSession().

◆ initialize_batch()

void KVCache::initialize_batch ( int  batch_size)
inline

Initialize batch mode with specified number of sequences.

Parameters
batch_sizeNumber of sequences to process in batch

Definition at line 201 of file model.h.

201 {
202 if (batch_size > max_batch_size) {
203 Logger::warning("Requested batch size " + std::to_string(batch_size) +
204 " exceeds max batch size " + std::to_string(max_batch_size) +
205 ". Using max batch size.");
206 batch_size = max_batch_size;
207 }
208 current_batch_size = batch_size;
209 batch_seq_lens.resize(batch_size, 0);
210 }

References batch_seq_lens, current_batch_size, max_batch_size, and Logger::warning().

Referenced by tinyllama::TinyLlamaSession::generate_batch().

Member Data Documentation

◆ batch_seq_lens

std::vector<int> KVCache::batch_seq_lens

Sequence lengths for each sequence in batch

Definition at line 158 of file model.h.

Referenced by clear_data(), TinyLlamaModel::forward_cpu_batch_generation(), initialize(), and initialize_batch().

◆ current_batch_size

int KVCache::current_batch_size = 0

Current number of active sequences

Definition at line 160 of file model.h.

Referenced by clear_data(), TinyLlamaModel::forward_cpu_batch_generation(), initialize(), and initialize_batch().

◆ layers

std::vector<KVCacheLayer> KVCache::layers

Per-layer key/value cache storage

Definition at line 152 of file model.h.

Referenced by clear_data(), and initialize().

◆ max_batch_size

int KVCache::max_batch_size = 1

Maximum number of sequences that can be cached

Definition at line 159 of file model.h.

Referenced by initialize(), initialize_batch(), update_kv_cache_batch_cpu(), and update_kv_cache_batch_cpu_sequence_aware().

◆ max_seq_len_config_

int KVCache::max_seq_len_config_ = 0

Maximum sequence length from the model configuration

Definition at line 163 of file model.h.

Referenced by initialize().

◆ seq_len

int KVCache::seq_len = 0

Current sequence length (single-sequence mode)

Definition at line 155 of file model.h.

Referenced by clear_data(), and initialize().

◆ total_model_layers_

int KVCache::total_model_layers_ = 0

Total number of layers in the model

Definition at line 162 of file model.h.

Referenced by initialize().


The documentation for this struct was generated from the following files:

model.h
kv_cache.cpp