TinyLlama.cpp 1.0
A lightweight C++ implementation of the TinyLlama language model
Public Member Functions | Public Attributes
KVCache Struct Reference

Complete Key-Value cache for all transformer layers.

#include <model.h>

Collaboration diagram for KVCache (graph omitted)

Public Member Functions

void initialize (const ModelConfig &config, int total_num_model_layers, int num_gpu_layers_to_allocate, int max_seq_len_arg, int num_kv_heads, int head_dim, int max_batch_size_arg=1)
 Initializes the KV cache with given dimensions.
 
void clear_data ()
 
void initialize_batch (int batch_size)
 Initialize batch mode with specified number of sequences.
 
void destroy_gpu_resources ()
 
 ~KVCache ()
 

Public Attributes

std::vector< KVCacheLayer > layers
 
int seq_len = 0
 
std::vector< int > batch_seq_lens
 
int max_batch_size = 1
 
int current_batch_size = 0
 
int total_model_layers_ = 0
 
int max_seq_len_config_ = 0
 

Detailed Description

Complete Key-Value cache for all transformer layers.

Manages the KV cache across all layers of the transformer model, including memory management for both CPU and GPU implementations. Supports both single-sequence and multi-sequence batch processing.

Definition at line 151 of file model.h.

Constructor & Destructor Documentation

◆ ~KVCache()

KVCache::~KVCache ( )
inline

Definition at line 224 of file model.h.

224 {
225 destroy_gpu_resources();
226 }

References destroy_gpu_resources().

Member Function Documentation

◆ clear_data()

void KVCache::clear_data ( )
inline

Definition at line 180 of file model.h.

180 {
181 // Single-sequence mode (legacy compatibility)
182 seq_len = 0;
183
184 // Multi-sequence mode
185 current_batch_size = 0;
186 batch_seq_lens.clear();
187
188 // For batch processing, we MUST clear the actual KV data to prevent cross-sequence contamination
189 for (auto& layer : layers) {
190 std::fill(layer.k.begin(), layer.k.end(), 0.0f);
191 std::fill(layer.v.begin(), layer.v.end(), 0.0f);
192 }
193
194 // Logger::debug("[KVCache] clear_data() called. seq_len reset to 0. K/V vectors cleared for batch processing.");
195 }

References batch_seq_lens, current_batch_size, layers, and seq_len.

Referenced by tinyllama::TinyLlamaSession::generate(), and tinyllama::TinyLlamaSession::generate_batch().

◆ destroy_gpu_resources()

void KVCache::destroy_gpu_resources ( )

Definition at line 217 of file kv_cache.cpp.

217 {
218 // No-op for CPU-only builds
219}

Referenced by ~KVCache().

◆ initialize()

void KVCache::initialize ( const ModelConfig &  config,
int  total_num_model_layers,
int  num_gpu_layers_to_allocate,
int  max_seq_len_arg,
int  num_kv_heads,
int  head_dim,
int  max_batch_size_arg = 1 
)

Initializes the KV cache with given dimensions.

Parameters
config	The model configuration, used to determine if KVCache quantization is enabled.
total_num_model_layers	Total number of layers in the model (for sizing CPU cache vectors)
num_gpu_layers_to_allocate	Number of layers for which to allocate GPU device memory. Can be 0.
max_seq_len_arg	Maximum sequence length to cache
num_kv_heads	Number of key/value heads
head_dim	Dimension of each attention head
max_batch_size_arg	Maximum number of sequences for batch processing (default: 1 for single-sequence)

Definition at line 10 of file kv_cache.cpp.

10 void KVCache::initialize(const ModelConfig& config, int total_num_model_layers,
11                          int num_gpu_layers_to_allocate, int max_seq_len_arg,
12                          int num_kv_heads, int head_dim, int max_batch_size_arg)
13 {
14 this->total_model_layers_ = total_num_model_layers;
15 this->max_seq_len_config_ = max_seq_len_arg;
16 this->max_batch_size = max_batch_size_arg;
17 this->current_batch_size = 0;
18 this->batch_seq_lens.clear();
19 this->batch_seq_lens.resize(max_batch_size_arg, 0);
20 layers.resize(total_num_model_layers);
21 seq_len = 0;
22 Logger::info("Allocating KVCache host vectors...");
23 size_t cache_size_per_layer = static_cast<size_t>(max_seq_len_arg) *
24 static_cast<size_t>(max_batch_size_arg) *
25 static_cast<size_t>(num_kv_heads) *
26 static_cast<size_t>(head_dim);
27 if (cache_size_per_layer == 0 && max_seq_len_arg > 0 && total_num_model_layers > 0) {
28 throw std::runtime_error(
29 "KVCache (CPU): Calculated cache size is zero for non-empty model. Check parameters.");
30 }
31
32 for (int l = 0; l < total_num_model_layers; ++l) {
33 try {
34 layers[l].k.assign(cache_size_per_layer, 0.0f);
35 layers[l].v.assign(cache_size_per_layer, 0.0f);
36 } catch (const std::bad_alloc& e) {
37 Logger::error("Failed to allocate CPU KVCache for layer " +
38 std::to_string(l) + ": " + e.what());
39 throw;
40 }
41 }
42 Logger::info("KVCache (CPU) vectors allocated for " +
43 std::to_string(total_num_model_layers) + " layers.");
44
45#ifdef HAS_CUDA
46 this->allocated_num_layers = num_gpu_layers_to_allocate;
47 this->allocated_max_seq_len = max_seq_len_arg;
48 this->allocated_num_kv_heads = num_kv_heads;
49 this->allocated_head_dim = head_dim;
50
51 if (num_gpu_layers_to_allocate > 0) {
52 if (num_gpu_layers_to_allocate > total_num_model_layers) {
53 Logger::warning("KVCache::initialize: num_gpu_layers_to_allocate (" + std::to_string(num_gpu_layers_to_allocate) +
54 ") > total_num_model_layers (" + std::to_string(total_num_model_layers) +
55 "). Clamping to total_num_model_layers.");
56 this->allocated_num_layers = total_num_model_layers;
57 num_gpu_layers_to_allocate = total_num_model_layers;
58 }
59
60 size_t cache_elems_per_layer_gpu = static_cast<size_t>(max_seq_len_arg) *
61 static_cast<size_t>(num_kv_heads) *
62 static_cast<size_t>(head_dim);
63
64 size_t fp32_cache_bytes_per_layer_gpu = cache_elems_per_layer_gpu * sizeof(float);
65 size_t int8_cache_bytes_per_layer_gpu = cache_elems_per_layer_gpu * sizeof(int8_t);
66 size_t num_scales_per_layer_gpu = static_cast<size_t>(max_seq_len_arg) * static_cast<size_t>(num_kv_heads);
67 size_t scales_bytes_per_layer_gpu = num_scales_per_layer_gpu * sizeof(float);
68
69 if (cache_elems_per_layer_gpu == 0 && config.use_kvcache_quantization) {
70 throw std::runtime_error(
71 "KVCache (CUDA INT8): Calculated cache elements per layer is zero. Check parameters.");
72 } else if (cache_elems_per_layer_gpu == 0) {
73 throw std::runtime_error(
74 "KVCache (CUDA FP32): Calculated cache elements per layer is zero. Check parameters.");
75 }
76
77 if (config.use_kvcache_quantization) {
78 Logger::info("Allocating INT8 KVCache + FP32 Scales on GPU for " + std::to_string(num_gpu_layers_to_allocate) +
79 " layers. Data size per layer: " +
80 std::to_string(int8_cache_bytes_per_layer_gpu / (1024.0 * 1024.0)) +
81 " MB. Scales size per layer: " +
82 std::to_string(scales_bytes_per_layer_gpu / (1024.0 * 1024.0)) + " MB");
83 } else {
84 Logger::info("Allocating FP32 KVCache on GPU for " + std::to_string(num_gpu_layers_to_allocate) +
85 " layers, size per layer: " +
86 std::to_string(fp32_cache_bytes_per_layer_gpu / (1024.0 * 1024.0)) +
87 " MB");
88 }
89
90 int gpu_layer_start_model_idx = this->total_model_layers_ - num_gpu_layers_to_allocate;
91 Logger::info("KVCache GPU allocation will target model layers from index " + std::to_string(gpu_layer_start_model_idx) +
92 " to " + std::to_string(gpu_layer_start_model_idx + num_gpu_layers_to_allocate - 1));
93
94 for (int i = 0; i < num_gpu_layers_to_allocate; ++i) {
95 int current_model_idx_for_gpu = gpu_layer_start_model_idx + i;
96
97 if (current_model_idx_for_gpu < 0 || static_cast<size_t>(current_model_idx_for_gpu) >= layers.size()) {
98 Logger::error("KVCache::initialize: Calculated current_model_idx_for_gpu (" + std::to_string(current_model_idx_for_gpu) + ") is out of bounds for layers vector (size " + std::to_string(layers.size()) + "). Skipping this layer.");
99 continue;
100 }
101
102 if (layers[current_model_idx_for_gpu].k_dev_fp32) {
103 Logger::warning(
104 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " K dev fp32 pointer without proper destruction?");
105 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].k_dev_fp32));
106 layers[current_model_idx_for_gpu].k_dev_fp32 = nullptr;
107 }
108 if (layers[current_model_idx_for_gpu].v_dev_fp32) {
109 Logger::warning(
110 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " V dev fp32 pointer without proper destruction?");
111 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].v_dev_fp32));
112 layers[current_model_idx_for_gpu].v_dev_fp32 = nullptr;
113 }
114 if (layers[current_model_idx_for_gpu].k_dev_quantized) {
115 Logger::warning(
116 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " K dev quantized pointer without proper destruction?");
117 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].k_dev_quantized));
118 layers[current_model_idx_for_gpu].k_dev_quantized = nullptr;
119 }
120 if (layers[current_model_idx_for_gpu].v_dev_quantized) {
121 Logger::warning(
122 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " V dev quantized pointer without proper destruction?");
123 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].v_dev_quantized));
124 layers[current_model_idx_for_gpu].v_dev_quantized = nullptr;
125 }
126 if (layers[current_model_idx_for_gpu].k_dev_scales) {
127 Logger::warning(
128 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " K dev scales pointer without proper destruction?");
129 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].k_dev_scales));
130 layers[current_model_idx_for_gpu].k_dev_scales = nullptr;
131 }
132 if (layers[current_model_idx_for_gpu].v_dev_scales) {
133 Logger::warning(
134 "KVCache::initialize: Re-initializing KVCache layer " + std::to_string(current_model_idx_for_gpu) + " V dev scales pointer without proper destruction?");
135 gpuErrchk(cudaFree(layers[current_model_idx_for_gpu].v_dev_scales));
136 layers[current_model_idx_for_gpu].v_dev_scales = nullptr;
137 }
138
139 if (config.use_kvcache_quantization) {
140 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].k_dev_quantized, int8_cache_bytes_per_layer_gpu));
141 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].v_dev_quantized, int8_cache_bytes_per_layer_gpu));
142 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].k_dev_scales, scales_bytes_per_layer_gpu));
143 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].v_dev_scales, scales_bytes_per_layer_gpu));
144
145 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].k_dev_quantized, 0, int8_cache_bytes_per_layer_gpu));
146 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].v_dev_quantized, 0, int8_cache_bytes_per_layer_gpu));
147 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].k_dev_scales, 0, scales_bytes_per_layer_gpu));
148 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].v_dev_scales, 0, scales_bytes_per_layer_gpu));
149 } else {
150 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].k_dev_fp32, fp32_cache_bytes_per_layer_gpu));
151 gpuErrchk(cudaMalloc(&layers[current_model_idx_for_gpu].v_dev_fp32, fp32_cache_bytes_per_layer_gpu));
152 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].k_dev_fp32, 0, fp32_cache_bytes_per_layer_gpu));
153 gpuErrchk(cudaMemset(layers[current_model_idx_for_gpu].v_dev_fp32, 0, fp32_cache_bytes_per_layer_gpu));
154 }
155 }
156 Logger::info("KVCache GPU allocation and zeroing complete for " + std::to_string(num_gpu_layers_to_allocate) + " layers.");
157 } else {
158 Logger::info("KVCache: No GPU layers requested for allocation (num_gpu_layers_to_allocate is 0). Skipping GPU KVCache allocation.");
159 this->allocated_num_layers = 0;
160 }
161
162#else
163 Logger::info("KVCache (CPU-only build) initialized with dimensions for " +
164 std::to_string(total_num_model_layers) + " layers, " +
165 std::to_string(max_seq_len_arg) + " seq len, " +
166 std::to_string(num_kv_heads) + " KV heads, " +
167 std::to_string(head_dim) + " head dim");
168#endif
169}

References batch_seq_lens, current_batch_size, Logger::error(), Logger::info(), layers, max_batch_size, max_seq_len_config_, seq_len, total_model_layers_, ModelConfig::use_kvcache_quantization, and Logger::warning().

Referenced by tinyllama::TinyLlamaSession::TinyLlamaSession().

◆ initialize_batch()

void KVCache::initialize_batch ( int  batch_size)
inline

Initialize batch mode with specified number of sequences.

Parameters
batch_sizeNumber of sequences to process in batch

Definition at line 201 of file model.h.

201 {
202 if (batch_size > max_batch_size) {
203 Logger::warning("Requested batch size " + std::to_string(batch_size) +
204 " exceeds max batch size " + std::to_string(max_batch_size) +
205 ". Using max batch size.");
206 batch_size = max_batch_size;
207 }
208 current_batch_size = batch_size;
209 batch_seq_lens.resize(batch_size, 0);
210 }

References batch_seq_lens, current_batch_size, max_batch_size, and Logger::warning().

Referenced by tinyllama::TinyLlamaSession::generate_batch().

Member Data Documentation

◆ batch_seq_lens

std::vector<int> KVCache::batch_seq_lens

Sequence lengths for each sequence in batch

Definition at line 158 of file model.h.

Referenced by clear_data(), TinyLlamaModel::forward_cpu_batch_generation(), initialize(), and initialize_batch().

◆ current_batch_size

int KVCache::current_batch_size = 0

Current number of active sequences

Definition at line 160 of file model.h.

Referenced by clear_data(), TinyLlamaModel::forward_cpu_batch_generation(), initialize(), and initialize_batch().

◆ layers

std::vector<KVCacheLayer> KVCache::layers

Per-layer key/value cache storage

Definition at line 152 of file model.h.

Referenced by clear_data(), and initialize().

◆ max_batch_size

int KVCache::max_batch_size = 1

Maximum number of sequences that can be cached

Definition at line 159 of file model.h.

Referenced by initialize(), initialize_batch(), update_kv_cache_batch_cpu(), and update_kv_cache_batch_cpu_sequence_aware().

◆ max_seq_len_config_

int KVCache::max_seq_len_config_ = 0

Maximum sequence length from the model configuration

Definition at line 163 of file model.h.

Referenced by initialize().

◆ seq_len

int KVCache::seq_len = 0

Current sequence length (single-sequence mode)

Definition at line 155 of file model.h.

Referenced by clear_data(), and initialize().

◆ total_model_layers_

int KVCache::total_model_layers_ = 0

Total number of layers in the model

Definition at line 162 of file model.h.

Referenced by initialize().


The documentation for this struct was generated from the following files:

model.h
kv_cache.cpp