TinyLlama.cpp 1.0
A lightweight C++ implementation of the TinyLlama language model

ModelConfig Struct Reference

Model configuration structure holding architecture and hyperparameters.

#include <model.h>

Public Types

enum class TokenizerFamily { UNKNOWN, LLAMA_SENTENCEPIECE, LLAMA3_TIKTOKEN }

Public Attributes

int hidden_size
int intermediate_size
int num_attention_heads
int num_key_value_heads
int num_hidden_layers
int vocab_size
int max_position_embeddings
float rms_norm_eps
float rope_theta
std::string hidden_act
std::string torch_dtype
int bos_token_id
int eos_token_id
int unk_token_id = -1
int pad_token_id = -1
std::string architecture
std::string model_name
std::string chat_template_type
std::string pre_tokenizer_type
std::string chat_template_string
bool is_gguf_file_loaded
bool use_mmap_for_gguf = true
bool use_kvcache_quantization = false
int num_cpu_offload_layers = 0
bool enable_memory_efficient_layers = true
bool enable_prefill_chunking = true
bool use_optimized_cuda_kernels = true
TokenizerFamily tokenizer_family = TokenizerFamily::UNKNOWN
Detailed Description

Model configuration structure holding architecture and hyperparameters.
Contains all key parameters needed to construct and run a transformer model, including the hidden size, number of layers, attention heads, vocabulary size, and special token IDs.
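For orientation, here is a minimal sketch that populates a ModelConfig by hand. The numeric values follow the published TinyLlama-1.1B hyperparameters and are illustrative assumptions, not values read from this codebase; in normal use the struct is filled by parse_model_config(), parse_model_config_from_gguf(), or SafeTensorsLoader::load_model_config_from_json().

    #include <model.h>

    // Illustrative only: TinyLlama-1.1B-style hyperparameters (assumed values).
    ModelConfig make_tinyllama_1b_config() {
      ModelConfig cfg;
      cfg.hidden_size             = 2048;   // width of the residual stream
      cfg.intermediate_size       = 5632;   // feed-forward inner dimension
      cfg.num_attention_heads     = 32;
      cfg.num_key_value_heads     = 4;      // GQA: 8 query heads share each KV head
      cfg.num_hidden_layers       = 22;
      cfg.vocab_size              = 32000;
      cfg.max_position_embeddings = 2048;
      cfg.rms_norm_eps            = 1e-5f;
      cfg.rope_theta              = 10000.0f;
      cfg.hidden_act              = "silu";
      cfg.bos_token_id            = 1;
      cfg.eos_token_id            = 2;
      cfg.architecture            = "llama";
      return cfg;
    }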
Member Enumeration Documentation

enum class ModelConfig::TokenizerFamily [strong]
Identifies the tokenizer family detected for the loaded model: UNKNOWN, LLAMA_SENTENCEPIECE (Llama 1/2-style SentencePiece), or LLAMA3_TIKTOKEN (Llama 3-style tiktoken).

Member Data Documentation
std::string ModelConfig::architecture
Model architecture identifier
Definition at line 96 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), and TinyLlamaModel::TinyLlamaModel().
int ModelConfig::bos_token_id
Beginning-of-sequence (BOS) token ID
Definition at line 92 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), Tokenizer::Tokenizer(), and Tokenizer::Tokenizer().
std::string ModelConfig::chat_template_string
Template string for chat formatting
Definition at line 100 of file model.h.
Referenced by parse_model_config_from_gguf(), and PYBIND11_MODULE().
std::string ModelConfig::chat_template_type
Type of chat template used
Definition at line 98 of file model.h.
Referenced by parse_model_config_from_gguf(), and PYBIND11_MODULE().
bool ModelConfig::enable_memory_efficient_layers = true
Enable automatic layer weight eviction during the forward pass
Definition at line 107 of file model.h.
Referenced by TinyLlamaModel::forward().
int ModelConfig::eos_token_id
End-of-sequence (EOS) token ID
Definition at line 93 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), tinyllama::TinyLlamaSession::TinyLlamaSession(), Tokenizer::Tokenizer(), and Tokenizer::Tokenizer().
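As a hedged sketch of how bos_token_id and eos_token_id typically drive a decoding loop; sample_next_token() is a hypothetical placeholder, not part of this API:

    #include <model.h>
    #include <vector>

    int sample_next_token(const std::vector<int>& tokens);  // hypothetical sampling step

    std::vector<int> decode_until_eos(const ModelConfig& cfg, int max_new_tokens) {
      std::vector<int> tokens{cfg.bos_token_id};   // sequences conventionally start with BOS
      for (int i = 0; i < max_new_tokens; ++i) {
        int next = sample_next_token(tokens);      // hypothetical helper
        if (next == cfg.eos_token_id) break;       // EOS terminates generation
        tokens.push_back(next);
      }
      return tokens;
    }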
std::string ModelConfig::hidden_act
Activation function in hidden layers
Definition at line 90 of file model.h.
Referenced by parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
int ModelConfig::hidden_size
Size of the hidden layers
Definition at line 81 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_generation_parallel(), tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::ensure_down_proj_dequantized(), TinyLlamaModel::ensure_embed_tokens_dequantized(), TinyLlamaModel::ensure_gate_proj_dequantized(), TinyLlamaModel::ensure_k_proj_dequantized(), TinyLlamaModel::ensure_lm_head_dequantized(), TinyLlamaModel::ensure_o_proj_dequantized(), TinyLlamaModel::ensure_q_proj_dequantized(), TinyLlamaModel::ensure_up_proj_dequantized(), TinyLlamaModel::ensure_v_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), TinyLlamaModel::lookup_embedding(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::intermediate_size
Size of the intermediate (feed-forward) layers
Definition at line 82 of file model.h.
Referenced by TinyLlamaModel::ensure_down_proj_dequantized(), TinyLlamaModel::ensure_gate_proj_dequantized(), TinyLlamaModel::ensure_up_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), and TinyLlamaModel::TinyLlamaModel().
bool ModelConfig::is_gguf_file_loaded
Flag indicating if model was loaded from GGUF format
Definition at line 101 of file model.h.
Referenced by TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), SafeTensorsLoader::load_model_config_from_json(), main(), PYBIND11_MODULE(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::max_position_embeddings
Maximum sequence length supported
Definition at line 87 of file model.h.
Referenced by TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
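One way a caller might clamp a prompt to this limit, shown as a sketch; the keep-the-most-recent-tokens policy is an assumption, and the session may handle overlong prompts differently:

    #include <model.h>
    #include <cstddef>
    #include <vector>

    // Truncate a prompt to the context window, keeping the most recent tokens.
    void clamp_to_context(std::vector<int>& prompt, const ModelConfig& cfg) {
      const std::size_t limit = static_cast<std::size_t>(cfg.max_position_embeddings);
      if (prompt.size() > limit)
        prompt.erase(prompt.begin(), prompt.end() - limit);
    }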
std::string ModelConfig::model_name
Name of the model
Definition at line 97 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
int ModelConfig::num_attention_heads
Number of attention heads
Definition at line 83 of file model.h.
Referenced by TinyLlamaModel::ensure_k_proj_dequantized(), TinyLlamaModel::ensure_v_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::num_cpu_offload_layers = 0
Number of layers to offload to CPU
Definition at line 104 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_generation_parallel(), tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::forward(), TinyLlamaModel::forward_cpu_batch_generation(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), tinyllama::TinyLlamaSession::TinyLlamaSession(), and TinyLlamaModel::~TinyLlamaModel().
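A sketch of the assumed placement rule: the first num_cpu_offload_layers transformer layers run on the CPU and the remainder on the GPU. The actual split is implemented in TinyLlamaModel::forward() and initialize_gpu_and_rope(); this helper only illustrates the convention:

    #include <model.h>

    // Assumed convention: layers [0, num_cpu_offload_layers) are CPU-resident.
    bool layer_runs_on_cpu(const ModelConfig& cfg, int layer_idx) {
      return layer_idx < cfg.num_cpu_offload_layers;
    }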
int ModelConfig::num_hidden_layers
Number of transformer layers
Definition at line 85 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_generation_parallel(), tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::forward(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), tinyllama::TinyLlamaSession::TinyLlamaSession(), and TinyLlamaModel::~TinyLlamaModel().
int ModelConfig::num_key_value_heads
Number of key/value heads for grouped-query attention
Definition at line 84 of file model.h.
Referenced by TinyLlamaModel::ensure_k_proj_dequantized(), TinyLlamaModel::ensure_v_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::initialize_gpu_and_rope(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
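The quantities that grouped-query attention derives from the two head counts, following standard Llama conventions; head_dim itself is not stored in the struct, so this derivation is an assumption:

    #include <model.h>

    struct AttentionDims { int head_dim; int gqa_group; int kv_dim; };

    AttentionDims derive_attention_dims(const ModelConfig& cfg) {
      AttentionDims d;
      d.head_dim  = cfg.hidden_size / cfg.num_attention_heads;          // per-head width
      d.gqa_group = cfg.num_attention_heads / cfg.num_key_value_heads;  // query heads per KV head
      d.kv_dim    = cfg.num_key_value_heads * d.head_dim;               // K/V projection width
      return d;
    }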
int ModelConfig::pad_token_id = -1
Padding token ID; defaults to -1 if not specified
Definition at line 95 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and Tokenizer::Tokenizer().
std::string ModelConfig::pre_tokenizer_type
Type of pre-tokenizer
Definition at line 99 of file model.h.
Referenced by parse_model_config_from_gguf(), and PYBIND11_MODULE().
float ModelConfig::rms_norm_eps
Epsilon for RMSNorm operation
Definition at line 88 of file model.h.
Referenced by TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
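For reference, a minimal RMSNorm sketch showing where the epsilon enters, y_i = x_i * w_i / sqrt(mean(x^2) + eps); this is the textbook formulation, not a quote of this codebase's kernels:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // In-place RMSNorm; w is the learned scale, same length as x.
    void rms_norm(std::vector<float>& x, const std::vector<float>& w, float eps) {
      float ss = 0.0f;
      for (float v : x) ss += v * v;  // sum of squares
      const float scale = 1.0f / std::sqrt(ss / static_cast<float>(x.size()) + eps);
      for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = x[i] * scale * w[i];
    }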
float ModelConfig::rope_theta
Base for rotary position embeddings
Definition at line 89 of file model.h.
Referenced by TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
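A sketch of the standard RoPE frequency table this base parameterizes, inv_freq[j] = theta^(-2j / head_dim); initialize_rope_freqs() presumably computes something along these lines, but that is an assumption:

    #include <cmath>
    #include <vector>

    std::vector<float> rope_inv_freqs(float theta, int head_dim) {
      std::vector<float> inv_freq(head_dim / 2);
      for (int j = 0; j < head_dim / 2; ++j)
        inv_freq[j] = std::pow(theta, -2.0f * j / head_dim);  // rotation rate per dim pair
      return inv_freq;
    }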
TokenizerFamily ModelConfig::tokenizer_family = TokenizerFamily::UNKNOWN
Tokenizer family detected for the loaded model; defaults to TokenizerFamily::UNKNOWN
Definition at line 117 of file model.h.
Referenced by tinyllama::TinyLlamaSession::generate(), tinyllama::TinyLlamaSession::generate_batch(), SafeTensorsLoader::load_model_config_from_json(), main(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
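A minimal sketch of dispatching on the detected family; the string labels are illustrative:

    #include <model.h>
    #include <string>

    std::string family_name(ModelConfig::TokenizerFamily f) {
      switch (f) {
        case ModelConfig::TokenizerFamily::LLAMA_SENTENCEPIECE: return "SentencePiece";
        case ModelConfig::TokenizerFamily::LLAMA3_TIKTOKEN:     return "tiktoken";
        default:                                                 return "unknown";
      }
    }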
std::string ModelConfig::torch_dtype
Data type used in the original PyTorch model
Definition at line 91 of file model.h.
Referenced by parse_model_config(), and PYBIND11_MODULE().
int ModelConfig::unk_token_id = -1
Unknown token ID; defaults to -1 if not specified
Definition at line 94 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and Tokenizer::Tokenizer().
bool ModelConfig::use_kvcache_quantization = false
Whether to use INT8 quantization for the KV cache on the GPU
Definition at line 103 of file model.h.
Referenced by KVCache::initialize(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
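A back-of-the-envelope sketch of what the flag saves. The per-element sizes (1 byte for INT8 vs. 2 bytes for FP16) are assumptions, and INT8 scale/zero-point overhead is ignored:

    #include <model.h>
    #include <cstddef>

    // Rough KV-cache footprint for seq_len cached tokens.
    std::size_t kv_cache_bytes(const ModelConfig& cfg, int seq_len) {
      const int head_dim = cfg.hidden_size / cfg.num_attention_heads;
      const std::size_t bytes_per_elem = cfg.use_kvcache_quantization ? 1 : 2;
      // 2 tensors (K and V) * layers * KV heads * head_dim * tokens * bytes/element
      return 2ull * cfg.num_hidden_layers * cfg.num_key_value_heads *
             head_dim * seq_len * bytes_per_elem;
    }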
bool ModelConfig::use_mmap_for_gguf = true
Whether to memory-map (mmap) the GGUF file instead of reading it fully into memory; defaults to true
Definition at line 102 of file model.h.
Referenced by TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::vocab_size
Size of the vocabulary
Definition at line 86 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::ensure_embed_tokens_dequantized(), TinyLlamaModel::ensure_lm_head_dequantized(), TinyLlamaModel::forward(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), tinyllama::TinyLlamaSession::generate(), tinyllama::TinyLlamaSession::generate_batch(), TinyLlamaModel::get_vocab_size(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), TinyLlamaModel::lookup_embedding(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), and TinyLlamaModel::TinyLlamaModel().
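A sketch of greedy decoding over a logits vector of this length; the surrounding generation code is assumed, not quoted:

    #include <model.h>
    #include <algorithm>
    #include <vector>

    // Pick the highest-scoring token from a logits vector of vocab_size entries.
    int greedy_pick(const std::vector<float>& logits, const ModelConfig& cfg) {
      auto it = std::max_element(logits.begin(), logits.begin() + cfg.vocab_size);
      return static_cast<int>(it - logits.begin());
    }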