TinyLlama.cpp 1.0
A lightweight C++ implementation of the TinyLlama language model

ModelConfig Struct Reference

Model configuration structure holding architecture and hyperparameters.

#include <model.h>

Public Types

enum class TokenizerFamily { UNKNOWN, LLAMA_SENTENCEPIECE, LLAMA3_TIKTOKEN }

Public Attributes

int hidden_size
int intermediate_size
int num_attention_heads
int num_key_value_heads
int num_hidden_layers
int vocab_size
int max_position_embeddings
float rms_norm_eps
float rope_theta
std::string hidden_act
std::string torch_dtype
int bos_token_id
int eos_token_id
int unk_token_id = -1
int pad_token_id = -1
std::string architecture
std::string model_name
std::string chat_template_type
std::string pre_tokenizer_type
std::string chat_template_string
bool is_gguf_file_loaded
bool use_mmap_for_gguf = true
bool use_kvcache_quantization = false
int num_cpu_offload_layers = 0
bool enable_memory_efficient_layers = true
bool enable_prefill_chunking = true
bool use_optimized_cuda_kernels = true
TokenizerFamily tokenizer_family = TokenizerFamily::UNKNOWN
Detailed Description

Model configuration structure holding architecture and hyperparameters.
Contains all key parameters needed to construct and run a transformer model, including the hidden size, number of layers, attention heads, vocabulary size, and special token IDs.
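For orientation, here is a minimal sketch that populates a ModelConfig by hand. The numeric values follow the published TinyLlama-1.1B hyperparameters and are illustrative assumptions, not values read from this codebase; in normal use the struct is filled by parse_model_config(), parse_model_config_from_gguf(), or SafeTensorsLoader::load_model_config_from_json().

    #include <model.h>

    // Illustrative only: TinyLlama-1.1B-style hyperparameters (assumed values).
    ModelConfig make_tinyllama_1b_config() {
      ModelConfig cfg;
      cfg.hidden_size             = 2048;   // width of the residual stream
      cfg.intermediate_size       = 5632;   // feed-forward inner dimension
      cfg.num_attention_heads     = 32;
      cfg.num_key_value_heads     = 4;      // GQA: 8 query heads share each KV head
      cfg.num_hidden_layers       = 22;
      cfg.vocab_size              = 32000;
      cfg.max_position_embeddings = 2048;
      cfg.rms_norm_eps            = 1e-5f;
      cfg.rope_theta              = 10000.0f;
      cfg.hidden_act              = "silu";
      cfg.bos_token_id            = 1;
      cfg.eos_token_id            = 2;
      cfg.architecture            = "llama";
      return cfg;
    }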
Member Enumeration Documentation

enum class ModelConfig::TokenizerFamily [strong]
Identifies the tokenizer family detected for the loaded model: UNKNOWN, LLAMA_SENTENCEPIECE (Llama 1/2-style SentencePiece), or LLAMA3_TIKTOKEN (Llama 3-style tiktoken).

Member Data Documentation
std::string ModelConfig::architecture
Model architecture identifier
Definition at line 96 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), and TinyLlamaModel::TinyLlamaModel().
int ModelConfig::bos_token_id
Beginning-of-sequence (BOS) token ID
Definition at line 92 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), Tokenizer::Tokenizer(), and Tokenizer::Tokenizer().
std::string ModelConfig::chat_template_string
Template string for chat formatting
Definition at line 100 of file model.h.
Referenced by parse_model_config_from_gguf(), and PYBIND11_MODULE().
std::string ModelConfig::chat_template_type
Type of chat template used
Definition at line 98 of file model.h.
Referenced by parse_model_config_from_gguf(), and PYBIND11_MODULE().
bool ModelConfig::enable_memory_efficient_layers = true
Enable automatic layer weight eviction during the forward pass
Definition at line 107 of file model.h.
Referenced by TinyLlamaModel::forward().
int ModelConfig::eos_token_id
End-of-sequence (EOS) token ID
Definition at line 93 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), tinyllama::TinyLlamaSession::TinyLlamaSession(), Tokenizer::Tokenizer(), and Tokenizer::Tokenizer().
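As a hedged sketch of how bos_token_id and eos_token_id typically drive a decoding loop; sample_next_token() is a hypothetical placeholder, not part of this API:

    #include <model.h>
    #include <vector>

    int sample_next_token(const std::vector<int>& tokens);  // hypothetical sampling step

    std::vector<int> decode_until_eos(const ModelConfig& cfg, int max_new_tokens) {
      std::vector<int> tokens{cfg.bos_token_id};   // sequences conventionally start with BOS
      for (int i = 0; i < max_new_tokens; ++i) {
        int next = sample_next_token(tokens);      // hypothetical helper
        if (next == cfg.eos_token_id) break;       // EOS terminates generation
        tokens.push_back(next);
      }
      return tokens;
    }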
std::string ModelConfig::hidden_act
Activation function in hidden layers
Definition at line 90 of file model.h.
Referenced by parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
int ModelConfig::hidden_size
Size of the hidden layers
Definition at line 81 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_generation_parallel(), tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::ensure_down_proj_dequantized(), TinyLlamaModel::ensure_embed_tokens_dequantized(), TinyLlamaModel::ensure_gate_proj_dequantized(), TinyLlamaModel::ensure_k_proj_dequantized(), TinyLlamaModel::ensure_lm_head_dequantized(), TinyLlamaModel::ensure_o_proj_dequantized(), TinyLlamaModel::ensure_q_proj_dequantized(), TinyLlamaModel::ensure_up_proj_dequantized(), TinyLlamaModel::ensure_v_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), TinyLlamaModel::lookup_embedding(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::intermediate_size
Size of the intermediate (feed-forward) layers
Definition at line 82 of file model.h.
Referenced by TinyLlamaModel::ensure_down_proj_dequantized(), TinyLlamaModel::ensure_gate_proj_dequantized(), TinyLlamaModel::ensure_up_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), and TinyLlamaModel::TinyLlamaModel().
bool ModelConfig::is_gguf_file_loaded
Flag indicating if model was loaded from GGUF format
Definition at line 101 of file model.h.
Referenced by TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), SafeTensorsLoader::load_model_config_from_json(), main(), PYBIND11_MODULE(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::max_position_embeddings
Maximum sequence length supported
Definition at line 87 of file model.h.
Referenced by TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
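One way a caller might clamp a prompt to this limit, shown as a sketch; the keep-the-most-recent-tokens policy is an assumption, and the session may handle overlong prompts differently:

    #include <model.h>
    #include <cstddef>
    #include <vector>

    // Truncate a prompt to the context window, keeping the most recent tokens.
    void clamp_to_context(std::vector<int>& prompt, const ModelConfig& cfg) {
      const std::size_t limit = static_cast<std::size_t>(cfg.max_position_embeddings);
      if (prompt.size() > limit)
        prompt.erase(prompt.begin(), prompt.end() - limit);
    }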
std::string ModelConfig::model_name
Name of the model
Definition at line 97 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
int ModelConfig::num_attention_heads
Number of attention heads
Definition at line 83 of file model.h.
Referenced by TinyLlamaModel::ensure_k_proj_dequantized(), TinyLlamaModel::ensure_v_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::num_cpu_offload_layers = 0
Number of layers to offload to CPU
Definition at line 104 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_generation_parallel(), tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::forward(), TinyLlamaModel::forward_cpu_batch_generation(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), tinyllama::TinyLlamaSession::TinyLlamaSession(), and TinyLlamaModel::~TinyLlamaModel().
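A sketch of the assumed placement rule: the first num_cpu_offload_layers transformer layers run on the CPU and the remainder on the GPU. The actual split is implemented in TinyLlamaModel::forward() and initialize_gpu_and_rope(); this helper only illustrates the convention:

    #include <model.h>

    // Assumed convention: layers [0, num_cpu_offload_layers) are CPU-resident.
    bool layer_runs_on_cpu(const ModelConfig& cfg, int layer_idx) {
      return layer_idx < cfg.num_cpu_offload_layers;
    }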
int ModelConfig::num_hidden_layers
Number of transformer layers
Definition at line 85 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_generation_parallel(), tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::forward(), tinyllama::TinyLlamaSession::generate(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), TinyLlamaModel::TinyLlamaModel(), tinyllama::TinyLlamaSession::TinyLlamaSession(), and TinyLlamaModel::~TinyLlamaModel().
int ModelConfig::num_key_value_heads
Number of key/value heads for grouped-query attention
Definition at line 84 of file model.h.
Referenced by TinyLlamaModel::ensure_k_proj_dequantized(), TinyLlamaModel::ensure_v_proj_dequantized(), TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::initialize_gpu_and_rope(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), TinyLlamaModel::smart_gemm_batch_cuda(), TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
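The quantities that grouped-query attention derives from the two head counts, following standard Llama conventions; head_dim itself is not stored in the struct, so this derivation is an assumption:

    #include <model.h>

    struct AttentionDims { int head_dim; int gqa_group; int kv_dim; };

    AttentionDims derive_attention_dims(const ModelConfig& cfg) {
      AttentionDims d;
      d.head_dim  = cfg.hidden_size / cfg.num_attention_heads;          // per-head width
      d.gqa_group = cfg.num_attention_heads / cfg.num_key_value_heads;  // query heads per KV head
      d.kv_dim    = cfg.num_key_value_heads * d.head_dim;               // K/V projection width
      return d;
    }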
int ModelConfig::pad_token_id = -1
Padding token ID; defaults to -1 if not specified
Definition at line 95 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and Tokenizer::Tokenizer().
std::string ModelConfig::pre_tokenizer_type
Type of pre-tokenizer
Definition at line 99 of file model.h.
Referenced by parse_model_config_from_gguf(), and PYBIND11_MODULE().
float ModelConfig::rms_norm_eps
Epsilon for RMSNorm operation
Definition at line 88 of file model.h.
Referenced by TinyLlamaModel::forward(), CPUBatchProcessor::forward_cpu_batch(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
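For reference, a minimal RMSNorm sketch showing where the epsilon enters, y_i = x_i * w_i / sqrt(mean(x^2) + eps); this is the textbook formulation, not a quote of this codebase's kernels:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // In-place RMSNorm; w is the learned scale, same length as x.
    void rms_norm(std::vector<float>& x, const std::vector<float>& w, float eps) {
      float ss = 0.0f;
      for (float v : x) ss += v * v;  // sum of squares
      const float scale = 1.0f / std::sqrt(ss / static_cast<float>(x.size()) + eps);
      for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = x[i] * scale * w[i];
    }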
float ModelConfig::rope_theta
Base for rotary position embeddings
Definition at line 89 of file model.h.
Referenced by TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_rope_freqs(), SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and PYBIND11_MODULE().
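A sketch of the standard RoPE frequency table this base parameterizes, inv_freq[j] = theta^(-2j / head_dim); initialize_rope_freqs() presumably computes something along these lines, but that is an assumption:

    #include <cmath>
    #include <vector>

    std::vector<float> rope_inv_freqs(float theta, int head_dim) {
      std::vector<float> inv_freq(head_dim / 2);
      for (int j = 0; j < head_dim / 2; ++j)
        inv_freq[j] = std::pow(theta, -2.0f * j / head_dim);  // rotation rate per dim pair
      return inv_freq;
    }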
TokenizerFamily ModelConfig::tokenizer_family = TokenizerFamily::UNKNOWN
Tokenizer family detected for the loaded model; defaults to TokenizerFamily::UNKNOWN
Definition at line 117 of file model.h.
Referenced by tinyllama::TinyLlamaSession::generate(), tinyllama::TinyLlamaSession::generate_batch(), SafeTensorsLoader::load_model_config_from_json(), main(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
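A minimal sketch of dispatching on the detected family; the string labels are illustrative:

    #include <model.h>
    #include <string>

    std::string family_name(ModelConfig::TokenizerFamily f) {
      switch (f) {
        case ModelConfig::TokenizerFamily::LLAMA_SENTENCEPIECE: return "SentencePiece";
        case ModelConfig::TokenizerFamily::LLAMA3_TIKTOKEN:     return "tiktoken";
        default:                                                 return "unknown";
      }
    }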
std::string ModelConfig::torch_dtype
Data type used in the original PyTorch model
Definition at line 91 of file model.h.
Referenced by parse_model_config(), and PYBIND11_MODULE().
int ModelConfig::unk_token_id = -1
Unknown token ID; defaults to -1 if not specified
Definition at line 94 of file model.h.
Referenced by SafeTensorsLoader::load_model_config_from_json(), parse_model_config(), parse_model_config_from_gguf(), and Tokenizer::Tokenizer().
bool ModelConfig::use_kvcache_quantization = false
Whether to use INT8 quantization for the KV cache on the GPU
Definition at line 103 of file model.h.
Referenced by KVCache::initialize(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
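A back-of-the-envelope sketch of what the flag saves. The per-element sizes (1 byte for INT8 vs. 2 bytes for FP16) are assumptions, and INT8 scale/zero-point overhead is ignored:

    #include <model.h>
    #include <cstddef>

    // Rough KV-cache footprint for seq_len cached tokens.
    std::size_t kv_cache_bytes(const ModelConfig& cfg, int seq_len) {
      const int head_dim = cfg.hidden_size / cfg.num_attention_heads;
      const std::size_t bytes_per_elem = cfg.use_kvcache_quantization ? 1 : 2;
      // 2 tensors (K and V) * layers * KV heads * head_dim * tokens * bytes/element
      return 2ull * cfg.num_hidden_layers * cfg.num_key_value_heads *
             head_dim * seq_len * bytes_per_elem;
    }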
bool ModelConfig::use_mmap_for_gguf = true
Whether to memory-map (mmap) the GGUF file instead of reading it fully into memory; defaults to true
Definition at line 102 of file model.h.
Referenced by TinyLlamaModel::TinyLlamaModel(), and tinyllama::TinyLlamaSession::TinyLlamaSession().
int ModelConfig::vocab_size
Size of the vocabulary
Definition at line 86 of file model.h.
Referenced by tinyllama::TinyLlamaSession::batch_prefill_parallel(), TinyLlamaModel::ensure_embed_tokens_dequantized(), TinyLlamaModel::ensure_lm_head_dequantized(), TinyLlamaModel::forward(), TinyLlamaModel::forward_cpu_batch_generation(), TinyLlamaModel::forward_cpu_logits_batch(), tinyllama::TinyLlamaSession::generate(), tinyllama::TinyLlamaSession::generate_batch(), TinyLlamaModel::get_vocab_size(), TinyLlamaModel::initialize_gpu_and_rope(), TinyLlamaModel::initialize_weights(), SafeTensorsLoader::load_model_config_from_json(), TinyLlamaModel::lookup_embedding(), parse_model_config(), parse_model_config_from_gguf(), PYBIND11_MODULE(), and TinyLlamaModel::TinyLlamaModel().
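A sketch of greedy decoding over a logits vector of this length; the surrounding generation code is assumed, not quoted:

    #include <model.h>
    #include <algorithm>
    #include <vector>

    // Pick the highest-scoring token from a logits vector of vocab_size entries.
    int greedy_pick(const std::vector<float>& logits, const ModelConfig& cfg) {
      auto it = std::max_element(logits.begin(), logits.begin() + cfg.vocab_size);
      return static_cast<int>(it - logits.begin());
    }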