# TinyLlama.cpp 1.0

A lightweight C++ implementation of the TinyLlama language model.
This codebase supports inference for Llama 2/3 architecture models in both GGUF and SafeTensors formats, and also provides Python bindings.
GGUF support includes loading models with tensor types such as BF16, FP16, and FP32, as well as the quantized types Q4_K_M, Q6_K, Q8_0, and Q8_K.
Note for Older Llama Models (Llama 2/TinyLlama): Some older GGUF files may not contain explicit BPE merge rules. The system automatically handles this by generating merge rules from vocabulary and token scores (similar to llama.cpp's approach), ensuring proper tokenization without requiring external files.
An HTTP server (built with cpp-httplib) is also included for easy interaction via a web UI.

This section outlines the necessary components to build and run the project, and how to obtain model files.
Core requirements to build and run the C++ application:
- Debian/Ubuntu: sudo apt update && sudo apt install build-essential cmake libboost-all-dev libomp-dev. Installing libboost-all-dev is simplest; for a minimal setup, ensure libboost-regex-dev and the Boost Xpressive headers are installed.
- Fedora: sudo dnf install gcc-c++ cmake boost-devel libgomp (or libomp-devel).
- macOS: brew install cmake boost llvm (llvm provides OpenMP; extra flags may be needed if clang isn't finding OpenMP).
- Windows (vcpkg): vcpkg install boost-regex nlohmann-json cpp-httplib (OpenMP is usually included with MSVC). Add boost-headers or a full boost package if boost-regex alone is insufficient.
- Windows (Chocolatey): choco install cmake visualstudio2022buildtools boost-msvc-14.3 (or similar for your VS and Boost versions).
- OpenMP is optional (libomp-dev on Debian, libgomp on Fedora, from llvm on macOS, or part of MSVC). Performance will be lower without it.
- CUDA (optional, for GPU acceleration): ensure nvcc is in your PATH. On Ubuntu, sudo apt install nvidia-cuda-toolkit libcublas-dev can be used after NVIDIA drivers are set up. The CMake build (with -DHAS_CUDA=ON) will detect it; nvcc (the compiler) and cuBLAS (the library) are key.

To run the model, you need both the model weights and tokenizer information. These should be placed in an accessible directory (e.g., data/ or models/ within your project structure).
For SafeTensors models, the following files are needed:

- config.json: Contains the model architecture, hyperparameters, and other metadata.
- tokenizer.json: Defines the vocabulary, merge rules, and other tokenizer configurations. Required for the SafeTensors format.
- model.safetensors: The file containing the model weights. F32, BF16, and F16 weight types are supported from SafeTensors; BF16 and F16 are automatically converted to F32 upon loading, and internal computation then proceeds in F32.

For GGUF models, everything is contained in a single file (.gguf). Set tokenizer_path to the same path as the model file, or omit it entirely in the Python bindings. Supported tensor types include FP32, FP16, BF16, and common quantized types like Q4_K_M, Q6_K, Q8_0, Q8_K, etc., as supported by the underlying GGUF parsing library.

It's recommended to download models from reputable sources like Hugging Face. Here are some examples that have been tested:
- SafeTensors format: a directory with config.json, tokenizer.json, and model.safetensors.
- GGUF format: a single .gguf file (tokenizer.json and config.json may also be provided alongside it).

Use CMake to build the project. Navigate to the project root directory in your terminal.
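A typical out-of-source CMake build looks like the following (add -DHAS_CUDA=ON, mentioned above, only if you want the CUDA backend; exact options may vary by platform):

```bash
# Configure the build (append -DHAS_CUDA=ON to enable the CUDA backend)
cmake -B build -DCMAKE_BUILD_TYPE=Release
# Compile using all available cores
cmake --build build --config Release -j
```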
This process will create executables in the build/ directory (or a subdirectory like build/Release/ on Windows with MSVC). Key executables include:
- tinyllama: The main command-line interface for direct interaction (chat/prompt modes).
- tinyllama_server: A web server for interacting with SafeTensors models via a UI (this may be merged or evolve depending on project direction).

Running the tinyllama Executable:
Key Command-Line Arguments for tinyllama:
- <model_path>: Path to model file (.gguf) or directory (SafeTensors).
- <tokenizer_path>: Path to tokenizer file (e.g., tokenizer.json).
- <num_threads>: Number of CPU threads for generation.
- <prompt|chat>: prompt for single generation, chat for interactive mode.
- --system-prompt "<text>" (Optional): System-level instruction.
- initial_user_prompt (Optional): First user message or main prompt text.
- --max-tokens <N> (Optional): Max new tokens to generate (Default: 256).
- --n-gpu-layers <N> (Optional): Layers to offload to GPU (-1 for all, 0 for none. Default: -1).
- --use-mmap <true|false> (Optional): Memory-map GGUF files (Default: true).
- --temperature <F> (Optional): Sampling temperature (e.g., 0.1. Default: 0.1).
- --top-k <N> (Optional): Top-K sampling parameter (0 to disable. Default: 40).
- --top-p <F> (Optional): Top-P/nucleus sampling parameter (0.0-1.0. Default: 0.9).
- --use-kv-quant <true|false> (Optional): Use INT8 KVCache on GPU (Default: false).
- --use-batch-generation <true|false> (Optional): Enable single-token batch generation (Default: false).
- --max-batch-size <N> (Optional): Maximum number of sequences for multi-prompt batch processing (Default: 1).

Note on Sampling Parameters: The tinyllama executable supports --temperature, --top-k, and --top-p via the command line for full control over text generation sampling.
Example Invocation:
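For example, running a GGUF model in chat mode (the paths are placeholders; for GGUF the tokenizer path can simply repeat the model path, as noted above, and the executable is assumed to be in build/):

```bash
# Interactive chat with 4 CPU threads, keeping all layers on the CPU
./build/tinyllama models/model.gguf models/model.gguf 4 chat \
  --system-prompt "You are a helpful assistant." \
  --max-tokens 256 --n-gpu-layers 0 --temperature 0.1 --top-k 40 --top-p 0.9
```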
For detailed operational logs, inspect debugging.log in the application's runtime directory.
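For example, to follow the log in real time while the application runs:

```bash
# Stream new log entries as they are written
tail -f debugging.log
```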
Development Installation (CPU-only):
Development Installation with CUDA Support:
Development Installation with PyTorch Dependencies:
Editable Development Installation:
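A minimal sketch of these four installation variants, assuming the package is installed from the repository root and that the extras names defined in pyproject.toml are gpu and pytorch (both extras names are assumptions, not confirmed here):

```bash
# CPU-only development installation
pip install .

# With CUDA support (hypothetical extras name "gpu")
pip install ".[gpu]"

# With PyTorch dependencies (hypothetical extras name "pytorch")
pip install ".[pytorch]"

# Editable development installation
pip install -e .
```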
Prerequisites for CUDA Build:
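At a minimum, the CUDA toolkit (providing nvcc) must be installed and visible on your PATH before attempting a GPU build, as noted in the requirements above. A quick check:

```bash
# The CUDA compiler should report its version if the toolkit is set up correctly
nvcc --version
```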
Usage:
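Since the bindings are built with pybind11 and documented for Python's help() (see the project layout later in this README), one way to explore the installed API from the command line, assuming the package name tinyllama_cpp:

```bash
# Print the module docstring and the documented classes/functions
python -c "import tinyllama_cpp; help(tinyllama_cpp)"
```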
For ease of use, comprehensive scripts are provided in the project root to automate common development and project tasks. These scripts simplify building, cleaning, running the applications, formatting code, generating documentation, and packaging releases.
First, make the script executable:
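From the project root (the script sits at the repository top level, as described in the project layout below):

```bash
chmod +x manage.sh
```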
Key Command Options (refer to ./manage.sh help for all options):
- ./manage.sh build [--build-type <Release|Debug>] [--cuda <ON|OFF>]
- ./manage.sh run-server [--model-dir <path>] [--tokenizer <path>] [--threads <num>] [--host <hostname>] [--port <num>] [--n-gpu-layers <num>] [--mmap <true|false>] [--no-log]
- ./manage.sh run-chat [--model-dir <path>] [--tokenizer <path>] [--threads <num>] [--system-prompt <text>] [--prompt <text>] [--steps <num>] [--n-gpu-layers <num>] [--mmap <true|false>]
  (run-chat-specific sampling parameters like temperature, top-k, and top-p are set to defaults in the C++ main.)
- ./manage.sh run-prompt [--model-dir <path>] [--tokenizer <path>] [--prompt <text>] [--steps <num>] [--threads <num>] [--n-gpu-layers <num>] [--mmap <true|false>] [--temperature <num>]
  If --model-dir is not provided, you can specify the model directory/GGUF file path as a single positional argument after run-prompt, e.g. ./manage.sh run-prompt path/to/your/model --prompt "Translate to French: Hello"
- ./manage.sh install [--gpu|--cpu]
  - --cpu (default): CPU-only installation
  - --gpu: Installation with CUDA support (requires CUDA toolkit)
  Use ./manage.sh install --gpu for GPU support or ./manage.sh install --cpu for CPU-only.

It is recommended to use this script for most routine operations. For detailed options for each command, please run ./manage.sh help.
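For instance, a typical build-and-chat workflow using the options above (the model directory is a placeholder):

```bash
# Build in Release mode with the CUDA backend enabled
./manage.sh build --build-type Release --cuda ON
# Start an interactive chat session with 4 CPU threads and all layers offloaded to the GPU
./manage.sh run-chat --model-dir ./data --threads 4 --n-gpu-layers -1
```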
For Windows users, the manage.ps1 script provides equivalent functionality to manage.sh.
Running the script (example):
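For example, from a PowerShell prompt in the project root (a representative invocation; the available options are listed below):

```powershell
# Build the project in Release mode without CUDA
.\manage.ps1 build -BuildType Release -Cuda OFF
```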
Key Command Options (refer to .\manage.ps1 help for all options):
- .\manage.ps1 build [-BuildType <Release|Debug>] [-Cuda <ON|OFF>]
- .\manage.ps1 run-server [-ModelDir <path>] [-TokenizerPath <path>] [-Threads <num>] [-Host <hostname>] [-Port <num>] [-NGpuLayers <num>] [-Mmap <$true|$false>] [-NoLog]
- .\manage.ps1 run-chat [-ModelDir <path>] [-TokenizerPath <path>] [-Threads <num>] [-SystemPrompt <text>] [-Prompt <text>] [-Steps <num>] [-NGpuLayers <num>] [-Mmap <$true|$false>]
  (run-chat-specific sampling parameters like temperature, top-k, and top-p are set to defaults in the C++ main.)
- .\manage.ps1 run-prompt [-ModelDir <path>] [-TokenizerPath <path>] [-Prompt <text>] [-Steps <num>] [-Threads <num>] [-NGpuLayers <num>] [-Mmap <$true|$false>] [-Temperature <num>]
  If -ModelDir is not provided, you can specify the model directory/GGUF file path as a single positional argument after run-prompt. Example: .\manage.ps1 run-prompt -ModelDir path\to\your\model -Prompt "What is the capital of France?"
- .\manage.ps1 install [-Gpu|-Cpu]
  - -Cpu (default): CPU-only installation
  - -Gpu: Installation with CUDA support (requires CUDA toolkit)
  Use .\manage.ps1 install -Gpu for GPU support or .\manage.ps1 install -Cpu for CPU-only.

For detailed options for each command, run .\manage.ps1 help.
The main way to use this project is via the web server:
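One way to launch it is via the management script described above (the tinyllama_server binary in build/ can also be run directly); the ./data directory is a placeholder:

```bash
# Start the HTTP server and point it at a directory of SafeTensors model files
./manage.sh run-server --model-dir ./data --port 8080
```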
Replace ./data with the actual path to the directory containing your config.json, tokenizer.json, and model.safetensors. The server listens on http://localhost:8080 by default; once it is running, open http://localhost:8080 in your browser.

For users interested in a Python-based reference or for direct PyTorch inference with SafeTensors models (compatible with the Llama 2 / TinyLlama architecture), a dedicated implementation is available in the pytorch/ directory.
This directory contains:
- run_inference.py: The main script to execute inference using PyTorch.
- tinyllama.py: Contains the PyTorch model definition (e.g., for TinyLlama).
- utils.py: Utility helper functions.
- requirements.txt: Lists the necessary Python packages to run the PyTorch inference scripts. Install these using pip install -r pytorch/requirements.txt.
- README.md: A dedicated README within the pytorch/ directory provides more specific instructions on how to set up and run the PyTorch-based inference.

This can be useful for:
Please refer to the pytorch/README.md for detailed usage instructions for this PyTorch implementation.
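As a starting point, the PyTorch dependencies can be installed as described above; the --help flag below is only an assumption about run_inference.py's argument parser, so consult pytorch/README.md for the actual options:

```bash
# Install the PyTorch-specific dependencies
pip install -r pytorch/requirements.txt
# List the inference script's options (assumes a standard argparse --help flag)
python pytorch/run_inference.py --help
```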
Project Structure:

- CMakeLists.txt: Main build configuration defining dependencies, targets, and compilation options.
- pyproject.toml: Modern Python packaging configuration with optional dependencies for GPU and PyTorch support.
- manage.sh: Comprehensive management script for Linux/macOS (build, clean, run, format, docs, etc.).
- manage.ps1: Windows PowerShell equivalent of the management script.
- .clang-format: Code formatting configuration for consistent C++ style.
- Doxyfile: Doxygen configuration for generating API documentation.
- README.md: This comprehensive documentation file.
- main.cpp: Command-line interface entry point for the tinyllama executable.
- server.cpp: HTTP server implementation for web UI interaction (tinyllama_server executable).
- api.cpp/api.h: High-level TinyLlamaSession API for model loading and text generation.
- bindings.cpp: Python bindings using pybind11 with comprehensive documentation for help() support.
- model.cpp/model.h: Core Transformer architecture (attention, feed-forward, RoPE, etc.) with SIMD optimizations.
- model_constants.h: Architecture constants and configuration parameters.
- model_macros.h: Utility macros for cross-platform compatibility and safe operations.
- gguf_structs.h: Data structures and type definitions for GGUF format parsing.
- ggml_types.h: Type definitions compatible with GGML format specifications.
- tokenizer.cpp/tokenizer.h: BPE tokenization, chat template application, and multi-format tokenizer support.
- safetensors_loader.cpp/safetensors_loader.h: SafeTensors format parsing and tensor loading.
- gguf_parser.cpp/gguf_parser.h: GGUF format parsing with support for various quantizations.
- quantization.cpp/quantization.h: Quantization utilities and dequantization routines.
- utils.cpp/utils.h: General utility functions for string processing, file operations, and helper routines.
- cuda_kernels.cu/cuda_kernels.h: CUDA kernels for GPU-accelerated inference.
- logger.cpp/logger.h: Logging utilities with GPU memory monitoring.
- tinyllama_cpp/: Python package directory
  - __init__.py: Package initialization with dynamic versioning and error handling.
  - _version.py: Auto-generated version file (created during build).
- pytorch/: Pure PyTorch implementation for comparison and experimentation
  - run_inference.py: Main PyTorch inference script.
  - tinyllama.py: PyTorch model definition.
  - utils.py: Utility functions for the PyTorch implementation.
  - requirements.txt: PyTorch-specific dependencies.
  - README.md: PyTorch implementation documentation.
- examples/: Example scripts and usage demonstrations.
- www/: Web interface assets for the HTTP server.
- docs/: Generated documentation and additional guides.
- build/: CMake build directory (created during compilation); contains the tinyllama and tinyllama_server executables.
- _skbuild/: Python build artifacts (created during pip install).
- debugging.log: Runtime debugging output (created during execution).