fast-llama
Runs LLaMA with extremely high speed
Fast-LLaMA: A High-Performance Inference Engine
Description
fast-llama is a high-performance inference engine for LLMs such as LLaMA, written in pure C++. It can run an 8-bit quantized LLaMA2-7B model on a 56-core CPU at ~25 tokens/s. It outperforms current open-source inference engines on CPU, running roughly 2.5x faster than the renowned llama.cpp.
Features
| Feature Name | Current Support | Future Support |
|---|---|---|
| Model Types | ✅LLaMA2 | Other LLMs such as Baichuan, StableDiffusion |
| Quantization | ✅INT16, ✅INT8 | INT4 |
| Model Formats | ✅HuggingFace, ✅gguf (by llama.cpp), ✅flm | |
| Systems | ✅Linux, ✅Windows | macOS, Android, iOS |
| CPU/GPU | ✅x86-64 CPU | ARM, Apple Mx CPUs, GPU, CPU+GPU |
| Architectures | ✅UMA, ✅NUMA | |
Advantages
Why should you use Fast-LLaMA?
- Fast
  - Extremely fast on CPU: faster than any other engine on GitHub, including llama.cpp.
- Simple
  - Fewer than 7k lines of well-organized C++ code, with no dependencies except libnuma (only needed on multi-CPU machines).
- "Easy To Use" (target ☺️)
Quick Start
Compile
Only Linux is currently supported. Support for other platforms, including Windows, macOS, and GPU, is coming soon.
Requirements
- GCC 10.x or newer
- libnuma-dev if your computer has more than one physical CPU (see the NUMA sketch below)
- Linux kernel v5.x or higher is needed for NUMA
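If you are wondering why libnuma-dev is required on multi-socket machines, the sketch below shows the kind of NUMA-aware setup such engines typically perform: pinning one worker thread per NUMA node and allocating that thread's buffer from the node's local memory. This is a minimal illustration of the standard libnuma API written for this README, not fast-llama's actual scheduling code.

```cpp
// numa_sketch.cpp -- illustrative only; build with: g++ -O2 numa_sketch.cpp -lnuma -pthread
#include <numa.h>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

int main() {
    if (numa_available() < 0) {
        std::puts("NUMA not available; a single-socket machine does not need libnuma.");
        return 0;
    }
    int nodes = numa_num_configured_nodes();  // e.g., 2 on a 2-socket server
    std::vector<std::thread> workers;
    for (int node = 0; node < nodes; ++node) {
        workers.emplace_back([node] {
            // Keep this thread on the given NUMA node so its memory accesses stay local.
            numa_run_on_node(node);
            // Allocate the thread's working buffer from this node's local memory.
            const size_t bytes = 64 * 1024 * 1024;
            void* buf = numa_alloc_onnode(bytes, node);
            if (buf) {
                std::memset(buf, 0, bytes);  // touch pages so they are actually placed on this node
                numa_free(buf, bytes);
            }
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```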
Compilation
Method 1. Using the provided build script:
bash ./build.sh
Method 2. Using Make:
make -j 4
Run
1. Run with llama2.c models:
Step 1: Download a model
See llama2.c
Step 2: Run the model
./main -c ./models/stories110M.bin -z ./models/tokenizer.bin -j 14 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'
2. Run with Hugging Face format models
Step 1: Download a model
Step 2: Convert the model into FLM format
python3 ./tools/convert_flm.py -m /path/to/model-directory -o ./models/model-name-int8.flm -t int8
Step 3: Run the model
./main -c ./models/model-name-int8.flm -j 40 -n 200 -i 'That was a long long story happened in the ancient China.'
All supported command-line options are as follows:
- -c: Path to the model file
- -f: Model file format (e.g., gguf)
- -j: Number of threads to use (e.g., 56)
- -q: Quantization mode (e.g., int8)
- -n: Number of tokens to generate (e.g., 200)
- -i: Input text (e.g., 'That was a long long story happened in the ancient China.')
- -h: Show usage information
Performance
Below are some preliminary test results:
| Model | Model Size | Output Speed (8 threads) | Output Speed (28 threads) | Output Speed (56 threads) |
|---|---|---|---|---|
| stories110M | 110M | 237 tps | 400 tps | 440 tps |
| Chinese-LLaMA-1.3B | 1.3B | 38.9 tps | 127 tps | 155 tps |
| Chinese-LLaMA-7B | 7B | 7.4 tps | 17.4 tps | 23.5 tps |
- Note: tps = tokens per second
Testing Conditions
- Testing prompt: "That was a long long story happened in the ancient Europe. It was about a brave boy name Oliver. Oliver lived in a small village among many big moutains. It was a beautiful village."
- Quantization: int8
- NUMA: 2 sockets (note: make sure that NUMA is truly available if you expect to accelerate with NUMA)
- System (uname -a): Linux coderlsf 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- CPU: 56 physical cores, AVX-512
  - Architecture: x86_64
  - Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
  - CPU(s): 112 (56 physical cores)
  - Thread(s) per core: 2
  - Core(s) per socket: 28
  - Socket(s): 2
Latency of the first token will be optimized later.
Why
Why is it so fast?
- Ultimate memory efficiency
  - Zero memory allocations and frees during inference
  - Maximized memory locality
- Well-designed thread scheduling algorithm
- Optimized operators
  - Fuse all operators that can be fused together
  - Optimize the computation of several operators
- Proper quantization (see the sketch below)
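As a concrete illustration of the fusion and quantization points above, the sketch below shows symmetric int8 quantization and a fused quantized dot product that accumulates in int32 and applies both scales once at the end, with all buffers allocated up front. This is a simplified scalar sketch written for this README under those assumptions, not fast-llama's actual AVX-512 kernels.

```cpp
// int8_dot_sketch.cpp -- simplified illustration, not the real fast-llama kernels
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Symmetric per-tensor int8 quantization: scale = max|x| / 127, q = round(x / scale).
static float quantize_int8(const float* x, int8_t* q, size_t n) {
    float max_abs = 0.f;
    for (size_t i = 0; i < n; ++i) max_abs = std::max(max_abs, std::fabs(x[i]));
    const float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
    for (size_t i = 0; i < n; ++i) q[i] = static_cast<int8_t>(std::lround(x[i] / scale));
    return scale;
}

// Fused quantized dot product: integer multiply-accumulate, dequantize once at the end.
// A real kernel would replace this scalar loop with AVX-512 intrinsics.
static float dot_int8(const int8_t* a, float sa, const int8_t* b, float sb, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += static_cast<int32_t>(a[i]) * b[i];
    return sa * sb * static_cast<float>(acc);
}

int main() {
    const size_t n = 4096;
    // Buffers are allocated once, up front -- nothing is allocated inside the hot path.
    std::vector<float> x(n), w(n);
    std::vector<int8_t> xq(n), wq(n);
    for (size_t i = 0; i < n; ++i) { x[i] = std::sin(0.01f * i); w[i] = std::cos(0.02f * i); }

    const float sx = quantize_int8(x.data(), xq.data(), n);
    const float sw = quantize_int8(w.data(), wq.data(), n);
    std::printf("int8 dot = %f\n", dot_int8(xq.data(), sx, wq.data(), sw, n));
    return 0;
}
```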
License
fast-llama is licensed under the MIT License.
Acknowledgements
Special thanks to AlpinDale for his professional, meticulous, and patient guidance and assistance.
Contact
Email: 📩[email protected]
Contact me if you have any questions.