Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit
Research Stage
- [ ] Background Research (Let's try to avoid reinventing the wheel)
- [ ] Hypothesis Formed (How do you think this will work and what will its effect be?)
- [ ] Strategy / Implementation Forming
- [x] Analysis of results
- [ ] Debrief / Documentation (So people in the future can learn from us)
Previous existing literature and research
Command
./llama.cpp/build/bin/llama-cli \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--prompt "<|User|>What is the capital of Italy?<|Assistant|>"
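For throughput numbers that are easier to compare against published figures, llama.cpp also ships a dedicated llama-bench tool. The sketch below is an assumed-equivalent invocation rather than the command used for the results in this issue; the prompt/generation lengths (512/128) and repetition count are arbitrary choices, and flag spellings should be checked against the local build.

```bash
# Hypothetical llama-bench run for the same model and offload settings.
# -p/-n set prompt and generation lengths; -r averages over 3 repetitions.
./llama.cpp/build/bin/llama-bench \
  -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  -ngl 61 -t 12 \
  -p 512 -n 128 \
  -r 3
```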
Model
DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S, 1.58-bit, 131 GB
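As a quick sanity check that all three shards downloaded completely, the total on-disk size can be compared against the expected ~131 GB (the directory path is assumed from the command above):

```bash
# Sum the sizes of the GGUF shards; the grand total should be roughly 131G.
du -ch DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/*.gguf | tail -n 1
```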
Hardware
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 58W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:2A:00.0 Off | 0 |
| N/A 32C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
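The snapshot above was taken before the model was loaded (0 MiB used, no processes). To confirm how much of the model actually sits in VRAM while llama-cli is generating, a polling query like the one below can be left running in a second terminal; the available field names can be checked with nvidia-smi --help-query-gpu.

```bash
# Print per-GPU memory use and utilization once per second during the run.
nvidia-smi --query-gpu=index,name,memory.used,utilization.gpu --format=csv -l 1
```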
Hypothesis
Reported performance is 140 tokens/second.
Implementation
No response
Analysis
Llama.cpp Performance Analysis
Raw Benchmarks
llama_perf_sampler_print: sampling time = 2.45 ms / 35 runs ( 0.07 ms per token, 14297.39 tokens per second)
llama_perf_context_print: load time = 20988.11 ms
llama_perf_context_print: prompt eval time = 1233.88 ms / 10 tokens ( 123.39 ms per token, 8.10 tokens per second)
llama_perf_context_print: eval time = 2612.63 ms / 24 runs ( 108.86 ms per token, 9.19 tokens per second)
llama_perf_context_print: total time = 3869.00 ms / 34 tokens
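These summary lines appear at the end of the llama-cli run. If the run's console output is saved to a file (the filename below is hypothetical), the perf summary can be pulled out later with a simple filter:

```bash
# Keep only the perf summary lines from a saved run log (r1-iq1s.log is a hypothetical name).
grep -E 'llama_perf_(sampler|context)_print' r1-iq1s.log
```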
Detailed Analysis
1. Token Sampling Performance
- Total Time: 2.45 ms for 35 runs
- Per Token: 0.07 ms
- Speed: 14,297.39 tokens per second
- Description: This represents the speed at which the model can select the next token after processing. This is extremely fast compared to the actual generation speed, as it only involves the final selection process.
2. Model Loading
- Total Time: 20,988.11 ms (≈21 seconds)
- Description: One-time initialization cost to load the model into memory. This happens only at startup and doesn't affect ongoing performance.
3. Prompt Evaluation
- Total Time: 1,233.88 ms for 10 tokens
- Per Token: 123.39 ms
- Speed: 8.10 tokens per second
- Description: Per token, prompt processing here is slightly slower than subsequent generation (123.39 ms vs. 108.86 ms), likely because a 10-token prompt is too short for batched prompt processing to amortize the cost of the first forward pass.
4. Generation Evaluation
- Total Time: 2,612.63 ms for 24 runs
- Per Token: 108.86 ms
- Speed: 9.19 tokens per second
- Description: This represents the actual speed of generating new tokens, including all neural network computations.
5. Total Processing Time
- Total Time: 3,869.00 ms
- Tokens Processed: 34 tokens
- Average Speed: ≈8.79 tokens per second (34 tokens / 3.869 s; see the quick check after this list)
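As a quick check, the per-stage rates quoted in items 3, 4 and 5 can be recomputed directly from the raw totals with plain shell arithmetic:

```bash
# Recompute tokens/second from the totals in the raw benchmark output.
awk 'BEGIN {
  printf "prompt eval : %.2f tok/s\n", 10 / (1233.88 / 1000.0)   # -> 8.10
  printf "generation  : %.2f tok/s\n", 24 / (2612.63 / 1000.0)   # -> 9.19
  printf "end-to-end  : %.2f tok/s\n", 34 / (3869.00 / 1000.0)   # -> 8.79
}'
```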
Key Insights
- Performance Bottlenecks:
  - The main bottleneck is in the evaluation phase (actual token generation)
  - While sampling can handle 14K+ tokens per second, actual generation is limited to about 9 tokens per second
  - This difference highlights that the neural network computations, not the token selection process, are the limiting factor
- Processing Stages:
  - Model loading is a significant but one-time cost
  - Prompt evaluation is slightly slower than subsequent token generation
  - Sampling is extremely fast compared to evaluation
- Overall Performance:
  - The measured ~9 tokens per second falls well short of the 140 tokens/second figure cited in the hypothesis (the gap is quantified in the sketch below)
  - Single-stream decoding of a 131 GB model is memory-bandwidth bound, so a single-digit rate is not surprising in itself; however, this run used two A100-80GB GPUs rather than consumer hardware, so the source of the discrepancy with the reported figure deserves further investigation
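To put the last point in numbers, the gap between the measured generation rate and the figure from the hypothesis works out as follows (all inputs are taken from the raw benchmark output above):

```bash
# Compare the measured generation speed against the reported figure from the hypothesis.
awk 'BEGIN {
  reported = 140.0            # tokens/s cited in the hypothesis
  measured = 24 / 2.61263     # generation tokens/s from the eval line above
  printf "reported : %6.2f ms/token\n", 1000.0 / reported   # ~7.1 ms
  printf "measured : %6.2f ms/token\n", 1000.0 / measured   # ~108.9 ms
  printf "gap      : %.1fx slower than reported\n", reported / measured
}'
```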
Relevant log output