Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit
Research Stage
- [ ] Background Research (Let's try to avoid reinventing the wheel)
- [ ] Hypothesis Formed (How do you think this will work and what will its effect be?)
- [ ] Strategy / Implementation Forming
- [x] Analysis of results
- [ ] Debrief / Documentation (So people in the future can learn from us)
Previous existing literature and research
Command
./llama.cpp/build/bin/llama-cli \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--prompt "<|User|>What is the capital of Italy?<|Assistant|>"
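For throughput numbers that are easier to compare against published figures, llama.cpp also ships a dedicated llama-bench tool. The sketch below is an assumed-equivalent invocation rather than the command used for the results in this issue; the prompt/generation lengths (512/128) and repetition count are arbitrary choices, and flag spellings should be checked against the local build.

```bash
# Hypothetical llama-bench run for the same model and offload settings.
# -p/-n set prompt and generation lengths; -r averages over 3 repetitions.
./llama.cpp/build/bin/llama-bench \
  -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  -ngl 61 -t 12 \
  -p 512 -n 128 \
  -r 3
```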
Model
DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S, 1.58-bit, 131 GB
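As a quick sanity check that all three shards downloaded completely, the total on-disk size can be compared against the expected ~131 GB (the directory path is assumed from the command above):

```bash
# Sum the sizes of the GGUF shards; the grand total should be roughly 131G.
du -ch DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/*.gguf | tail -n 1
```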
Hardware
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 58W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:2A:00.0 Off | 0 |
| N/A 32C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
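The snapshot above was taken before the model was loaded (0 MiB used, no processes). To confirm how much of the model actually sits in VRAM while llama-cli is generating, a polling query like the one below can be left running in a second terminal; the available field names can be checked with nvidia-smi --help-query-gpu.

```bash
# Print per-GPU memory use and utilization once per second during the run.
nvidia-smi --query-gpu=index,name,memory.used,utilization.gpu --format=csv -l 1
```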
Hypothesis
Reported performance is 140 tokens/second.
Implementation
No response
Analysis
Llama.cpp Performance Analysis
Raw Benchmarks
llama_perf_sampler_print: sampling time = 2.45 ms / 35 runs ( 0.07 ms per token, 14297.39 tokens per second)
llama_perf_context_print: load time = 20988.11 ms
llama_perf_context_print: prompt eval time = 1233.88 ms / 10 tokens ( 123.39 ms per token, 8.10 tokens per second)
llama_perf_context_print: eval time = 2612.63 ms / 24 runs ( 108.86 ms per token, 9.19 tokens per second)
llama_perf_context_print: total time = 3869.00 ms / 34 tokens
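These summary lines appear at the end of the llama-cli run. If the run's console output is saved to a file (the filename below is hypothetical), the perf summary can be pulled out later with a simple filter:

```bash
# Keep only the perf summary lines from a saved run log (r1-iq1s.log is a hypothetical name).
grep -E 'llama_perf_(sampler|context)_print' r1-iq1s.log
```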
Detailed Analysis
1. Token Sampling Performance
- Total Time: 2.45 ms for 35 runs
- Per Token: 0.07 ms
- Speed: 14,297.39 tokens per second
- Description: This represents the speed at which the model can select the next token after processing. This is extremely fast compared to the actual generation speed, as it only involves the final selection process.
2. Model Loading
- Total Time: 20,988.11 ms (≈21 seconds)
- Description: One-time initialization cost to load the model into memory. This happens only at startup and doesn't affect ongoing performance.
3. Prompt Evaluation
- Total Time: 1,233.88 ms for 10 tokens
- Per Token: 123.39 ms
- Speed: 8.10 tokens per second
- Description: Per token, prompt processing here is slightly slower than subsequent generation (123.39 ms vs. 108.86 ms), likely because a 10-token prompt is too short for batched prompt processing to amortize the cost of the first forward pass.
4. Generation Evaluation
- Total Time: 2,612.63 ms for 24 runs
- Per Token: 108.86 ms
- Speed: 9.19 tokens per second
- Description: This represents the actual speed of generating new tokens, including all neural network computations.
5. Total Processing Time
- Total Time: 3,869.00 ms
- Tokens Processed: 34 tokens
- Average Speed: ≈8.79 tokens per second (34 tokens / 3.869 s; see the quick check after this list)
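As a quick check, the per-stage rates quoted in items 3, 4 and 5 can be recomputed directly from the raw totals with plain shell arithmetic:

```bash
# Recompute tokens/second from the totals in the raw benchmark output.
awk 'BEGIN {
  printf "prompt eval : %.2f tok/s\n", 10 / (1233.88 / 1000.0)   # -> 8.10
  printf "generation  : %.2f tok/s\n", 24 / (2612.63 / 1000.0)   # -> 9.19
  printf "end-to-end  : %.2f tok/s\n", 34 / (3869.00 / 1000.0)   # -> 8.79
}'
```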
Key Insights
- Performance Bottlenecks:
  - The main bottleneck is in the evaluation phase (actual token generation)
  - While sampling can handle 14K+ tokens per second, actual generation is limited to about 9 tokens per second
  - This difference highlights that the neural network computations, not the token selection process, are the limiting factor
- Processing Stages:
  - Model loading is a significant but one-time cost
  - Prompt evaluation is slightly slower than subsequent token generation
  - Sampling is extremely fast compared to evaluation
- Overall Performance:
  - The measured ~9 tokens per second falls well short of the 140 tokens/second figure cited in the hypothesis (the gap is quantified in the sketch below)
  - Single-stream decoding of a 131 GB model is memory-bandwidth bound, so a single-digit rate is not surprising in itself; however, this run used two A100-80GB GPUs rather than consumer hardware, so the source of the discrepancy with the reported figure deserves further investigation
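To put the last point in numbers, the gap between the measured generation rate and the figure from the hypothesis works out as follows (all inputs are taken from the raw benchmark output above):

```bash
# Compare the measured generation speed against the reported figure from the hypothesis.
awk 'BEGIN {
  reported = 140.0            # tokens/s cited in the hypothesis
  measured = 24 / 2.61263     # generation tokens/s from the eval line above
  printf "reported : %6.2f ms/token\n", 1000.0 / reported   # ~7.1 ms
  printf "measured : %6.2f ms/token\n", 1000.0 / measured   # ~108.9 ms
  printf "gap      : %.1fx slower than reported\n", reported / measured
}'
```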
Relevant log output