Not able to see scaling performance on NUC (12th Gen) with deepseek_r1_distill_llama_8b_q40
I am trying to reproduce the results on NUCs, but the number of tokens/sec drops when I add more nodes. Any help?
System: 4x NUC (12th Gen) with AVX2 support.
1x NUC (12th Gen) with AVX2 support:

```
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77
```

Evaluation: nBatches: 32, nTokens: 7, tokens/s: 14.96 (66.86 ms/tok)
Prediction: nTokens: 70, tokens/s: 5.51 (181.43 ms/tok)
2x NUC (12th Gen) with AVX2 support:

```
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998
```

Evaluation: nBatches: 32, nTokens: 7, tokens/s: 9.25 (108.14 ms/tok)
Prediction: nTokens: 70, tokens/s: 5.96 (167.91 ms/tok)
4x NUC (12th Gen) with AVX2 support:

```
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998 10.10.10.4:9998 10.10.10.5:9998
```

Evaluation: nBatches: 32, nTokens: 7, tokens/s: 6.74 (148.29 ms/tok)
Prediction: nTokens: 70, tokens/s: 5.02 (199.27 ms/tok)
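A quick back-of-envelope check on the evaluation numbers above suggests where the time goes. Assuming local compute alone would split roughly linearly across nodes, anything above `base_ms / n` per token is sync/communication overhead (this is a rough sketch, not a profiler measurement):

```shell
# Evaluation ms/token values are taken from the logs above.
base_ms=66.86  # 1-node evaluation ms/token
for pair in "1 66.86" "2 108.14" "4 148.29"; do
  set -- $pair
  awk -v n="$1" -v t="$2" -v b="$base_ms" \
    'BEGIN { printf "%d node(s): ~%.1f ms/token overhead\n", n, t - b / n }'
done
```

With 2 nodes this comes out to roughly 75 ms/token of overhead, which would dominate the ~33 ms of per-node compute and explain why throughput drops instead of scaling.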
Any help here? Is this expected?
How did you start the workers?
On each worker, I ran:

```
./dllama worker --port 9998 --nthreads 8
```

On the root node:

```
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998 10.10.10.4:9998 10.10.10.5:9998
```
Hello @deepaks2,
please upgrade Distributed Llama to 0.12.8 and post the logs from inference mode here. That version reports the time spent on inference and on synchronization separately.
@b4rtaz Thanks, I will share the details.
@b4rtaz Please find the logs:
2x NUC (12th Gen) with AVX2 support -->
4x NUC (12th Gen) with AVX2 support -->
All 4 NUCs are connected via a switch.
It seems that synchronization over Ethernet is very slow. Maybe you should try connecting the two devices directly without a router and compare the results. If I see correctly, the NUC 12th Gen should have 2.5G Ethernet. Thunderbolt 4 can also be used for networking, but it is not easy to configure (I haven't tried it myself).
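To put the 2.5G link in perspective, here is the bandwidth side of the cost as pure arithmetic: the minimum time on the wire per megabyte of synchronized data, ignoring latency and protocol overhead (the actual per-token sync volume depends on the model, the `--buffer-float-type`, and the node count, so this is only a floor, not a measurement):

```shell
# 1 MB = 8 Mbit, so wire time in ms = 8 / (link speed in Gbit/s).
for gbit in 1 2.5 10; do
  awk -v g="$gbit" 'BEGIN { printf "%gG link: %.2f ms per MB\n", g, 8 / g }'
done
```

On top of this floor, every sync round also pays the link's round-trip latency, which is why a direct connection (or a lower-latency fabric like Thunderbolt) can help even when bandwidth is nowhere near saturated.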
Thanks @b4rtaz. I tried connecting two devices directly without a router and the results are slightly better.
I see only slightly better results: 5.98 tokens/sec (with router) vs. 6.27 tokens/sec (direct), and only about a 10 ms difference in sync time.
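As a sanity check using only those two throughput figures, the per-token time saved by going direct is consistent with a sync improvement of under 10 ms:

```shell
# 5.98 tok/s via router vs 6.27 tok/s direct -> per-token time difference in ms.
awk 'BEGIN { printf "per-token gain: %.1f ms\n", 1000 / 5.98 - 1000 / 6.27 }'
# prints: per-token gain: 7.7 ms
```

So the two measurements agree: most of the per-token cost here is not the router hop itself but the sync rounds over Ethernet, which a direct cable only trims slightly.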