
Not able to see scaling performance with NUC (12th Gen) with deepseek_r1_distill_llama_8b_q40

Open deepaks2 opened this issue 9 months ago • 7 comments

I am trying to reproduce the results on NUCs, but the tokens/sec drops when I add more nodes. Any help?

System: 4x NUC (12th Gen) with AVX2 support.

1x NUC (12th Gen) with AVX2 support:

./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77

Evaluation nBatches: 32 nTokens: 7 tokens/s: 14.96 (66.86 ms/tok)
Prediction nTokens: 70 tokens/s: 5.51 (181.43 ms/tok)

2x NUC (12th Gen) with AVX2 support:

./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998

Evaluation nBatches: 32 nTokens: 7 tokens/s: 9.25 (108.14 ms/tok)
Prediction nTokens: 70 tokens/s: 5.96 (167.91 ms/tok)

4x NUC (12th Gen) with AVX2 support:

./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998 10.10.10.4:9998 10.10.10.5:9998

Evaluation nBatches: 32 nTokens: 7 tokens/s: 6.74 (148.29 ms/tok)
Prediction nTokens: 70 tokens/s: 5.02 (199.27 ms/tok)
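A quick sanity check (my own sketch, not part of the original report): the tokens/s and ms/tok figures are reciprocals, and recomputing prediction throughput from the ms/tok values above confirms the reported drop when going from 2 to 4 nodes:

```python
# Recompute prediction tokens/s from the reported ms/tok figures.
# tokens/s = 1000 / (ms per token)
runs = {
    "1 node":  181.43,  # reported 5.51 tok/s
    "2 nodes": 167.91,  # reported 5.96 tok/s
    "4 nodes": 199.27,  # reported 5.02 tok/s
}
for name, ms_per_tok in runs.items():
    print(f"{name}: {1000 / ms_per_tok:.2f} tok/s")
```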

Any help here? Is this expected?

deepaks2 · Mar 03 '25 14:03

How did you start the workers?

D-i-t-gh · Mar 03 '25 15:03

On each worker node, I ran:

./dllama worker --port 9998 --nthreads 8

On the root node:

./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998 10.10.10.4:9998 10.10.10.5:9998

deepaks2 · Mar 04 '25 08:03

Hello @deepaks2,

Please upgrade DL to 0.12.8 and post the inference-mode logs here. This version shows the time spent on inference and on synchronization.

b4rtaz · Mar 04 '25 10:03

@b4rtaz Thanks, I will share the details.

deepaks2 · Mar 05 '25 05:03

@b4rtaz Please find the logs

2x NUC (12th Gen) with AVX2 support:

[image: inference logs]

4x NUC (12th Gen) with AVX2 support:

[image: inference logs]

All 4 NUCs are connected via a switch.

deepaks2 · Mar 05 '25 13:03

It seems that synchronization over Ethernet is very slow. Maybe you should try connecting the two devices directly without a router and compare the results. If I see correctly, the NUC 12th Gen should have 2.5G Ethernet. Thunderbolt 4 can also be used for networking, but it is not easy to configure (I haven't tried it myself).
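A toy model of why this happens (my own sketch with hypothetical numbers, not from this thread): per-token compute shrinks as it is split across nodes, but every token also pays a synchronization cost that grows with the number of links. When sync cost per link is high, adding nodes lowers throughput.

```python
# Simple "compute splits, sync adds" model of per-token latency.
# compute_ms and sync_ms_per_link are assumed illustrative values.
def tok_per_sec(compute_ms, sync_ms_per_link, n_nodes):
    """Tokens/s when compute is divided by n and sync scales with links."""
    total_ms = compute_ms / n_nodes + sync_ms_per_link * (n_nodes - 1)
    return 1000.0 / total_ms

# 180 ms/tok is roughly the single-NUC result above; sync costs are made up.
for sync in (5.0, 30.0, 60.0):
    print(f"sync={sync} ms/link:",
          [round(tok_per_sec(180.0, sync, n), 2) for n in (1, 2, 4)])
```

With a cheap link (5 ms) throughput scales up with nodes; with a 60 ms link, 4 nodes are slower than 1, which is the pattern in the logs above.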

b4rtaz avatar Mar 05 '25 17:03 b4rtaz

Thanks @b4rtaz. I tried connecting two devices directly without a router and the results are slightly better; it improved by about 1 token/sec.

[image: inference logs]

I see only slightly better results: 5.98 tokens/sec (with router) vs 6.27 tokens/sec (direct). I see only a ~10 ms difference in sync.
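Those two observations are consistent (a quick check of my own, not from the thread): the throughput change corresponds to a per-token time reduction of roughly 8 ms, close to the ~10 ms sync difference seen in the logs.

```python
# Convert the two throughputs to ms/token and compare.
with_router = 1000 / 5.98  # ms per token via router
direct      = 1000 / 6.27  # ms per token via direct link
saved_ms = with_router - direct
print(f"per-token time saved: {saved_ms:.1f} ms")  # ~7.7 ms
```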

deepaks2 · Mar 06 '25 08:03