Poor performance of model parallelism (MP) compared with PatrickStar and DeepSpeed
Hello developers.
I found that the performance of the provided MP is not good. I compared it with PatrickStar and DeepSpeed; can you check it with me? See MR #115. BTW, I strongly recommend adding TFLOPS as a performance indicator (a rough sketch of how it could be estimated follows).
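To make that concrete, here is a minimal sketch (not taken from any of the repositories discussed here) of how per-GPU TFLOPS could be derived from the measured step time; the 6 * parameters * tokens approximation, the function name, and the example numbers are my assumptions:

```python
def estimate_tflops_per_gpu(num_params: float, global_batch_size: int,
                            seq_len: int, step_time_s: float,
                            num_gpus: int = 8) -> float:
    """Rough per-GPU TFLOPS for one training step of a transformer.

    Uses the common approximation that forward + backward together cost
    about 6 * parameters * processed_tokens FLOPs; attention FLOPs and
    activation recomputation are ignored, so this is a lower bound.
    """
    tokens = global_batch_size * seq_len
    total_flops = 6.0 * num_params * tokens
    return total_flops / step_time_s / num_gpus / 1e12


# Purely illustrative numbers (sequence length and step time are
# assumptions, not benchmark results): a 4B-parameter model, global
# batch 8, sequence length 1024, 0.5 s per step on 8 GPUs.
print(estimate_tflops_per_gpu(4e9, 8, 1024, 0.5))
```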
Platform: one SuperPod node with 8x A100 GPUs and 1 TB of CPU memory. Abbreviations: BS = batch size, pstar = PatrickStar, deeps = DeepSpeed. Table entries are throughput (batch/elapse, i.e. samples per second). The Xd-Xmp columns use Colossal-AI.
| Model Scale | global BS | 1d-4mp | 1d-8mp | 2d-4mp | 2d-8mp | 3d-4mp | 2.5d-4mp | pstar | deeps | deeps-mp4 | deeps-mp8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4B | 8 | 7.61 | 7.62 | 9.89 | 8.47 | failed | 10.31 | 8.78 | 1.15 | 1.26 | 1.26 |
| 4B | 16 | OOM | OOM | OOM | OOM | OOM | OOM | 16.67 | 2.26 | 2.42 | 2.36 |
| 4B | 128 | OOM | OOM | OOM | OOM | OOM | OOM | 28.39 | 12.51 | 10.80 | OOM |
| 10B | 2 | OOM | 3.62 | OOM | failed | OOM | OOM | - | - | 0.15 | 0.15 |
| 10B | 4 | OOM | 4.66 | OOM | OOM | OOM | OOM | - | - | 0.30 | 0.30 |
| 10B | 128 | OOM | OOM | OOM | OOM | OOM | OOM | 13.43 | OOM | 6.31 | 5.73 |
- As you can see, Colossal-AI's computing efficiency is the lowest of the three solutions at the single-node scale. However, it is very competitive at the same batch size; unfortunately, the achievable batch size severely limits its overall performance.
- 2.5d-4mp is the best on the 4B model with batch size 8, but 1d-8mp generalizes better across configurations.
- Heterogeneous training (as in PatrickStar and DeepSpeed) may be a better solution than a complex MP strategy at the single-node scale.
Hi @feifeibear. Thank you so much for your effort. We would appreciate it if you could also share the configurations used to test the same models with DeepSpeed and PatrickStar. We would like to evaluate and improve performance at a similar node scale as well as at larger scales. BTW, did you try 3d-8mp? 3D parallelism requires the MP size to be a cube number (e.g. 8 = 2^3).
- The DeepSpeed benchmark script: https://github.com/feifeibear/DeepSpeedZeRO3Benchmark
- The PatrickStar script: https://github.com/Tencent/PatrickStar/blob/master/examples/run_transformers.sh

Benchmarking is very easy. For example:
```bash
export SUFFIX="colossal_compare"
env GPU_NUM=8 MODEL_TYPE="GPT" MODEL_NAME=GPT3_10B BS=2 CPU_EBD=0 AMM=1 MSC=1 CACHE=1 SP=0 CS=288 HYB=1 TILING=0 ACT_OFFLOAD=0 SUFFIX=${SUFFIX} bash run_transformers.sh
```
I have uploaded the DeepSpeed and PatrickStar logs to Baidu WangPan... Note that for DeepSpeed, the reported SamplesPerSec is not equal to the 'Throughput' above; you have to calculate it yourself as batch/elapse (see the sketch after the link).
link: https://pan.baidu.com/s/1vEHl0hPuxDb7HjOlpuW-YA?pwd=1mfd code: 1mfd
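For clarity, a minimal sketch of that batch/elapse conversion (the helper name and the example step time are mine, not read from the DeepSpeed logs):

```python
def throughput(global_batch_size: int, step_time_s: float) -> float:
    """Samples per second, i.e. the batch/elapse numbers in the table above."""
    return global_batch_size / step_time_s


# Example: a global batch of 16 with a measured 0.96 s per step gives
# ~16.7 samples/s, consistent with the 4B / BS=16 pstar entry above.
print(throughput(16, 0.96))
```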
@feifeibear Thank you!
This issue is stale because it has been open for 14 days with no activity.
Thanks for your report; detailed tests with stable code will come soon.
We have updated a lot. This issue was closed due to inactivity. Thanks.