
The performance of model parallelism (MP) is not good

Open feifeibear opened this issue 3 years ago • 6 comments

Hello developers.

I found that the performance of the MP provided here is not good. I compared it with PatrickStar and DeepSpeed; can you check it with me? See MR #115. BTW: I strongly recommend adding TFLOPS as a performance indicator.

Platform: one SuperPod node with 8x A100 GPUs and 1 TB of CPU memory. BS = batch size, pstar = PatrickStar, deeps = DeepSpeed. Entries are throughput (batch/elapse); the Xd-Xmp columns use Colossal-AI.

| Model Scale | Global BS | 1d-4mp | 1d-8mp | 2d-4mp | 2d-8mp | 3d-4mp | 2.5d-4mp | pstar | deeps | deeps-mp4 | deeps-mp8 |
| ----------- | --------- | ------ | ------ | ------ | ------ | ------ | -------- | ----- | ----- | --------- | --------- |
| 4B          | 8         | 7.61   | 7.62   | 9.89   | 8.47   | failed | 10.31    | 8.78  | 1.15  | 1.26      | 1.26      |
| 4B          | 16        | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 16.67 | 2.26  | 2.42      | 2.36      |
| 4B          | 128       | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 28.39 | 12.51 | 10.80     | OOM       |
| 10B         | 2         | OOM    | 3.62   | OOM    | failed | OOM    | OOM      | -     | -     | 0.15      | 0.15      |
| 10B         | 4         | OOM    | 4.66   | OOM    | OOM    | OOM    | OOM      | -     | -     | 0.30      | 0.30      |
| 10B         | 128       | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 13.43 | OOM   | 6.31      | 5.73      |
  1. As you can see, Colossal-AI's computing efficiency is the lowest of the three solutions at the single-node scale. Colossal-AI is competitive at the same batch size, but the small maximum batch size it supports severely limits its performance.
  2. 2.5d-4mp is the best configuration at 4B/BS=8, but 1d-8mp generalizes better across settings.
  3. At the single-node scale, heterogeneous training (as in PatrickStar and DeepSpeed) may be a better solution than a complex MP strategy.
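To follow up on the TFLOPS suggestion above, here is a minimal sketch of one common estimate: roughly 6 FLOPs per parameter per token for a combined forward+backward pass, without activation recomputation. The sequence length and helper name below are my assumptions for illustration, not values taken from the runs above.

```python
def achieved_tflops(samples_per_sec: float, seq_len: int,
                    n_params: float, n_gpus: int) -> float:
    """Estimate per-GPU TFLOPS from measured throughput.

    Uses the common approximation of ~6 FLOPs per parameter per token
    for a forward+backward pass (no activation recomputation).
    """
    flops_per_sec = 6 * n_params * samples_per_sec * seq_len
    return flops_per_sec / n_gpus / 1e12

# Hypothetical example: 4B-parameter model, assumed seq_len of 1024,
# 8 GPUs, 8.78 samples/s (the pstar entry for 4B/BS=8 above).
print(round(achieved_tflops(8.78, 1024, 4e9, 8), 1))  # → 27.0
```

With such an indicator, the configurations above could be compared against the hardware's peak FLOPS rather than only against each other.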

feifeibear avatar Jan 06 '22 11:01 feifeibear

Hi @feifeibear. Thank you so much for your effort. We would appreciate it if you could also share the configurations used to test the same models with DeepSpeed and PatrickStar. We would like to evaluate and improve performance at a similar node scale as well as at larger scales. BTW, did you try 3d-8mp? 3d requires the mp degree to be a perfect cube.
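For reference, the cube constraint mentioned above can be checked with a quick sketch (a hypothetical helper, not part of the Colossal-AI API): 3D tensor parallelism partitions along three dimensions, so mp=8 (2^3) is valid while mp=4 is not.

```python
def valid_3d_mp(mp_degree: int) -> bool:
    """Return True if mp_degree is a perfect cube, as required
    by 3D tensor parallelism."""
    side = round(mp_degree ** (1 / 3))
    return side ** 3 == mp_degree

print(valid_3d_mp(8))  # → True  (8 = 2**3)
print(valid_3d_mp(4))  # → False (4 is not a cube)
```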

kurisusnowdeng avatar Jan 06 '22 12:01 kurisusnowdeng

The DeepSpeed benchmark script: https://github.com/feifeibear/DeepSpeedZeRO3Benchmark The PatrickStar script: https://github.com/Tencent/PatrickStar/blob/master/examples/run_transformers.sh Benchmarking is straightforward:

export SUFFIX="colossal_compare"
# Run PatrickStar's run_transformers.sh for a 10B GPT model on 8 GPUs with BS=2
env GPU_NUM=8 MODEL_TYPE="GPT" MODEL_NAME=GPT3_10B BS=2 CPU_EBD=0 AMM=1 MSC=1 CACHE=1 SP=0 CS=288 HYB=1 TILING=0 ACT_OFFLOAD=0 SUFFIX=${SUFFIX} bash run_transformers.sh

feifeibear avatar Jan 07 '22 03:01 feifeibear

I have uploaded the logs of DeepSpeed and PatrickStar to Baidu Wangpan. Note that for DeepSpeed, the reported SamplesPerSec is not equal to 'Throughput'; you have to calculate it as batch/elapse.

link: https://pan.baidu.com/s/1vEHl0hPuxDb7HjOlpuW-YA?pwd=1mfd code: 1mfd

feifeibear avatar Jan 07 '22 03:01 feifeibear

@feifeibear Thank you!

kurisusnowdeng avatar Jan 07 '22 03:01 kurisusnowdeng

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] avatar Jan 22 '22 00:01 github-actions[bot]

Thanks for your report; detailed tests with stable code will come soon.

binmakeswell avatar Apr 13 '22 04:04 binmakeswell

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 03:04 binmakeswell