Poor performance of model parallelism (MP) compared with PatrickStar and DeepSpeed
Hello developers.
I found that the performance of the provided MP is not good. I compared it with PatrickStar and DeepSpeed; can you check it with me? See MR #115. BTW, I strongly recommend adding TFLOPS as a performance indicator (a rough sketch of how it could be estimated follows).
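To make that concrete, here is a minimal sketch (not taken from any of the repositories discussed here) of how per-GPU TFLOPS could be derived from the measured step time; the 6 * parameters * tokens approximation, the function name, and the example numbers are my assumptions:

```python
def estimate_tflops_per_gpu(num_params: float, global_batch_size: int,
                            seq_len: int, step_time_s: float,
                            num_gpus: int = 8) -> float:
    """Rough per-GPU TFLOPS for one training step of a transformer.

    Uses the common approximation that forward + backward together cost
    about 6 * parameters * processed_tokens FLOPs; attention FLOPs and
    activation recomputation are ignored, so this is a lower bound.
    """
    tokens = global_batch_size * seq_len
    total_flops = 6.0 * num_params * tokens
    return total_flops / step_time_s / num_gpus / 1e12


# Purely illustrative numbers (sequence length and step time are
# assumptions, not benchmark results): a 4B-parameter model, global
# batch 8, sequence length 1024, 0.5 s per step on 8 GPUs.
print(estimate_tflops_per_gpu(4e9, 8, 1024, 0.5))
```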
Platform: one SuperPod node with 8x A100 GPUs and 1 TB of CPU memory. Abbreviations: BS = batch size, pstar = PatrickStar, deeps = DeepSpeed. Table entries are throughput (batch/elapse, i.e. samples per second). The Xd-Xmp columns use Colossal-AI.
| Model Scale | global BS | 1d-4mp | 1d-8mp | 2d-4mp | 2d-8mp | 3d-4mp | 2.5d-4mp | pstar | deeps | deeps-mp4 | deeps-mp8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4B | 8 | 7.61 | 7.62 | 9.89 | 8.47 | failed | 10.31 | 8.78 | 1.15 | 1.26 | 1.26 |
| 4B | 16 | OOM | OOM | OOM | OOM | OOM | OOM | 16.67 | 2.26 | 2.42 | 2.36 |
| 4B | 128 | OOM | OOM | OOM | OOM | OOM | OOM | 28.39 | 12.51 | 10.80 | OOM |
| 10B | 2 | OOM | 3.62 | OOM | failed | OOM | OOM | - | - | 0.15 | 0.15 |
| 10B | 4 | OOM | 4.66 | OOM | OOM | OOM | OOM | - | - | 0.30 | 0.30 |
| 10B | 128 | OOM | OOM | OOM | OOM | OOM | OOM | 13.43 | OOM | 6.31 | 5.73 |
- As you can see, Colossal-AI's computing efficiency is the lowest of the three solutions at the single-node scale. However, it is very competitive at the same batch size; unfortunately, the achievable batch size severely limits its overall performance.
- 2.5d-4mp is the best on the 4B model with batch size 8, but 1d-8mp generalizes better across configurations.
- Heterogeneous training (as in PatrickStar and DeepSpeed) may be a better solution than a complex MP strategy at the single-node scale.
Hi @feifeibear. Thank you so much for your effort. We would appreciate it if you could also share the configurations used to test the same models with DeepSpeed and PatrickStar. We would like to evaluate and improve performance at a similar node scale as well as at larger scales. BTW, did you try 3d-8mp? 3D parallelism requires the MP size to be a cube number (e.g. 8 = 2^3).
- The DeepSpeed benchmark script: https://github.com/feifeibear/DeepSpeedZeRO3Benchmark
- The PatrickStar script: https://github.com/Tencent/PatrickStar/blob/master/examples/run_transformers.sh

Benchmarking is very easy. For example:
```bash
export SUFFIX="colossal_compare"
env GPU_NUM=8 MODEL_TYPE="GPT" MODEL_NAME=GPT3_10B BS=2 CPU_EBD=0 AMM=1 MSC=1 CACHE=1 SP=0 CS=288 HYB=1 TILING=0 ACT_OFFLOAD=0 SUFFIX=${SUFFIX} bash run_transformers.sh
```
I have uploaded the DeepSpeed and PatrickStar logs to Baidu WangPan... Note that for DeepSpeed, the reported SamplesPerSec is not equal to the 'Throughput' above; you have to calculate it yourself as batch/elapse (see the sketch after the link).
link: https://pan.baidu.com/s/1vEHl0hPuxDb7HjOlpuW-YA?pwd=1mfd code: 1mfd
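For clarity, a minimal sketch of that batch/elapse conversion (the helper name and the example step time are mine, not read from the DeepSpeed logs):

```python
def throughput(global_batch_size: int, step_time_s: float) -> float:
    """Samples per second, i.e. the batch/elapse numbers in the table above."""
    return global_batch_size / step_time_s


# Example: a global batch of 16 with a measured 0.96 s per step gives
# ~16.7 samples/s, consistent with the 4B / BS=16 pstar entry above.
print(throughput(16, 0.96))
```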
@feifeibear Thank you!
This issue is stale because it has been open for 14 days with no activity.
Thanks for your report; detailed tests with stable code will come soon.
We have updated a lot. This issue was closed due to inactivity. Thanks.