
[FEATURE]: Can you add the comparison code to the README?

Open wjizhong opened this issue 2 years ago • 8 comments

Describe the feature

I have tested this framework. Memory usage is reduced, but speed has not improved. Could you add the comparison code examples to the README?

wjizhong avatar Aug 09 '22 07:08 wjizhong

Hi @wjizhong All example code can be found at https://github.com/hpcaitech/ColossalAI-Examples

Other factors such as hardware architecture, network communication, and parameter settings all have an impact on performance; even identical code can perform differently on different hardware. Perhaps you can post details of your environment and parameter configuration to help us analyze your issue.

We are also working on improving the level of automation and it will be updated in the near future.

binmakeswell avatar Aug 10 '22 07:08 binmakeswell

Test environment: CentOS 7, Tesla P100 GPU, driver version 470.82, CUDA 11.3, PyTorch 1.12.1, pytorch-lightning 1.7.0, colossalai 0.1.8+torch1.12cu11.3.

The framework compared against is Lightning, with the model `BertModel.from_pretrained("chinese-roberta-wwm-ext-large")` and the trainer configured as:

```python
trainer = pl.Trainer(
    default_root_dir=args.output_path,
    gradient_clip_val=1,
    accumulate_grad_batches=1,
    max_epochs=int(args.epochs),
    gpus=[int(item) for item in args.gpus.split("|")],
    strategy="ddp_sharded",
    limit_val_batches=0.0,
    precision=16,
    enable_progress_bar=True,
)
```

The reference file for comparison is features/zero/train.py. [screenshot attached]

wjizhong avatar Aug 10 '22 11:08 wjizhong

Training on 8 GPUs, there is no obvious speedup in wall-clock time.

wjizhong avatar Aug 10 '22 11:08 wjizhong

At the same batch size, ZeRO will not be faster than DDP; you need to increase the batch size when using ZeRO.

ver217 avatar Aug 11 '22 04:08 ver217

The batch sizes are not the same. The GPU has 16 GB of memory, and the batch size was chosen so that memory usage falls between 15 GB and 16 GB. What I compared is the time needed for one epoch over 50,000 test samples.

wjizhong avatar Aug 11 '22 06:08 wjizhong

> The batch sizes are not the same. The GPU has 16 GB of memory, and the batch size was chosen so that memory usage falls between 15 GB and 16 GB. What I compared is the time needed for one epoch over 50,000 test samples.

You need to set the environment variable OMP_NUM_THREADS=8 before running ZeRO.

ver217 avatar Aug 11 '22 08:08 ver217
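The suggestion above can be sketched as a launch command. This is an assumed setup, not from the thread: `torchrun` (PyTorch >= 1.10) launching one process per GPU, with `features/zero/train.py` as the training script mentioned earlier.

```shell
# Set OMP_NUM_THREADS before launching ZeRO training, per the advice above.
export OMP_NUM_THREADS=8      # CPU threads available to each process
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"

# Assumption: torchrun starts one worker per GPU (8 GPUs in this thread).
# torchrun --nproc_per_node=8 features/zero/train.py
```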

I set that too, and even tried a much higher value of 54. CPU utilization went up, but one epoch is still not fast.

wjizhong avatar Aug 11 '22 09:08 wjizhong

Could you provide torch.profiler results? We can then look into the performance bottleneck. My initial guess is that communication is the bottleneck.

ver217 avatar Aug 11 '22 09:08 ver217
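For reference, here is a minimal `torch.profiler` sketch of the kind of trace being requested. The toy matmul loop is a placeholder standing in for a real training step; on GPU you would also pass `ProfilerActivity.CUDA` in `activities`.

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(256, 256)
w = torch.randn(256, 256)

# Profile CPU activity over a few iterations of a toy workload.
# In practice, wrap one or more real training iterations instead.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(10):
        y = x @ w

# Sort by total CPU time to surface the most expensive ops.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

Attaching the printed table (or a trace exported via `prof.export_chrome_trace(...)`) would let the maintainers check whether communication dominates.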

We have made many updates since then. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell