ColossalAI
[FEATURE]: Can you add the comparison code to the README?
Describe the feature
I have tested this framework: memory usage is indeed reduced, but training speed has not improved. Could you add the comparison code examples to the README?
Hi @wjizhong, all example code can be found at https://github.com/hpcaitech/ColossalAI-Examples
Other factors such as hardware architecture, network communication, and parameter settings all have an impact on performance, and even identical code can behave differently on different hardware. Perhaps you could post the details of your environment and parameter configuration to help us analyze the issue.
We are also working on improving the level of automation, and it will be updated in the near future.
The test environment is CentOS 7 with a Tesla P100 GPU, driver version 470.82, CUDA 11.3, PyTorch 1.12.1, pytorch-lightning 1.7.0, and colossalai 0.1.8+torch1.12cu11.3.
The framework being compared against is Lightning. The model is BertModel.from_pretrained("chinese-roberta-wwm-ext-large"), and the trainer is configured as:

```python
trainer = pl.Trainer(
    default_root_dir=args.output_path,
    gradient_clip_val=1,
    accumulate_grad_batches=1,
    max_epochs=int(args.epochs),
    gpus=[int(item) for item in args.gpus.split("|")],
    strategy="ddp_sharded",
    limit_val_batches=0.0,
    precision=16,
    enable_progress_bar=True,
)
```
The reference file used for comparison is features/zero/train.py.
When training on 8 GPUs, there is no noticeable speedup in wall-clock time.
With the same batch size, ZeRO will not be faster than DDP; you need to increase the batch size when using ZeRO.
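To make that concrete, here is a minimal sketch (the batch sizes are hypothetical, not from the actual runs) of how spending ZeRO's freed memory on a larger per-GPU batch changes the work per epoch:

```python
# Hypothetical numbers for illustration only; the real values depend on the
# model and the 16 GB memory budget described in this thread.
gpus = 8

# DDP baseline: per-GPU batch limited by holding full optimizer states,
# gradients, and parameters on every GPU.
ddp_per_gpu_batch = 8
ddp_global_batch = ddp_per_gpu_batch * gpus          # 64 samples per step

# ZeRO: optimizer states (and optionally gradients/parameters) are sharded,
# freeing memory, so each GPU can process more samples per step. With the
# SAME batch size, ZeRO only adds extra communication and cannot be faster.
zero_per_gpu_batch = 16
zero_global_batch = zero_per_gpu_batch * gpus        # 128 samples per step

samples_per_epoch = 50_000
print("DDP  steps/epoch:", samples_per_epoch // ddp_global_batch)
print("ZeRO steps/epoch:", samples_per_epoch // zero_global_batch)
```

Fewer, larger steps per epoch amortize the extra per-step communication, which is where ZeRO's speedup has to come from.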
The batch sizes are not the same. The GPUs have 16 GB of memory, and the batch size was chosen so that memory usage falls between 15 GB and 16 GB. What is being compared is the time needed for one epoch over the 50,000 test samples.
Before running ZeRO, you need to set the environment variable OMP_NUM_THREADS=8.
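For reference, a minimal sketch of setting that variable programmatically when exporting it in the shell is inconvenient; 8 is just the value suggested above, and the right number depends on CPU cores per GPU:

```python
import os

# Must be set before torch is imported so the intra-op thread pool picks it up.
# A common rule of thumb is (physical CPU cores) / (GPUs per node).
os.environ.setdefault("OMP_NUM_THREADS", "8")

import torch

print("intra-op threads:", torch.get_num_threads())
```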
I set that as well, and even raised it to 54. CPU utilization does go up, but one epoch is still not faster.
Could you share torch.profiler results? That would let us look into the performance bottleneck. My initial guess is that communication is the bottleneck.
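For anyone hitting the same issue, here is a minimal sketch of collecting such a trace; the train_step function and log directory are placeholders, not taken from features/zero/train.py. NCCL communication kernels dominating the summary table would confirm the guess above.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


def run_profiled_steps(train_step, dataloader, log_dir="./profiler_logs"):
    """Profile a handful of steps; train_step(batch) is assumed to do
    forward, backward, and optimizer.step()."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=5),
        on_trace_ready=tensorboard_trace_handler(log_dir),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for step, batch in enumerate(dataloader):
            train_step(batch)
            prof.step()        # advance the profiler schedule
            if step >= 10:     # a few active steps are enough for a trace
                break

    # Text summary of the hottest ops; communication kernels (e.g. NCCL
    # all-gather/reduce-scatter) appearing at the top would point to a
    # communication bottleneck.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

The trace written to log_dir can then be inspected in TensorBoard with the PyTorch Profiler plugin.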
We have updated the project a lot since then. This issue was closed due to inactivity. Thanks.