Hongxin Liu
Hi, you don't need to do all-reduce in your customized models, as all-reduce is already done inside `col_nn.Linear`. See https://github.com/hpcaitech/ColossalAI/blob/91a5999825137ffb4d575b21bf4c6cb41033161a/colossalai/nn/layer/parallel_1d/layers.py#L664
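For illustration, a minimal sketch of a customized module built on `col_nn.Linear` (assuming the `colossalai.nn` API at the linked commit; `MyMLP` and its sizes are made up):

```python
import torch.nn as nn
import colossalai.nn as col_nn

class MyMLP(nn.Module):
    """Custom module: col_nn.Linear performs the tensor-parallel
    all-reduce internally, so no explicit communication is needed here."""

    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.fc1 = col_nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = col_nn.Linear(hidden, dim)

    def forward(self, x):
        # No manual all-reduce; it happens inside col_nn.Linear's forward.
        return self.fc2(self.act(self.fc1(x)))
```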
You can use `torch.softmax(..., dtype=torch.float)` to cast inputs to fp32 as a workaround. We may design a more flexible AMP in the future.
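A minimal sketch of the workaround (tensor shapes are illustrative):

```python
import torch

x = torch.randn(4, 8, dtype=torch.half, device="cuda")  # fp16 activations

# dtype=torch.float casts the input to fp32 before the softmax is computed,
# avoiding fp16 overflow/underflow in the exponentials.
probs = torch.softmax(x, dim=-1, dtype=torch.float)
```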
> Hi, we are refactoring the code. The server and the inference engine currently preempt the CPU, which may lead to lag. This will be fixed soon.
With the same batch size, ZeRO will not be faster than DDP; you need to increase the batch size when using ZeRO.
> The batch_size is not the same. GPU memory is 16 GB, and batch_size was chosen so that memory usage falls between 15 GB and 16 GB. What is compared is the time needed for one epoch over 50K test samples.

You need to set the environment variable `OMP_NUM_THREADS=8` before running ZeRO.
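For reference, a sketch of setting the variable from Python; it must happen before `torch` is imported, since OpenMP reads it at initialization (launching with `OMP_NUM_THREADS=8 python train.py` works equally well):

```python
import os

# Must be set before torch is imported, or OpenMP will ignore it.
os.environ["OMP_NUM_THREADS"] = "8"

import torch  # OpenMP thread count is now fixed at 8
```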
Could you share the `torch.profiler` results? We can then look into the performance bottleneck. My initial guess is that communication is the bottleneck.
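A minimal sketch of collecting such a profile around one training step (the step body is a placeholder):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # run one (or a few) training steps here, e.g.:
    # loss = model(batch); loss.backward(); optimizer.step()
    pass

# Sort by CUDA time to see where the GPU spends its time;
# communication kernels (e.g. nccl all-reduce) will show up here.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```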
For some loss functions, like cross-entropy loss, an fp16 output is OK. Casting the output back to fp32 may increase memory usage during backward, but it loses less precision. You can...
> I understand that fp32 would increase memory footprint, but I don't understand why it would be less precise.

By "loses less precision" I meant it is more precise.
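A minimal sketch of the two options being discussed (shapes and names are illustrative, not from the original thread):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, dtype=torch.half, device="cuda", requires_grad=True)
target = torch.randint(0, 10, (8,), device="cuda")

# Option 1: keep the fp16 output; fine for losses like cross entropy.
loss_fp16 = F.cross_entropy(logits, target)

# Option 2: cast back to fp32 first; backward keeps an extra fp32 copy
# of the activations (more memory), but the loss loses less precision.
loss_fp32 = F.cross_entropy(logits.float(), target)
```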
⚠️ Keep in mind: multiple processes may access the same file, so you should make sure the JSON file is consistent across them.
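One common pattern (a sketch, assuming the process group is already initialized; the file name and config contents are made up): only rank 0 writes the file, and everyone waits at a barrier before reading, so all processes see the same fully written JSON.

```python
import json
import torch.distributed as dist

config = {"lr": 1e-3}  # placeholder contents

# Only rank 0 writes; other ranks wait so they never read a partial file.
if dist.get_rank() == 0:
    with open("config.json", "w") as f:
        json.dump(config, f)
dist.barrier()

with open("config.json") as f:
    config = json.load(f)  # every rank now reads identical contents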
We currently focus on supporting more parallel training methods. Since ZeRO and pipeline parallelism (PP) are both important and useful parallel training methods, we think users may want to use them...