Hongxin Liu
Hi, you don't need to do all-reduce in your customized models, as all-reduce is already done inside `col_nn.Linear`. See https://github.com/hpcaitech/ColossalAI/blob/91a5999825137ffb4d575b21bf4c6cb41033161a/colossalai/nn/layer/parallel_1d/layers.py#L664
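For illustration, a minimal sketch of a customized module built on `col_nn.Linear` (assuming the `colossalai.nn` API at the linked commit; `MyMLP` and its sizes are made up):

```python
import torch.nn as nn
import colossalai.nn as col_nn

class MyMLP(nn.Module):
    """Custom module: col_nn.Linear performs the tensor-parallel
    all-reduce internally, so no explicit communication is needed here."""

    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.fc1 = col_nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = col_nn.Linear(hidden, dim)

    def forward(self, x):
        # No manual all-reduce; it happens inside col_nn.Linear's forward.
        return self.fc2(self.act(self.fc1(x)))
```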
You can use `torch.softmax(..., dtype=torch.float)` to cast inputs to fp32 as a workaround. We may design a more flexible AMP in the future.
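A minimal sketch of the workaround (tensor shapes are illustrative):

```python
import torch

x = torch.randn(4, 8, dtype=torch.half, device="cuda")  # fp16 activations

# dtype=torch.float casts the input to fp32 before the softmax is computed,
# avoiding fp16 overflow/underflow in the exponentials.
probs = torch.softmax(x, dim=-1, dtype=torch.float)
```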
> Hi, we are refactoring the code. The server and the inference engine currently preempt the CPU, which may lead to lag. This will be fixed soon.
With the same batch size, ZeRO will not be faster than DDP; you need to increase the batch size when using ZeRO.
> The batch_size is not the same. GPU memory is 16 GB, and batch_size was chosen so that memory usage falls between 15 GB and 16 GB. What is compared is the time needed for one epoch over 50K test samples.

You need to set the environment variable `OMP_NUM_THREADS=8` before running ZeRO.
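For reference, a sketch of setting the variable from Python; it must happen before `torch` is imported, since OpenMP reads it at initialization (launching with `OMP_NUM_THREADS=8 python train.py` works equally well):

```python
import os

# Must be set before torch is imported, or OpenMP will ignore it.
os.environ["OMP_NUM_THREADS"] = "8"

import torch  # OpenMP thread count is now fixed at 8
```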
Could you share the `torch.profiler` results? We can then look into the performance bottleneck. My initial guess is that communication is the bottleneck.
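A minimal sketch of collecting such a profile around one training step (the step body is a placeholder):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # run one (or a few) training steps here, e.g.:
    # loss = model(batch); loss.backward(); optimizer.step()
    pass

# Sort by CUDA time to see where the GPU spends its time;
# communication kernels (e.g. nccl all-reduce) will show up here.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```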
For some loss functions, like cross-entropy loss, an fp16 output is OK. Casting the output back to fp32 may increase memory usage during backward, but it loses less precision. You can...
> I understand that fp32 would increase memory footprint, but I don't understand why it would be less precise.

By "loses less precision" I meant it is more precise.
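A minimal sketch of the two options being discussed (shapes and names are illustrative, not from the original thread):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, dtype=torch.half, device="cuda", requires_grad=True)
target = torch.randint(0, 10, (8,), device="cuda")

# Option 1: keep the fp16 output; fine for losses like cross entropy.
loss_fp16 = F.cross_entropy(logits, target)

# Option 2: cast back to fp32 first; backward keeps an extra fp32 copy
# of the activations (more memory), but the loss loses less precision.
loss_fp32 = F.cross_entropy(logits.float(), target)
```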
⚠️ Keep in mind: multiple processes may access the same file, so you should make sure the JSON file is consistent across them.
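One common pattern (a sketch, assuming the process group is already initialized; the file name and config contents are made up): only rank 0 writes the file, and everyone waits at a barrier before reading, so all processes see the same fully written JSON.

```python
import json
import torch.distributed as dist

config = {"lr": 1e-3}  # placeholder contents

# Only rank 0 writes; other ranks wait so they never read a partial file.
if dist.get_rank() == 0:
    with open("config.json", "w") as f:
        json.dump(config, f)
dist.barrier()

with open("config.json") as f:
    config = json.load(f)  # every rank now reads identical contents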
We currently focus on supporting more parallel training methods. Since ZeRO and pipeline parallelism (PP) are both important and useful parallel training methods, we think users may want to use them...