zuijiang

Results 3 issues of zuijiang

Hi @anicolson! I've been training with mhanet, and the loss stays at around 0.37 the whole way through. I noticed you uploaded some loss info for the ResNet-based networks: https://github.com/anicolson/DeepXi/tree/master/log/loss Would you...

**Describe the bug** I'm using Deepspeed-Megatron. With pipeline parallelism enabled, setting

```
"bf16": { "enabled": "auto" }
```

steps into the `NotImplementedError` in

```python
# /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py
def _exec_reduce_grads(self):
    self._force_grad_boundary...
```

bug
training

ENV
- torch 2.1.2
- flash-attn 2.5.8
- cuda 11.7

ERROR `flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv`
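
For anyone reading the error above: the mangled symbol follows the Itanium C++ ABI, and its nested-name part decodes to `c10::impl::cow::cow_deleter`, which sits in the `c10` namespace used by PyTorch's core library. An undefined symbol like this often means the flash-attn binary was built against a different torch build than the one installed, though that is an assumption here and not something the report confirms. The sketch below is a rough diagnostic only: `nested_name` and `installed_version` are hypothetical helper names, and the decoder handles only plain length-prefixed names (no templates or substitutions).

```python
from importlib import metadata


def nested_name(symbol: str) -> str:
    """Decode only the `_ZN<len><name>...E` part of an Itanium-mangled symbol."""
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos, parts = 3, []
    while symbol[pos] != "E":
        # Each component is a decimal length followed by that many characters.
        start = pos
        while symbol[pos].isdigit():
            pos += 1
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    return "::".join(parts)


def installed_version(package: str) -> str:
    """Return the installed version of a package without importing it."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    print(nested_name("_ZN3c104impl3cow11cow_deleterEPv"))
    # Compare these against the versions the flash-attn wheel was built for.
    for package in ("torch", "flash-attn"):
        print(package, installed_version(package))
```

Reading versions through `importlib.metadata` avoids importing `flash_attn` itself, which is the step that fails in the report.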