Boxiang Wang
Hi, could you provide your training code for us to reproduce this bug? Besides, could you double-check your dataset settings?
I have tried our code with a simple change of model from resnet to shufflenet. It takes about 32521 MiB with `BATCH_SIZE = 16384`, and no OOM occurred.
Hi @songyuc, you can uninstall your current `colossalai` and install our latest version with
````
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai...
````
Have you tried modifying the [.wslconfig](https://learn.microsoft.com/en-us/windows/wsl/wsl-config) file to give WSL more memory and more processors? It works for me.
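For reference, a minimal sketch of what such a `.wslconfig` might look like, assuming WSL 2 and a file placed at `%UserProfile%\.wslconfig`. The values below are placeholders, not the ones I used, so adjust them to your machine:

```ini
# %UserProfile%\.wslconfig
# Placeholder values: size these to your host's RAM and core count.
[wsl2]
memory=16GB     # upper limit on RAM available to the WSL 2 VM
processors=8    # number of logical processors available to the WSL 2 VM
```

The settings take effect only after WSL restarts, for example by running `wsl --shutdown` from Windows and then reopening the distro.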
Yes, this was an NVbug about NeMo 1.0. We are not going to save `.nemo` files in 2.0 for now.
@maanug-nv Can you help approve this again? It just passed all tests.
I don't think this change can be applied generally to all kinds of model loading. Maybe it should be added per customer need.
Hi, thanks for raising this issue. We were aware of this bug and have already come up with a fix for the 0.11 release. It will further be integrated with other pos_emb...
Hi @yzlnew, it should be fixed with https://github.com/NVIDIA/Megatron-LM/blob/00efe37a85194a521789778ae47299ce8c054dc0/megatron/core/transformer/multi_latent_attention.py#L363