KAKSIS


I encountered the same issue during 14B-MoE training. It appears that DeepSpeed versions 0.16.0 and above require more GPU memory.
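If the extra memory usage in newer releases is the blocker, one possible workaround (a sketch, not confirmed as a fix in this thread) is to pin DeepSpeed to a release below 0.16.0:

```shell
# Hypothetical workaround: pin DeepSpeed below 0.16.0, the version at which
# the increased GPU memory usage was observed.
pip install "deepspeed<0.16.0"

# Confirm which version ended up installed.
python -c "import deepspeed; print(deepspeed.__version__)"
```

Whether the older release is compatible with the rest of the training stack (e.g. the TRL/Transformers versions in use) would need to be checked separately.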

I encountered a similar issue while training a 72B model on an 8x H100 (80 GB) setup. I’m using the Hugging Face online DPO trainer scripts from [this link](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). To reduce...