dumpmemory


> FSDP is already supported out of the box. Unsure if we need to support deepspeed?

We need DeepSpeed for sure.

> @dumpmemory Can you elaborate? What's your usecase?

Usually we use DeepSpeed's ZeRO 2 or 3 to train large models, and for the small ones, we also use...
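For context, a minimal sketch of the kind of ZeRO configuration meant here; the values are illustrative placeholders, not taken from any particular run:

```python
# Minimal DeepSpeed ZeRO config sketch (illustrative values only).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,           # ZeRO-3 shards params, grads and optimizer state; use 2 for ZeRO-2
        "overlap_comm": True,
    },
}
# Such a dict is typically passed to deepspeed.initialize(config=ds_config)
# or saved as JSON and referenced from transformers' TrainingArguments(deepspeed=...).
```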

How about this one: https://github.com/microsoft/DeepSpeed/issues/2637 ? It seems the only option is to disable zero.init with Accelerate.
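A minimal sketch of disabling zero.init through Accelerate, assuming a recent accelerate release where `DeepSpeedPlugin` exposes `zero3_init_flag` (check the flag name against your installed version):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Keep ZeRO-3 sharding, but skip zero.Init so modules are materialized
# normally at construction time instead of being partitioned as they are created.
ds_plugin = DeepSpeedPlugin(
    zero_stage=3,
    zero3_init_flag=False,  # the "disable zero.init" switch
)

accelerator = Accelerator(deepspeed_plugin=ds_plugin)
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```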

> Actually @tohtana has just created a PR that is supposed to fix both issues: [microsoft/DeepSpeed#2989](https://github.com/microsoft/DeepSpeed/pull/2989)
>
> I will be able to try it probably tomorrow, but please go...

https://github.com/microsoft/DeepSpeed/issues/2637 still exists even with https://github.com/microsoft/DeepSpeed/pull/2989. My setup is described in https://github.com/huggingface/peft/issues/161.

> Thank you for testing [microsoft/DeepSpeed#2989](https://github.com/microsoft/DeepSpeed/pull/2989), @dumpmemory - sorry to hear it didn't resolve the leak - perhaps file a new issue in DS, as the one I posted I...

I have the same issue when training Mixtral 8x7B with transformers 4.36 and deepspeed 0.12.4 (also 0.12.3), ZeRO-3 with gradient_checkpointing enabled. It hangs after around 1.5 hours of training.
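A rough sketch of that setup as Trainer arguments; the config filename and batch size are placeholders, only the ZeRO-3 plus gradient checkpointing combination comes from the report above:

```python
from transformers import TrainingArguments

# Model and dataset are omitted; this only shows the flags involved in the hang.
args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_zero3_config.json",  # ZeRO-3 config file (placeholder name), deepspeed 0.12.4
    gradient_checkpointing=True,       # enabled in the run that hangs after ~1.5 hours
    per_device_train_batch_size=1,     # placeholder value
)
```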

I have also tried the tohtana/nested_zero_init branch, which did not fix it.