
Deadlock and CUDA OOM Issues with finetune_docowl.sh Using DeepSpeed Zero 2 and Zero 3

Open • cryingjin opened this issue 4 months ago • 0 comments

Hello. Thank you very much for sharing such great results. I really want to fine-tune and use this model.

As far as I understand, running finetune_docowl.sh deadlocks with DeepSpeed ZeRO stage 3 and stage 3 with offload (zero 3, zero-offload), and the same seems to happen with ZeRO stages 2 and 3 in finetune_docowl_lora.sh.

Currently, I also can't use finetune_docowl.sh with ZeRO stage 2, because it runs into a CUDA OOM error.

Am I understanding this correctly? If you have resolved any of these deadlock issues, please share how.

cryingjin, Oct 08 '24 04:10