fp32 Full Training seems to be taking a lot of memory
On 6 GPUs this is taking ~30GB/device, which doesn't seem right. This needs some debugging.
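For context, here is a rough back-of-envelope sketch of where that memory could be going. It assumes a 7B-parameter model, pure fp32 weights/gradients, and an Adam-style optimizer with state sharded by FSDP FULL_SHARD across 6 devices (the parameter count and optimizer are assumptions, not stated above):

```python
# Back-of-envelope estimate of per-device memory for fp32 full fine-tuning.
# Assumptions (not from the issue itself): 7B parameters, Adam-style optimizer
# (two fp32 states per parameter), FSDP FULL_SHARD across 6 GPUs.
num_params = 7e9
bytes_fp32 = 4
num_devices = 6

params = num_params * bytes_fp32             # ~28 GB of weights
grads = num_params * bytes_fp32              # ~28 GB of gradients
optim_states = 2 * num_params * bytes_fp32   # ~56 GB of exp_avg + exp_avg_sq

total = params + grads + optim_states        # ~112 GB before sharding
per_device_sharded = total / num_devices     # ~18.7 GB/device of sharded state

print(f"sharded state per device: {per_device_sharded / 1e9:.1f} GB")
# Activations, temporarily all-gathered (unsharded) parameters, and allocator
# fragmentation come on top of this, so ~30 GB/device is not implausible for
# pure fp32 training.
```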
Is this with the default configs, @kartikayk, or are you setting a higher batch size, which could contribute to activation memory?
@rohan-varma just BS=1 with the default config.
Hi @rohan-varma, I'm also seeing the same problem with full fine-tuning (fp32) on 8x V100 (32GB), with the following settings (a rough standalone sketch of this setup follows the list):
- activation checkpointing enabled
- FSDP: FULL_SHARD
- batch_size = 1
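A minimal, hypothetical sketch of roughly that setup (FULL_SHARD FSDP plus activation checkpointing) on a toy transformer stack; this is not the torchtune recipe, and the model, dimensions, and layer class are placeholders:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)


def build_toy_model() -> nn.Module:
    # Stand-in for a Llama-style decoder: a small stack of transformer layers.
    layers = [
        nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True)
        for _ in range(4)
    ]
    return nn.Sequential(*layers)


def main() -> None:
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = build_toy_model().cuda()

    # FULL_SHARD: parameters, gradients, and optimizer state are all sharded.
    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        auto_wrap_policy=ModuleWrapPolicy({nn.TransformerEncoderLayer}),
        device_id=torch.cuda.current_device(),
    )

    # Recompute activations in the backward pass instead of storing them.
    apply_activation_checkpointing(
        model,
        check_fn=lambda m: isinstance(m, nn.TransformerEncoderLayer),
    )

    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
    x = torch.randn(1, 128, 1024, device="cuda")  # batch_size = 1
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()


if __name__ == "__main__":
    main()
```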
Training can still run, but there is a high number of CUDA malloc retries. I'm not sure what the problem is, but in comparison with Mistral-7B-v0.1 (not using torchtune) I don't see such heavy GPU memory usage (the same config can run with batch_size = 2).
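For what it's worth, the retry count and peak usage can be read directly from the CUDA caching allocator. A small sketch, assuming it is called on each rank every few training steps (the `report_memory` helper and its placement are hypothetical):

```python
import torch


def report_memory(prefix: str = "") -> None:
    # num_alloc_retries counts cudaMalloc retries after the caching allocator
    # had to flush its cache -- a sign of running very close to the memory limit.
    stats = torch.cuda.memory_stats()
    print(
        f"{prefix}"
        f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB, "
        f"peak reserved: {torch.cuda.max_memory_reserved() / 1e9:.1f} GB, "
        f"alloc retries: {stats.get('num_alloc_retries', 0)}"
    )


# e.g. call report_memory(f"rank {rank}: ") every N steps inside the training loop
```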
@chris-tng Thanks for flagging this! Taking a look today and I should have an update this week.
Question - are you experiencing this on a llama3 or llama2 workload? I'll probably default to testing on the llama3 workload, but it would be good to confirm.
Hey @chris-tng, I am closing this since it's stale. If you are still having issues, please feel free to reopen it! Thanks :)