fp32 Full Training seems to be taking a lot of memory
On 6 GPUs this is taking ~30GB/device, which doesn't seem right. This needs some debugging.
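For context, here is a rough back-of-envelope sketch of where that memory could be going. It assumes a 7B-parameter model, pure fp32 weights/gradients, and an Adam-style optimizer with state sharded by FSDP FULL_SHARD across 6 devices (the parameter count and optimizer are assumptions, not stated above):

```python
# Back-of-envelope estimate of per-device memory for fp32 full fine-tuning.
# Assumptions (not from the issue itself): 7B parameters, Adam-style optimizer
# (two fp32 states per parameter), FSDP FULL_SHARD across 6 GPUs.
num_params = 7e9
bytes_fp32 = 4
num_devices = 6

params = num_params * bytes_fp32             # ~28 GB of weights
grads = num_params * bytes_fp32              # ~28 GB of gradients
optim_states = 2 * num_params * bytes_fp32   # ~56 GB of exp_avg + exp_avg_sq

total = params + grads + optim_states        # ~112 GB before sharding
per_device_sharded = total / num_devices     # ~18.7 GB/device of sharded state

print(f"sharded state per device: {per_device_sharded / 1e9:.1f} GB")
# Activations, temporarily all-gathered (unsharded) parameters, and allocator
# fragmentation come on top of this, so ~30 GB/device is not implausible for
# pure fp32 training.
```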
Is this with the default configs, @kartikayk, or are you setting a higher batch size, which could contribute to activation memory?
@rohan-varma just BS=1 with the default config.
Hi @rohan-varma, I'm also seeing the same problem with full fine-tuning (fp32) on 8x V100 (32GB), with the following settings (a rough standalone sketch of this setup follows the list):
- activation checkpointing enabled
- FSDP: FULL_SHARD
- batch_size = 1
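A minimal, hypothetical sketch of roughly that setup (FULL_SHARD FSDP plus activation checkpointing) on a toy transformer stack; this is not the torchtune recipe, and the model, dimensions, and layer class are placeholders:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)


def build_toy_model() -> nn.Module:
    # Stand-in for a Llama-style decoder: a small stack of transformer layers.
    layers = [
        nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True)
        for _ in range(4)
    ]
    return nn.Sequential(*layers)


def main() -> None:
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = build_toy_model().cuda()

    # FULL_SHARD: parameters, gradients, and optimizer state are all sharded.
    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        auto_wrap_policy=ModuleWrapPolicy({nn.TransformerEncoderLayer}),
        device_id=torch.cuda.current_device(),
    )

    # Recompute activations in the backward pass instead of storing them.
    apply_activation_checkpointing(
        model,
        check_fn=lambda m: isinstance(m, nn.TransformerEncoderLayer),
    )

    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
    x = torch.randn(1, 128, 1024, device="cuda")  # batch_size = 1
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()


if __name__ == "__main__":
    main()
```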
Training can still run, but there is a high number of CUDA malloc retries. I'm not sure what the problem is, but in comparison with Mistral-7B-v0.1 (not using torchtune) I don't see such heavy GPU memory usage (the same config can run with batch_size = 2).
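For what it's worth, the retry count and peak usage can be read directly from the CUDA caching allocator. A small sketch, assuming it is called on each rank every few training steps (the `report_memory` helper and its placement are hypothetical):

```python
import torch


def report_memory(prefix: str = "") -> None:
    # num_alloc_retries counts cudaMalloc retries after the caching allocator
    # had to flush its cache -- a sign of running very close to the memory limit.
    stats = torch.cuda.memory_stats()
    print(
        f"{prefix}"
        f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB, "
        f"peak reserved: {torch.cuda.max_memory_reserved() / 1e9:.1f} GB, "
        f"alloc retries: {stats.get('num_alloc_retries', 0)}"
    )


# e.g. call report_memory(f"rank {rank}: ") every N steps inside the training loop
```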
@chris-tng Thanks for flagging this! Taking a look today and I should have an update this week.
Question - are you experiencing this on a llama3 or llama2 workload? I'll probably default to testing on the llama3 workload, but it would be good to confirm.
Hey @chris-tng, I am closing this since it's stale. If you are still having issues, please feel free to reopen it! Thanks :)