llama-recipes
Not sure I have FSDP working properly; any insight on fine-tuning VRAM req's?
Sorry for the stupid question; I'm not sure why I can't seem to get to the bottom of this myself, and it's driving me nuts.
I've tried running the example finetuning script on the 7B model, on a local cluster with 8x V100s (16 GB each), using the example command from the README:
torchrun --nnodes 1 --nproc_per_node 8 ~/llama2/llama-recipes/examples/finetuning.py --enable_fsdp --pure_bf16 False --use_fp16 True --model_name ~/llama2/llama/llama-2-7b-hf --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint fine-tuned
but I'm going OOM right around the time the Training Epoch progress bars start showing up:
File "/home/dylanhubel/.local/lib/python3.11/site-packages/torch/optim/adamw.py", line 599, in _multi_tensor_adamw exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU 5 has a total capacty of 15.77 GiB of which 5.12 MiB is free. Including non-PyTorch memory, this process has 15.76 GiB memory in use. Of the allocated memory 13.76 GiB is allocated by PyTorch, and 540.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Training Epoch: 0/3, step 0/195 completed (loss: 2.062647819519043): 0%| | 0/195 [00:11<?, ?it/s]
When using PEFT as well, I do get through the first epoch, but go OOM when saving the checkpoints.
Just curious if anyone has insight on the actual VRAM requirements here; I was a little shocked to find the 7B model OOM'ing 128 GB of total VRAM, full-parameter or not.
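For what it's worth, here's the back-of-the-envelope math I've been doing, in case my mental model of what FSDP actually shards is off. The assumptions (fp32 master weights, fp32 gradients, two fp32 AdamW moments, everything evenly sharded across the 8 GPUs) are mine, not something I pulled from the repo:

```python
# Rough estimate of the *persistent* training state for full-parameter
# fine-tuning of a 7B model with AdamW under FSDP FULL_SHARD.
# Assumptions (mine, possibly wrong): fp32 master weights, fp32 gradients,
# and two fp32 AdamW moments (exp_avg, exp_avg_sq), all evenly sharded.
GiB = 1024**3

n_params = 7e9
n_gpus = 8

weights_fp32  = n_params * 4          # master parameters
grads_fp32    = n_params * 4          # gradients
adamw_moments = n_params * 4 * 2      # exp_avg + exp_avg_sq

total_state = weights_fp32 + grads_fp32 + adamw_moments
per_gpu = total_state / n_gpus

print(f"persistent state, all ranks: {total_state / GiB:6.1f} GiB")  # ~104 GiB
print(f"persistent state, per GPU  : {per_gpu / GiB:6.1f} GiB")      # ~13 GiB

# ~13 GiB of a 16 GiB V100 gone before activations, the temporarily
# all-gathered fp16 layer weights, and the CUDA context -- which lines up
# suspiciously well with the "13.76 GiB is allocated by PyTorch" in the OOM.
```

If that math is roughly right, full-parameter fine-tuning was never going to fit on these cards without CPU offload or some other memory saving, but I'd love confirmation that I'm not just holding FSDP wrong.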
Thanks for your time, and sorry again for what feels like a stupid question.