stanford_alpaca Error: CUDA out of memory when batch

I am using A100 40GB GPUs to fine-tune the LLaMA 7B model with the command below:

torchrun --nproc_per_node=8 --master_port=25001 train.py \
    --model_name_or_path  /home/model_zoo/llama/7B/hugging_face_format/ \
    --data_path /home/Repository/LLM/stanford_alpaca/alpaca_data.json \
    --bf16 True \
    --output_dir /home/Repository/LLM/stanford_alpaca/output/alpaca/sft_7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

When training finishes, the following error occurs:

{'train_runtime': 5162.7837, 'train_samples_per_second': 10.072, 'train_steps_per_second': 0.157, 'train_loss': 1.0267484738615347, 'epoch': 1.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 812/812 [1:26:02<00:00,  6.36s/it]

/opt/python3.10.11/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224: 
UserWarning: Failed to clone() tensor with name _fsdp_wrapped_module._fpw_module.model.layers.28.mlp.down_proj.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from 
state_dict() call if this parameter is managed by FSDP. 
Please check clone implementation of _fsdp_wrapped_module._fpw_module.model.layers.28.mlp.down_proj.weight. 
Error: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 3; 39.59 GiB total capacity; 35.81 GiB already allocated; 
79.19 MiB free; 37.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to 
avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
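
As a rough sanity check on the numbers in this message, here is a minimal sketch that compares the failed 172.00 MiB allocation with the size of one full `down_proj.weight`. The hidden size (4096), the MLP intermediate size (11008), and the fp32 element size are assumptions about LLaMA 7B and about how the parameter is cloned; none of them are stated in the log itself.

```python
MIB = 1024 ** 2

# Assumed LLaMA 7B dimensions (not taken from the log above).
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 11008
BYTES_PER_FP32 = 4  # assumption: the weight is cloned in fp32


def tensor_mib(rows, cols, bytes_per_element):
    """Size in MiB of a dense rows x cols tensor."""
    return rows * cols * bytes_per_element / MIB


# Size of one unsharded down_proj.weight under the assumptions above.
down_proj_mib = tensor_mib(HIDDEN_SIZE, INTERMEDIATE_SIZE, BYTES_PER_FP32)

# Figures copied from the error message.
requested_mib = 172.00
free_mib = 79.19
allocated_gib = 35.81
reserved_gib = 37.59

# Memory PyTorch has reserved but is not currently using for tensors.
cached_unused_gib = reserved_gib - allocated_gib

print(f"down_proj.weight in fp32: {down_proj_mib:.2f} MiB")
print(f"requested {requested_mib:.2f} MiB, free {free_mib:.2f} MiB")
print(f"reserved minus allocated: {cached_unused_gib:.2f} GiB")
```

If these assumptions hold, the failed request is the size of exactly one full fp32 copy of `down_proj.weight`, which would fit the clone() warning raised from state_dict() just before the error. The gap between reserved and allocated memory is also under 2 GiB, so reserved memory is not much larger than allocated memory here, and `max_split_size_mb` on its own may not be enough.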

What should I do to avoid this error? Thanks a lot!

Apr 07 '23 09:04 MrRace

I faced the same error and was able to resolve it using model().cuda().half()

But when I tested the model, the results I got looked something like this: https://user-images.githubusercontent.com/88507331/230518491-741b0f32-de9d-433c-ba6f-8d85085d7578.png

I'm not sure whether the same thing happens in your case. If it does, please let me know.