stanford_alpaca
finetuning on 3090, is it possible?
Is it possible to finetune the 7B model using 8*3090? I set:
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
but still got OOM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 23.70 GiB total capacity; 22.21 GiB already allocated; 127.56 MiB free; 22.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My script is as follows:
torchrun --nproc_per_node=4 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
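(For reference, the allocator hint from the error message can be tried first, though it only mitigates fragmentation and won't shrink the ~22 GiB already allocated; the 128 below is an illustrative starting value, not a tuned one:)

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then rerun the same torchrun command as above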
Try the AdamW 8-bit optimizer from:
https://github.com/TimDettmers/bitsandbytes/tree/ec5fbf4cc44324829307138a4c17fd88dddd9803
After installation, just add this flag to the script call:
--optim adamw_bnb_8bit
The current Transformers version natively supports bitsandbytes. (Standard AdamW keeps two 32-bit states per parameter, roughly 56 GB for a 7B model, so the 8-bit optimizer shrinks that by about 4x.)
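A minimal sketch of the change (assuming bitsandbytes installs cleanly via pip in your environment; building from the pinned commit above also works):

pip install bitsandbytes
# then append the optimizer flag to the torchrun call, keeping all other flags as before:
torchrun --nproc_per_node=4 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --output_dir ./output \
    --optim adamw_bnb_8bit
# (remaining flags as in the original script)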
With the AdamW 8-bit optimizer I run training on 4x Quadro RTX 8000 48GB. The RTX 8000 isn't an Ampere GPU, so instead of the bf16 and tf32 low-precision modes, I use fp16.
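A sketch of that substitution against the script above (only the two Ampere-specific precision flags change; --fp16 is the standard HF Trainer flag):

# On pre-Ampere GPUs (e.g. the Turing-era Quadro RTX 8000), swap the precision flags:
#   --bf16 True  ->  --fp16 True
#   --tf32 True  ->  (remove; tf32 requires Ampere tensor cores)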
That doesn't help us fine-tune on a single 24GB RTX 3090, though, does it?