stanford_alpaca
finetuning on 3090, is it possible?
Is it possible to finetune the 7B model using 8*3090? I set:
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
but still got OOM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 23.70 GiB total capacity; 22.21 GiB already allocated; 127.56 MiB free; 22.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My script is as follows:
torchrun --nproc_per_node=4 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
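(For reference, the allocator hint from the error message can be tried first, though it only mitigates fragmentation and won't shrink the ~22 GiB already allocated; the 128 below is an illustrative starting value, not a tuned one:)

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then rerun the same torchrun command as above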
Try the AdamW 8-bit optimizer from:
https://github.com/TimDettmers/bitsandbytes/tree/ec5fbf4cc44324829307138a4c17fd88dddd9803
After installation, just add this flag to the script call:
--optim adamw_bnb_8bit
The current Transformers version natively supports bitsandbytes. (Standard AdamW keeps two 32-bit states per parameter, roughly 56 GB for a 7B model, so the 8-bit optimizer shrinks that by about 4x.)
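A minimal sketch of the change (assuming bitsandbytes installs cleanly via pip in your environment; building from the pinned commit above also works):

pip install bitsandbytes
# then append the optimizer flag to the torchrun call, keeping all other flags as before:
torchrun --nproc_per_node=4 --master_port=12345 train.py \
    --model_name_or_path ../llama-7b-hf \
    --data_path ./alpaca_data.json \
    --output_dir ./output \
    --optim adamw_bnb_8bit
# (remaining flags as in the original script)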
With the AdamW 8-bit optimizer I run training on 4x Quadro RTX 8000 48GB. The RTX 8000 isn't an Ampere GPU, so instead of the bf16 and tf32 low-precision modes, I use fp16.
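A sketch of that substitution against the script above (only the two Ampere-specific precision flags change; --fp16 is the standard HF Trainer flag):

# On pre-Ampere GPUs (e.g. the Turing-era Quadro RTX 8000), swap the precision flags:
#   --bf16 True  ->  --fp16 True
#   --tf32 True  ->  (remove; tf32 requires Ampere tensor cores)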
That doesn't help us fine-tune on a single 24GB RTX 3090, though, does it?