
Finetuning on 16 Tesla K80 GPUs on EC2 Instance (p2.16xlarge)

Open ItsCRC opened this issue 2 years ago • 5 comments

I am trying to finetune Vicuna-7B on 16 Tesla K80 (12 GB) GPUs. I am getting: RuntimeError: Expected is_sm90 || is_sm8x || is_sm75 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.) My torch version is 2.0.0+cu117 on Ubuntu 20.04 with Python 3.8. Any help?

ItsCRC avatar May 03 '23 13:05 ItsCRC

I think the K80 is too old for flash-attention: https://github.com/HazyResearch/flash-attention/issues/148. Maybe you can run train.py instead of train_mem.py.

suc16 avatar May 04 '23 03:05 suc16
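For anyone landing here with other hardware, a quick way to check is to print the compute capability PyTorch reports. This is only a minimal sketch (it assumes nothing beyond a CUDA build of PyTorch): flash-attention's fused kernels want sm75/sm8x/sm90, and the K80 reports sm37, which is why the RuntimeError above fires.

```python
import torch

# flash-attention's fused kernels require sm75 (Turing), sm8x (Ampere) or
# sm90 (Hopper); the Tesla K80 reports sm37, hence the RuntimeError above.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    ok = (major, minor) >= (7, 5)
    print(f"GPU {i}: {name} sm{major}{minor} -> flash-attention supported: {ok}")
```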

Yeah, I read that, and ran train.py, but then got a CUDA out-of-memory error. Is there any other way? I also tried lowering per_device_batch_size to 1 for evaluation, but still hit the out-of-memory issue.

ItsCRC avatar May 04 '23 03:05 ItsCRC

try --fsdp "full_shard offload auto_wrap"?

Memory usage per GPU dropped from almost 40 GB to 32222 MiB after enabling CPU offload (my environment is 4×A100 40GB).

suc16 avatar May 04 '23 03:05 suc16
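For context, here is a rough sketch of roughly what those flags do underneath the HF Trainer: full-shard FSDP with parameter CPU offload, auto-wrapping each LlamaDecoderLayer. It assumes the script is launched with torchrun and uses the OP's local "Vicuna_Weights" path as a placeholder; it is not the exact code path FastChat/Transformers runs.

```python
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Launched via torchrun, which sets LOCAL_RANK / WORLD_SIZE for each process.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# "Vicuna_Weights" is just the local path from the OP's command.
model = AutoModelForCausalLM.from_pretrained("Vicuna_Weights")

# auto_wrap: shard at the granularity of each LlamaDecoderLayer.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # "full_shard"
    cpu_offload=CPUOffload(offload_params=True),    # "offload": params rest on CPU
    auto_wrap_policy=wrap_policy,                    # "auto_wrap"
    device_id=local_rank,
)
```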

OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 11.17 GiB total capacity; 1.87 GiB already allocated; 43.81 MiB free; 1.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ItsCRC avatar May 04 '23 06:05 ItsCRC
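The error message itself suggests trying max_split_size_mb to reduce fragmentation. A minimal sketch of one way to apply it is below; the 128 MiB value is only an illustration, and on 12 GB K80s it may well not be enough to fit a 7B model anyway.

```python
# Set the allocator option before the first CUDA allocation, e.g. at the very
# top of train.py (it can equally be exported in the shell before torchrun).
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the env var so the allocator sees it
```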

My command is:

torchrun --nproc_per_node=16 --master_port=20001 fastchat/train/train.py \
  --model_name_or_path Vicuna_Weights \
  --data_path dummy_data.json \
  --fp16 True \
  --output_dir Output_Weights \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1200 \
  --save_total_limit 10 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --fsdp "full_shard offload auto_wrap" \
  --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
  --model_max_length 2048 \
  --gradient_checkpointing True \
  --lazy_preprocess True

ItsCRC avatar May 04 '23 06:05 ItsCRC

I have switched to an instance with A100 GPUs and it is working now. Thank you.

ItsCRC avatar May 12 '23 06:05 ItsCRC