
Problem with finetuning bloom

Open raihan0824 opened this issue 1 year ago • 19 comments

What is the fsdp_transformer_layer_cls_to_wrap for bloom?

When I tried to fine-tune bloomz-7b1, the training got stuck at 0%. As the README says, that is most likely because I didn't set the right fsdp_transformer_layer_cls_to_wrap, but I can't find the correct value in the BLOOM config.

I'd appreciate any help on this. Thank you.

raihan0824 · Mar 21 '23 02:03
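One way to find the right class name to pass is to instantiate a BLOOM model and print its module class names; the transformer block class is what the flag expects. A minimal sketch, assuming the small bigscience/bloom-560m checkpoint (which uses the same block class as bloomz-7b1):

```bash
# Sketch: list module class names for a BLOOM model; the decoder block
# ('BloomBlock') is the value to pass to --fsdp_transformer_layer_cls_to_wrap.
python -c "
from transformers import AutoConfig, AutoModelForCausalLM
config = AutoConfig.from_pretrained('bigscience/bloom-560m')
model = AutoModelForCausalLM.from_config(config)  # random weights, checkpoint is not downloaded
print(sorted({type(m).__name__ for m in model.modules()}))
"
```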

I have the same question. Does the training code here only support LLaMA and OPT models? Can we fine-tune BLOOM with its official training framework using the stanford_alpaca data?

frankzhao112 · Mar 25 '23 06:03

any help on this?

raihan0824 · Mar 25 '23 15:03

No, I have the same issue. Do you know BELLE? They use BLOOM as the base model instead of LLaMA.

frankzhao112 · Mar 26 '23 01:03

> No, I have the same issue. Do you know BELLE? They use BLOOM as the base model instead of LLaMA.

I've read it and it's exactly what I'm looking for. However, I can't find the fine-tuning script. Any help on this?

raihan0824 · Mar 26 '23 06:03

It seems the fine-tuning script refers back to this repo, based on https://github.com/LianjiaTech/BELLE/issues/26, so we have the same problem.

raihan0824 · Mar 26 '23 07:03

I have the same issue.

quanliu1991 · Mar 27 '23 02:03

Are you Chinese? Let's just speak Chinese.

frankzhao112 · Mar 28 '23 06:03

You can check the BLOOM training code on the BLOOM GitHub. BLOOM has already open-sourced its training code, so I think you can find it there.

frankzhao112 · Mar 28 '23 06:03

change to this: --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' and it works

floodsung · Mar 29 '23 03:03

> change to this: --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' and it works

Thanks, but I still get an error: "tensor a (256905216) must match the size of tensor b (1027620864)". Is there a hyperparameter that needs to be fixed?

weberrr · Mar 31 '23 08:03

> change to this: --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' and it works

I still get the same error. Which BLOOM model are you running? Can you please share the training script?

raihan0824 · Mar 31 '23 09:03

how do you run it?

raihan0824 · Mar 31 '23 09:03

I used this to run with the original training script:

```bash
torchrun --nproc_per_node=3 --master_port=5001 train.py \
    --model_name_or_path bigscience/bloomz-7b1 \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./model_trained \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap ‘BloomBlock‘ \
    --tf32 True
```

and I get this error: Exception: Could not find the transformer layer class to wrap in the model.

raihan0824 · Mar 31 '23 09:03
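Note that the quotes around BloomBlock in the command above are typographic quotes (‘BloomBlock‘), so the shell passes them through as part of the argument and the trainer then looks for a class literally named ‘BloomBlock‘, which is likely why it reports that it cannot find the layer class to wrap. A sketch of that line with plain ASCII quotes:

```bash
# Use straight single quotes (or no quotes at all) around the class name:
    --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' \
```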

> how do you run it?

I can run my code; it loads the model and the data, but I still get a memory error like this:

CUDA out of memory. Tried to allocate 770.00 MiB (GPU 0; 79.35 GiB total capacity; 75.33 GiB already allocated; 679.19 MiB free; 77.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

weberrr · Mar 31 '23 09:03
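The error message itself points at one mitigation: setting max_split_size_mb via the allocator config to reduce fragmentation. A sketch, exported before launching torchrun (the 128 MB value is only an example to tune):

```bash
# Ask the CUDA caching allocator to split large cached blocks,
# which can reduce fragmentation when memory is nearly full.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```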

Change your transformers to >=4.23 and try again.

weberrr · Mar 31 '23 09:03
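For example, assuming pip manages the training environment:

```bash
# Upgrade transformers to at least 4.23.
pip install --upgrade "transformers>=4.23"
```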

I use transformers 4.27.4

raihan0824 · Mar 31 '23 09:03

> > how do you run it?
>
> I can run my code; it loads the model and the data, but I still get a memory error like this:
>
> CUDA out of memory. Tried to allocate 770.00 MiB (GPU 0; 79.35 GiB total capacity; 75.33 GiB already allocated; 679.19 MiB free; 77.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It's because you lack GPU memory; try running it on more GPUs.

raihan0824 · Mar 31 '23 09:03
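If more GPUs aren't available, two knobs in the same command may help; this is only a sketch, and it assumes your transformers version supports the FSDP offload option: a smaller per-device batch size with more gradient accumulation, plus CPU offload.

```bash
# Sketch: trade speed for memory while keeping the effective batch size
# (1 x 32 per device instead of 4 x 8), and offload sharded params to CPU.
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --fsdp "full_shard auto_wrap offload" \
```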

I think your code is ok

weberrr · Mar 31 '23 10:03