stanford_alpaca
Problem with fine-tuning BLOOM
What is the fsdp_transformer_layer_cls_to_wrap for BLOOM?
When I tried to fine-tune with bloomz-7b1, the training got stuck at 0%. As the README says, this is most likely because I didn't set the right fsdp_transformer_layer_cls_to_wrap, but I can't find it in the BLOOM config.
Any help would be appreciated. Thank you.
I have the same question. Does the training code here only support LLaMA or OPT models? Can we fine-tune BLOOM with the stanford_alpaca data using its official training framework?
any help on this?
No, I have the same issue. Do you know BELLE? They use BLOOM as the base model instead of LLaMA.
I've read it and it's exactly what I'm looking for. However, I can't find the fine-tuning script; any help on this?
It seems the fine-tuning script just points back to this repo, based on https://github.com/LianjiaTech/BELLE/issues/26, which is exactly our problem.
I have the same issue.
Are you Chinese? Let's just speak Chinese then.
You can check the BLOOM training code in the BLOOM GitHub repo. BLOOM has already open-sourced its training code, so I think you can find it there.
Change it to this: --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' and it works.
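In case it helps others wondering where 'BloomBlock' comes from: here is a minimal sketch of one way to find the decoder-block class name for a checkpoint yourself. It assumes transformers is installed, and it uses bigscience/bloom-560m only because that checkpoint shares the BLOOM architecture and is small enough to instantiate from config; swap in your own model name if you prefer.

python - <<'EOF'
# Build the model from its config only (only the small config file is downloaded;
# the weights are randomly initialized) and print the module class names.
# For BLOOM the list includes 'BloomBlock', which is the value to pass to
# --fsdp_transformer_layer_cls_to_wrap.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_config(config)
print(sorted({type(m).__name__ for m in model.modules()}))
EOF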
Thanks, but I still get an error: tensor a (256905216) must match the size of tensor b (1027620864). Is there a hyperparameter that needs to be fixed?
I still get the same error. What type of BLOOM model are you running? Can you please share the training script?
How do you run it?
I used this to run the original training script:
torchrun --nproc_per_node=3 --master_port=5001 train.py \
    --model_name_or_path bigscience/bloomz-7b1 \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./model_trained \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap ‘BloomBlock‘ \
    --tf32 True
and got this error:
Exception: Could not find the transformer layer class to wrap in the model.
How do you run it?
I can run my code, and it loads the model and data, but I still get a memory error like this:
CUDA out of memory. Tried to allocate 770.00 MiB (GPU 0; 79.35 GiB total capacity; 75.33 GiB already allocated; 679.19 MiB free; 77.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management an
Upgrade your transformers to >= 4.23 and try again.
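For example, one way to apply that suggestion (the version pin below is just the minimum mentioned above, not a tested requirement):

pip install -U "transformers>=4.23"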
I use transformers 4.27.4
It's because you don't have enough GPU memory; try running it on more GPUs.
I think your code is OK.
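If running on more GPUs isn't an option, here is a rough sketch of something else to try (not verified on this exact setup): trade per-device batch size for gradient accumulation so the effective batch size stays the same (1 x 32 instead of 4 x 8), which cuts per-step activation memory, and set the allocator hint that the OOM message itself suggests. All other flags are kept as in the command posted earlier in the thread.

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
torchrun --nproc_per_node=3 --master_port=5001 train.py \
    --model_name_or_path bigscience/bloomz-7b1 \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./model_trained \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' \
    --tf32 True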