[Chatllama] Supervised Finetune on LLaMA-7B
I tried python main.py artifacts/config/config_uie.yaml --type ACTOR with the SHP dataset, but got nan loss.
Here is my config_uie.yaml. The other parts (trainer_config, reward_config, critic_config) are the same as in the original config.yaml.
Could you please tell me how I can fix this problem? Thank you! :)
actor_config:
  model: "llama-7B"
  model_path: "/root/InstructUIE/run_llama/llama/7B"
  checkpoint_folder: "/root/InstructUIE/run_llama/llama/7B/checkpoints"
  tokenizer_folder: "/root/InstructUIE/run_llama/llama/tokenizer.model"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: True
  max_sequence_length: 1024
  max_tokens: 512
  temperature: 0.8
  batch_size: 1
  iteration_per_print: 1
  lr: 0.00001
  epochs: 3
  deepspeed_enable: False
  deepspeed_config_path: "/root/InstructUIE/ds_configs/stage2_llama.config"
Try to use bf16, it worked for me.
It works! Thanks a lot!
@cmnfriend Same problem. Could you please share your working config file that uses bf16? Thank you!
config.yaml
actor_config:
  model: "llama-7B"
  model_path: "/root/InstructUIE/run_llama/llama/7B"
  checkpoint_folder: "/root/InstructUIE/run_llama/llama/7B/checkpoints"
  tokenizer_folder: "/root/InstructUIE/run_llama/llama/tokenizer.model"
  train_dataset_path: "/root/InstructUIE/run_llama/llama/UIE_text2text/UIE_train.json"
  validation_dataset_path: "/root/InstructUIE/run_llama/llama/UIE_text2text/UIE_dev.json"
  froze_embeddings: True
  use_fairscale: True
  max_sequence_length: 1024
  max_tokens: 512
  temperature: 0.8
  batch_size: 1
  iteration_per_print: 1
  lr: 0.00001
  epochs: 3
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config_llama.json"
ds_config
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bfloat16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.0001,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.1
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    }
}
Thanks a lot, the training started and loss is not nan anymore. (with lr=0.0001)
But I also get logs like this:
[2023-03-13 14:10:53,354] [INFO] [stage_1_and_2.py:1784:step] [deepspeed] OVERFLOW! Rank 0 Skipping step.
Is that normal?
And have you trained 13B, or do you know how to train 7B with multiple GPUs?
I tried 13B with 2 GPUs but got:
size mismatch for output.weight: copying a param with shape torch.Size([16000, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
Oh, got an error:
Epoch: 1/32, Iteration: 205/367127, Training Loss: 2.546875
Traceback (most recent call last):
File "/data1/zhanyu/neox-test/chatllama-test/artifacts/main.py", line 51, in <module>
actor_trainer.train()
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 369, in train
est_output = self.model_engine(
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1832, in forward
loss = self.module(*inputs, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "<@beartype(chatllama.rlhf.actor.ActorModel.forward) at 0x7fcc6f3e40d0>", line 51, in forward
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 114, in forward
model_output = self.model.forward(
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 475, in forward
logits = self._forward(tokens, attention_mask)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 508, in _forward
h, _, _ = layer(h, kv_mask, freqs_cis)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 402, in forward
attn, cache_k, cache_v = self.attention.forward(
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 288, in forward
xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 195, in apply_rotary_emb
freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 181, in reshape_for_broadcast
assert freqs_cis.shape == (x.shape[1], x.shape[-1])
AssertionError
Got the same assertion error @bnuzhanyu
And have you trained 13B, or do you know how to train 7B with multiple GPUs?
For now I have trained LLaMA-7B with multiple GPUs.
I think you can simply modify def load_checkpoints in llama_model.py like this, and set use_fairscale to False:
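(A minimal sketch of the kind of change being described, assuming load_checkpoints follows the original LLaMA loader pattern; the exact signature and body in your chatllama version may differ.)

import json
from pathlib import Path

import torch


def load_checkpoints(ckpt_dir: str, local_rank: int, world_size: int):
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert len(checkpoints) > 0, f"no *.pth checkpoints found in {ckpt_dir}"
    # Original (fairscale tensor parallel): each rank loads its own shard,
    #     ckpt_path = checkpoints[local_rank]
    # Changed for data parallelism: every rank loads the same (single) 7B
    # shard, so each GPU holds a full copy of the model and DeepSpeed only
    # handles data-parallel training.
    ckpt_path = checkpoints[0]
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    with open(Path(ckpt_dir) / "params.json", "r") as f:
        params = json.loads(f.read())
    return checkpoint, params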
Then you can use deepspeed to automatically perform the necessary operations required for distributed data parallel training. https://www.deepspeed.ai/getting-started/
It may not work for 13B and larger LLaMA models, though.
I was wondering how it works. I think this modification is not model parallelism but more like data parallelism? @cmnfriend And did you meet the shape error?
Not yet...
OK, so are you using your own data or the data from the example (opt-1.3b)?
My own data.
Thanks, so I guess the problem was caused by the data. And is your training cmd run with deepspeed, torchrun, or just python? I suspect your modification may have no effect: each GPU just trains by itself and may overwrite the others' weights, or end up with the same weights (if using the same data).
Here is my training cmd: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 deepspeed main.py artifacts/config/config_uie.yaml --type ACTOR
It seems that each GPU trains with different data. If the world size is 6, the training on each GPU ends at Iteration: 193/1162 instead of Iteration: 1162/1162. But I'm not quite sure how they communicate with each other, so maybe you are right? Thanks for pointing it out :)
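(A rough illustration of the iteration arithmetic above, assuming the data is split evenly across ranks in a DistributedSampler-style fashion; DeepSpeed's exact bookkeeping may differ.)

# Why each of 6 ranks stops around iteration 193 out of 1162
total_iterations = 1162   # iterations for the full dataset on a single GPU
world_size = 6            # number of GPUs in the run
per_rank = total_iterations // world_size
print(per_rank)           # 193 -> each GPU only sees ~1/world_size of the data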
@cmnfriend Great work, thanks for your help!! It would be amazing if you could open a PR with the DeepSpeed-related updates you have made!
Try to use bf16, it worked for me.
With bf16, although the training loss seemed normal, the parameters of the model all turned to 0 after being saved. Did you run into the same problem?
I notice that before saving the model the dtype is bfloat16, while after saving it the dtype is float16. Maybe this is the problem?
The picture above shows the model parameters before running torch.save({"model": self.model.state_dict()}, path), and the picture below shows the model parameters after model.load_state_dict(checkpoint, strict=False).
————————————————————————————————————————————
I know why...
In actor.py, the checkpoint is saved as the value of the key "model", whereas the original LLaMA checkpoints are not wrapped like that. So def load_checkpoints in llama_model.py should be modified like this:
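(A sketch of the unwrap described above, written as a hypothetical helper; the actual change would go inside load_checkpoints in llama_model.py, and the names here are assumptions.)

import torch


def load_llama_state_dict(ckpt_path: str) -> dict:
    # actor.py saves the fine-tuned weights via
    # torch.save({"model": self.model.state_dict()}, path), while the original
    # LLaMA checkpoint is a bare state dict, so handle both layouts here.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    if "model" in checkpoint:
        checkpoint = checkpoint["model"]
    return checkpoint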
Thank you. After reading your answer, I successfully got the code to run on multiple GPUs, but got an error when saving (after modifying the code as you provided). May I ask if you have encountered it?
@81549361 Nope...
I saw the poster @cmnfriend's changes to the load_checkpoints() function. It seems like each GPU has the whole model (which means we are not doing model parallelism in this case). @bnuzhanyu
However, it is definitely a nice project for showing how to finetune 7B on multiple GPUs with DeepSpeed. Thanks. In my case, I only have 2 A100 (80G) GPUs, so I may need to figure out how to implement model parallelism with DeepSpeed for finetuning. @cmnfriend BTW, is it possible to get your QQ account so I can ask for your help?
Or can we create a QQ chat group for this, so everybody can join in and solve problems together?
She seems to be Japanese, does Japan also use QQ?😊
But the question is: if you look at the screenshot closely, you can see that her terminal is in Chinese. LOL
You have good eyesight!!
Did you run it successfully? I think we can post a QQ group chat number here so everybody can join in.
Sure!!
So the QQ group chat number is 397447632; everybody is welcome to join the discussion.
I have applied.