[Chatllama] Supervised Finetune on LLaMA-7B
I tried python main.py artifacts/config/config_uie.yaml --type ACTOR with the SHP dataset, but got nan loss.
Here is my config_uie.yaml. The other parts (trainer_config, reward_config, critic_config) are the same as in the original config.yaml.
Could you please tell me how I can fix this problem? Thank you! :)
actor_config:
  model: "llama-7B"
  model_path: "/root/InstructUIE/run_llama/llama/7B"
  checkpoint_folder: "/root/InstructUIE/run_llama/llama/7B/checkpoints"
  tokenizer_folder: "/root/InstructUIE/run_llama/llama/tokenizer.model"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: True
  max_sequence_length: 1024
  max_tokens: 512
  temperature: 0.8
  batch_size: 1
  iteration_per_print: 1
  lr: 0.00001
  epochs: 3
  deepspeed_enable: False
  deepspeed_config_path: "/root/InstructUIE/ds_configs/stage2_llama.config"
Try to use bf16, it worked for me.
It works! Thanks a lot!
@cmnfriend Same problem. Could you please share your working config file that uses bf16? Thank you!
config.yaml
actor_config:
  model: "llama-7B"
  model_path: "/root/InstructUIE/run_llama/llama/7B"
  checkpoint_folder: "/root/InstructUIE/run_llama/llama/7B/checkpoints"
  tokenizer_folder: "/root/InstructUIE/run_llama/llama/tokenizer.model"
  train_dataset_path: "/root/InstructUIE/run_llama/llama/UIE_text2text/UIE_train.json"
  validation_dataset_path: "/root/InstructUIE/run_llama/llama/UIE_text2text/UIE_dev.json"
  froze_embeddings: True
  use_fairscale: True
  max_sequence_length: 1024
  max_tokens: 512
  temperature: 0.8
  batch_size: 1
  iteration_per_print: 1
  lr: 0.00001
  epochs: 3
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config_llama.json"
ds_config
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bfloat16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.0001,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.1
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    }
}
Thanks a lot, the training started and loss is not nan anymore. (with lr=0.0001)
But I also get logs like this:
[2023-03-13 14:10:53,354] [INFO] [stage_1_and_2.py:1784:step] [deepspeed] OVERFLOW! Rank 0 Skipping step.
Is that normal?
And have you trained 13B, or do you know how to train 7B with multiple GPUs?
I tried 13B with 2 GPUs but got:
size mismatch for output.weight: copying a param with shape torch.Size([16000, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
Oh, got an error:
Epoch: 1/32, Iteration: 205/367127, Training Loss: 2.546875
Traceback (most recent call last):
File "/data1/zhanyu/neox-test/chatllama-test/artifacts/main.py", line 51, in <module>
actor_trainer.train()
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 369, in train
est_output = self.model_engine(
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1832, in forward
loss = self.module(*inputs, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "<@beartype(chatllama.rlhf.actor.ActorModel.forward) at 0x7fcc6f3e40d0>", line 51, in forward
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 114, in forward
model_output = self.model.forward(
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 475, in forward
logits = self._forward(tokens, attention_mask)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 508, in _forward
h, _, _ = layer(h, kv_mask, freqs_cis)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 402, in forward
attn, cache_k, cache_v = self.attention.forward(
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 288, in forward
xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 195, in apply_rotary_emb
freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
File "/data1/zhanyu/anaconda3/envs/neox/lib/python3.9/site-packages/chatllama/llama_model.py", line 181, in reshape_for_broadcast
assert freqs_cis.shape == (x.shape[1], x.shape[-1])
AssertionError
Got the same assertion error @bnuzhanyu
And have you trained 13B, or do you know how to train 7B with multiple GPUs?
For now I have trained LLaMA-7B with multiple GPUs.
I think you can simply modify def load_checkpoints in llama_model.py like this, and set use_fairscale to False:
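(A minimal sketch of the kind of change being described, assuming load_checkpoints follows the original LLaMA loader pattern; the exact signature and body in your chatllama version may differ.)

import json
from pathlib import Path

import torch


def load_checkpoints(ckpt_dir: str, local_rank: int, world_size: int):
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert len(checkpoints) > 0, f"no *.pth checkpoints found in {ckpt_dir}"
    # Original (fairscale tensor parallel): each rank loads its own shard,
    #     ckpt_path = checkpoints[local_rank]
    # Changed for data parallelism: every rank loads the same (single) 7B
    # shard, so each GPU holds a full copy of the model and DeepSpeed only
    # handles data-parallel training.
    ckpt_path = checkpoints[0]
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    with open(Path(ckpt_dir) / "params.json", "r") as f:
        params = json.loads(f.read())
    return checkpoint, params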
Then you can use deepspeed to automatically perform the necessary operations required for distributed data parallel training. https://www.deepspeed.ai/getting-started/
It may not work for 13B and larger LLaMA models, though.
I was wondering how it works. I think this modification is not model parallelism but more like data parallelism? @cmnfriend And did you meet the shape error?
Not yet...
OK, so are you using your own data or the data from the example (opt-1.3b)?
My own data.
Thanks, so I guess the problem was caused by the data. And is your training cmd run with deepspeed, torchrun, or just python? I suspect your modification may have no effect: each GPU just trains by itself and may overwrite the others' weights, or end up with the same weights (if using the same data).
Here is my training cmd: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 deepspeed main.py artifacts/config/config_uie.yaml --type ACTOR
It seems that each GPU trains with different data. If the world size is 6, the training on each GPU ends at Iteration: 193/1162 instead of Iteration: 1162/1162. But I'm not quite sure how they communicate with each other, so maybe you are right? Thanks for pointing it out :)
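(A rough illustration of the iteration arithmetic above, assuming the data is split evenly across ranks in a DistributedSampler-style fashion; DeepSpeed's exact bookkeeping may differ.)

# Why each of 6 ranks stops around iteration 193 out of 1162
total_iterations = 1162   # iterations for the full dataset on a single GPU
world_size = 6            # number of GPUs in the run
per_rank = total_iterations // world_size
print(per_rank)           # 193 -> each GPU only sees ~1/world_size of the data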
@cmnfriend Great work, thanks for your help!! It would be amazing if you could open a PR with the DeepSpeed-related updates you have made!
Try to use bf16, it worked for me.
With bf16, although the training loss seemed normal, the parameters of the model all turned to 0 after being saved. Did you run into the same problem?
I notice that before saving the model the dtype is bfloat16, while after saving it the dtype is float16. Maybe this is the problem?
The picture above shows the model parameters before running torch.save({"model": self.model.state_dict()}, path), and the picture below shows the model parameters after model.load_state_dict(checkpoint, strict=False).
————————————————————————————————————————————
I know why...
In actor.py, the checkpoint is saved as the value of the key "model", whereas the original LLaMA checkpoints are not wrapped like that. So def load_checkpoints in llama_model.py should be modified like this:
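(A sketch of the unwrap described above, written as a hypothetical helper; the actual change would go inside load_checkpoints in llama_model.py, and the names here are assumptions.)

import torch


def load_llama_state_dict(ckpt_path: str) -> dict:
    # actor.py saves the fine-tuned weights via
    # torch.save({"model": self.model.state_dict()}, path), while the original
    # LLaMA checkpoint is a bare state dict, so handle both layouts here.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    if "model" in checkpoint:
        checkpoint = checkpoint["model"]
    return checkpoint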
Thank you. After reading your answer, I successfully got the code to run on multiple GPUs, but got an error when saving (after modifying the code as you provided). May I ask if you have encountered it?
@81549361 Nope...
I saw the poster @cmnfriend's changes to the load_checkpoints() function. It seems like each GPU has the whole model (which means we are not doing model parallelism in this case). @bnuzhanyu
However, it is definitely a nice project for showing how to finetune 7B on multiple GPUs with DeepSpeed. Thanks. In my case, I only have 2 A100 (80G) GPUs, so I may need to figure out how to implement model parallelism with DeepSpeed for finetuning. @cmnfriend BTW, is it possible to get your QQ account so I can ask for your help?
Or can we create a QQ chat group for this, so everybody can join in and solve problems together?
She seems to be Japanese, does Japan also use QQ?😊
But the question is: if you look at the screenshot closely, you can see that her terminal is in Chinese. LOL
You have good eyesight!!
Did you run it successfully? I think we can post a QQ group chat number here so everybody can join in.
Sure!!
So the QQ group chat number is 397447632; everybody is welcome to join the discussion.
I have applied.