
step3_rlhf_finetuning may need two tokenizers?

Open xiangrongzeng opened this issue 2 years ago • 13 comments

In step3 of the RLHF finetuning, there is an actor and a critic. The actor and critic may require different tokenizers. For example, the actor may be opt-1.3B while the critic is bloom. However, there is only one tokenizer in the code. I'm wondering whether I misunderstand RLHF or this is indeed a bug.

xiangrongzeng avatar Apr 13 '23 03:04 xiangrongzeng

Hi @xiangrongzeng, this is a good point. Using models from different model families will require two tokenizers. We did not add this support for this release, since it would require de-tokenizing the sentences generated by the actor and then re-tokenizing them for, in your case, the critic model. We will discuss this feature internally.
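For reference, a minimal sketch of what that de-tokenize/re-tokenize bridge could look like (the model choices and variable names are just examples; this is not in the current code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

actor_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
critic_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
actor_model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

prompt = "Human: What is RLHF?\nAssistant:"
actor_inputs = actor_tokenizer(prompt, return_tensors="pt")

# 1) The actor generates token ids in its own vocabulary.
actor_output_ids = actor_model.generate(**actor_inputs, max_new_tokens=32)

# 2) De-tokenize the generated sequence back to plain text.
texts = actor_tokenizer.batch_decode(actor_output_ids, skip_special_tokens=True)

# 3) Re-tokenize with the critic's tokenizer before reward/value scoring.
critic_inputs = critic_tokenizer(texts, return_tensors="pt", padding=True)
```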

Also, note that we very much welcome users like you to create a PR and support this case :).

yaozhewei avatar Apr 13 '23 15:04 yaozhewei

More generally, the actor and critic can be two completely different models; having them come from the same model family is only a special case. I hope the code can be extended to load two different tokenizers for two different models. Thanks in advance.

guijuzhejiang avatar Apr 14 '23 02:04 guijuzhejiang

We will discuss the request internally. We welcome users to contribute as well :)

yaozhewei avatar Apr 18 '23 17:04 yaozhewei

@guijuzhejiang The author's suggestion is that both models belong to the same model family, as in the InstructGPT paper.

JingerAI avatar Apr 19 '23 04:04 JingerAI

@JingerAI Yes, the paper uses models from the same family with different parameter counts. But in theory the SFT and RM models can be any models. I don't think it is necessary to use a large language model for the RM when resources are limited, especially when training models in other languages, because the choice of pre-trained models is very limited there.

guijuzhejiang avatar Apr 20 '23 01:04 guijuzhejiang

Using the same tokenizer for the actor and critic in step3 is beneficial. Since the RM is easier to train, in step2 I try to use the actor's tokenizer during RM training, even when the RM comes from a different model family. Then, in step3, the actor and critic can come from different model families while sharing the same tokenizer.

In practice, I trained an opt-350m RM with the llama tokenizer. It works, but I haven't tried step3 yet.
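A rough sketch of what this pairing involves (illustrative only, not how DeepSpeed-Chat builds its RM; the tokenizer path is a placeholder, and the embedding resize is my reading of the step that makes the mismatch workable):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "path/to/llama-tokenizer" is a placeholder for a local llama tokenizer.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-tokenizer")
rm = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m", num_labels=1  # scalar reward head
)

# OPT's embedding table is indexed by OPT token ids, so it must be resized
# to the llama tokenizer's vocabulary. The pre-trained embeddings no longer
# line up with the new ids, which is the concern raised below.
rm.resize_token_embeddings(len(tokenizer))
```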

xiangrongzeng avatar Apr 20 '23 02:04 xiangrongzeng

@xiangrongzeng Thanks for this idea, but if the pre-trained opt-350m RM is finetuned with the llama tokenizer, I think the pre-trained opt-350m parameters become useless. Please correct me if I am wrong.

guijuzhejiang avatar Apr 20 '23 03:04 guijuzhejiang


How do we save the llama7b model? We use the original loading function in DeepSpeed, but get:

```
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /mnt/yutou/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/llama7b/iter1_score69336.5390625 and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', ..., 'model.layers.31.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```

(The same warning, listing the `rotary_emb.inv_freq` buffer of every layer, also appears for the base checkpoint at /path.)

Pattaro avatar Apr 20 '23 03:04 Pattaro

@guijuzhejiang Yes, the pre-trained parameters may not be helpful in this case. My assumption is that the RM is easier to train and does not rely heavily on the pre-trained parameters. We need more experiments to find out whether this is OK.

xiangrongzeng avatar Apr 20 '23 07:04 xiangrongzeng

@Pattaro This project does not support llama yet; I used the llama tokenizer with an OPT model. According to the introduction, the llama model will be officially supported soon.
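As for the warning itself: the `rotary_emb.inv_freq` entries are constant buffers that LLaMA recomputes deterministically at initialization, so their absence from the checkpoint should be harmless. A minimal sketch of how they are derived (assuming the standard rotary embedding; the values for llama-7b are shown):

```python
import torch

# Rotary-embedding inverse frequencies, recomputed at model init.
# dim = hidden_size / num_heads = 4096 / 32 = 128 for llama-7b; base = 10000.
dim, base = 128, 10000
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
```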

xiangrongzeng avatar Apr 20 '23 07:04 xiangrongzeng

@xiangrongzeng Will the project support GLM? It would be a benefit for Chinese users.

JingerAI avatar Apr 20 '23 07:04 JingerAI

@JingerAI Please create a new issue/request for supporting GLM :)

yaozhewei avatar Apr 20 '23 17:04 yaozhewei

@guijuzhejiang Hi, check this paper https://arxiv.org/pdf/2304.08177.pdf

"We initialize the Chinese-LLaMA model with the original LLaMA weights and pre-train the model on general Chinese corpora."

xiangrongzeng avatar Apr 24 '23 09:04 xiangrongzeng

@xiangrongzeng Thanks, great. Have you tried step3 PPO training? The generation (data collection) stage of the experiment is extremely time-consuming, and the performance of the training stage is also very unstable.

guijuzhejiang avatar Apr 25 '23 01:04 guijuzhejiang

Closed as there was no follow-up.

yaozhewei avatar May 05 '23 18:05 yaozhewei