DeepSpeedExamples
step3_rlhf_finetuning may need two tokenizers?
In step3 of RLHF finetuning, there is an actor and a critic. The actor and critic may require different tokenizers; for example, the actor could be opt-1.3B while the critic is bloom. However, there is only one tokenizer in the code. I'm wondering if I misunderstand RLHF or if this is indeed a bug.
Hi @xiangrongzeng, this is a good point. Using models from different model families will require two tokenizers. For this release, we did not add this support, since it would require de-tokenizing the actor-generated sentences and then re-tokenizing them for, in your case, the critic model. We will discuss this feature internally.
Also, note that we warmly welcome users like you to create a PR and add support for this case :).
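For concreteness, here is a minimal sketch of the de-tokenize/re-tokenize bridge described above, using Hugging Face `AutoTokenizer`. The model names and the `bridge_tokens` helper are illustrative, not part of DeepSpeed-Chat:

```python
from transformers import AutoTokenizer

# Illustrative model choices from the question above (actor: OPT, critic: BLOOM).
actor_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
critic_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

def bridge_tokens(actor_output_ids):
    # Decode the actor's generated ids back to plain text...
    texts = actor_tokenizer.batch_decode(actor_output_ids, skip_special_tokens=True)
    # ...then re-encode that text with the critic's tokenizer. Sequence lengths
    # and token boundaries generally differ between the two vocabularies, which
    # is part of why supporting two tokenizers is non-trivial.
    return critic_tokenizer(texts, return_tensors="pt", padding=True).input_ids
```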
The general case is that the actor and critic are two completely different models; the two models being the same is only a special case. I hope the tokenizer code can be extended to load two different models. Thanks in advance.
We will discuss this request internally. We welcome users to contribute as well :)
@guijuzhejiang The authors' suggestion is that both models belong to the same model family, as confirmed in the InstructGPT paper.
@JingerAI Yes, the paper uses models from the same family with different parameter counts. But in theory the SFT and RM models can be any models. I don't think it is necessary to use a large language model for the RM when resources are limited, especially when training models in other languages, where the choice of pre-trained models is very limited.
Using the same tokenizer for the actor and critic in step3 is beneficial. Since the RM is easier to train, in step2 I try to use the actor's tokenizer when training the RM, even if the RM comes from a different model family. Then, in step3, the actor and critic can come from different model families while sharing the same tokenizer.
In practice, I trained an opt-350m RM with the llama tokenizer, and it works. But I haven't tried step3 yet.
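A hypothetical sketch of this setup (the tokenizer path is a placeholder; `resize_token_embeddings` is the standard transformers call for adapting the vocabulary):

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder path: point this at whatever LLaMA tokenizer you have locally.
tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-tokenizer")
rm_backbone = AutoModel.from_pretrained("facebook/opt-350m")

# OPT's vocabulary (~50k tokens) differs from LLaMA's (32k), so the input
# embedding matrix must be resized before LLaMA token ids can be fed in.
# This breaks the original token-to-embedding alignment, which is exactly
# the caveat raised in the next comment.
rm_backbone.resize_token_embeddings(len(tokenizer))
```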
@xiangrongzeng Thanks for this idea, but if the pretrained opt-350m RM is fine-tuned with the llama tokenizer, I think the pretrained opt-350m parameters become useless. Please correct me if I am wrong.
How do we save the llama7b model? We use the original loading function in DeepSpeed, but get:

```
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /mnt/yutou/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/llama7b/iter1_score69336.5390625 and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', ..., 'model.layers.31.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```

(The full warning lists `model.layers.N.self_attn.rotary_emb.inv_freq` for every layer N from 0 to 31.)
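As background for this warning: `rotary_emb.inv_freq` is not a trainable weight but a buffer that is recomputed deterministically from the head dimension and the RoPE base when the model is constructed, so a checkpoint that omits it loses nothing. A short illustration, assuming LLaMA-7B defaults (head_dim=128, base=10000):

```python
import torch

# LLaMA-7B RoPE defaults (assumed values, for illustration only).
head_dim, base = 128, 10000.0

# This is how the "newly initialized" inv_freq buffer is computed: a fixed
# function of head_dim and base, identical for every fresh initialization.
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
print(inv_freq[:4])  # deterministic; no trained state is lost
```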
@guijuzhejiang Yes, the pre-trained parameters may not be helpful in this case. My assumption is that the RM is easier to train and does not rely heavily on the pre-trained parameters. We need more experiments to find out whether this is OK.
@Pattaro This project does not support llama yet; I used the llama tokenizer with an OPT model. According to the introduction, the llama model will be supported officially soon.
@xiangrongzeng Will the project support GLM? It would benefit Chinese users.
@JingerAI Please create a new issue/request for supporting GLM :)
@guijuzhejiang Hi, check this paper https://arxiv.org/pdf/2304.08177.pdf
> We initialize the Chinese-LLaMA model with the original LLaMA weights and pre-train the model on general Chinese corpora
@xiangrongzeng Thanks, great. Did you try step3 PPO training? The data-collection stage of the experiment is extremely time-consuming, and the training stage is also very unstable.
Closed as no follow-up.