Has AutoModelForCausalLMWithValueHead been abandoned in PPOv2Trainer?
System Info
I saw that in the PPOv2 example, the policy model is created directly with AutoModelForCausalLM.from_pretrained.
I want to know whether it is interchangeable with AutoModelForCausalLMWithValueHead.from_pretrained.
I also found that using AutoModelForCausalLMWithValueHead in PPOv2 gives faster PPO training than using AutoModelForCausalLM, and I wonder why that happens.
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder
- [ ] My own task or dataset (give details below)
Reproduction
Change the creation of the policy model in the PPOv2 example from AutoModelForCausalLM.from_pretrained to AutoModelForCausalLMWithValueHead.from_pretrained, as in the sketch below.
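A minimal sketch of the swap, assuming a placeholder checkpoint name (the official example script loads the model from its own CLI arguments instead):

```python
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

model_name = "EleutherAI/pythia-1b-deduped"  # placeholder; any causal LM checkpoint

# Original: plain causal LM, as in the PPOv2 example script
policy = AutoModelForCausalLM.from_pretrained(model_name)

# Modified: the same checkpoint wrapped with a scalar value head (trl class)
policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
```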
Expected behavior
I would like to know what causes the difference in training speed between the two model classes.