[Chatllama] Why use entropy as a part of the loss?
@PierpaoloSorbellini Hi, I’m confused about the policy loss in RLHF. Is the intention to minimize the entropy while updating the policy? If so, could you please point me to some related algorithms/papers for further reading? Thanks for the great work.
https://github.com/nebuly-ai/nebullvm/blob/ca085a979b5b596bf0ecd477e4c4deff3725661c/apps/accelerate/chatllama/chatllama/rlhf/trainer.py#L491-L516
Hi @HuangLK. Yes, here is the OpenAI paper on PPO: https://arxiv.org/pdf/1707.06347.pdf. I think the main reason for the KL term is to keep the size of the update steps under control: you don't want the policy to change drastically between two updates. As for the entropy term, I added it after consulting the paper; here is what they say about it: "This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration, as suggested in past work." That said, it is quite easy to remove the KL or the entropy term from the loss (for now just by changing a few lines of code) and compare the behaviour. I hope my answer was helpful, let me know what you think :)
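For reference, here is a minimal sketch of what a PPO-style actor loss with a clipped surrogate, a KL penalty and an entropy bonus can look like. The function and variable names (`eps_clip`, `kl_coef`, `entropy_coef`, ...) are illustrative and do not necessarily match what `trainer.py` does:

```python
# Minimal sketch of a PPO-style actor loss with a KL penalty and an entropy
# bonus. Names and coefficients are illustrative, not taken from trainer.py.
import torch


def ppo_actor_loss(
    new_logprobs: torch.Tensor,  # log pi_theta(a|s) under the current policy
    old_logprobs: torch.Tensor,  # log pi_theta_old(a|s), frozen at rollout time
    advantages: torch.Tensor,    # advantage estimates A(s, a)
    entropy: torch.Tensor,       # per-sample entropy of the current policy
    eps_clip: float = 0.2,
    kl_coef: float = 0.1,
    entropy_coef: float = 0.01,
) -> torch.Tensor:
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old.
    ratio = torch.exp(new_logprobs - old_logprobs)

    # Clipped surrogate objective: keeps the new policy close to the old one.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Rough KL estimate between the old and new policies, added as a penalty
    # so the policy does not change drastically between two update steps.
    approx_kl = (old_logprobs - new_logprobs).mean()

    # Entropy bonus: we want to *maximize* entropy, so it enters the
    # minimized loss with a minus sign (i.e. we minimize -entropy).
    entropy_bonus = entropy.mean()

    return policy_loss + kl_coef * approx_kl - entropy_coef * entropy_bonus
```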
If I understood correctly, the paper maximizes this entropy, but the code seems to minimize it.
Hi @HuangLK, good question! In the paper they maximize the entropy, while in the code we minimize -entropy, which is mathematically equivalent to their approach. Of course, it is possible that we made a sign error somewhere in the code; feel free to point it out and open a PR to correct it if that is the case 😄
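To make the sign convention concrete, here is a tiny, purely illustrative PyTorch check: putting `-entropy` into a loss and running gradient descent moves the distribution toward higher entropy, which is exactly the same update as gradient ascent on the entropy itself.

```python
# Sanity check of the sign convention: minimizing -entropy via gradient
# descent is the same update as maximizing entropy via gradient ascent.
import torch

logits = torch.randn(4, requires_grad=True)
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * torch.log(probs)).sum()

# Loss that *encourages* high entropy: note the minus sign.
loss = -entropy
loss.backward()

# logits.grad holds d(-entropy)/d(logits); an SGD step subtracts it,
# i.e. it moves logits in the direction that increases the entropy.
print(logits.grad)
```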