PaLM-rlhf-pytorch
Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM
Hi, I am confused about the 'value function' in the InstructGPT paper. The paper says, "As previously mentioned, for all PPO models we use a 6B RM and...
Hi, I am confused by the loss function of ChatGPT's reward model: it takes the difference between two responses as input and then passes it through a sigmoid function. However, the loss function...
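For readers following this question, here is a minimal sketch of the pairwise loss as the InstructGPT paper describes it: the reward model scores each response with a scalar, and the loss is the negative log-sigmoid of the difference between the preferred and the rejected score. The function name `pairwise_reward_loss` and its arguments are hypothetical and not taken from this repository; plain Python floats stand in for tensors.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration; both arguments are assumed to be
    scalar reward-model scores for two responses to the same prompt.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)). Branch on the sign of d so
    # that exp() is only ever called on a non-positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss is small when the preferred response scores higher,
# and large when the ranking is inverted.
loss_correct_order = pairwise_reward_loss(2.0, 0.0)
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)
```

Only the difference of the two scores enters the loss, so the reward model is trained on relative preference rather than on an absolute reward scale.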
https://github.com/lucidrains/PaLM-rlhf-pytorch/blob/6b02ee329106baff78e293afa7d1d2e6dd4e5ca2/palm_rlhf_pytorch/utils.py#L60 Using the sorted indices to index the sorted indices does not make sense. I think it may be `return logits.scatter(1, sorted_indices, sorted_logits)`
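To make the scatter step concrete, here is a rough sketch in plain Python that mimics what `scatter(1, sorted_indices, sorted_logits)` does for a single row: each filtered value is written back to the position it held before sorting. The helper `scatter_row` and the example values are hypothetical, and dropping the last entry stands in for the actual top-p filtering.

```python
def scatter_row(base, index, src):
    """Mimic a one-row scatter: out[index[j]] = src[j] for every j."""
    out = list(base)
    for j, pos in enumerate(index):
        out[pos] = src[j]
    return out

logits = [0.5, 2.0, -1.0, 1.0]

# Sort descending while remembering each value's original position.
sorted_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
sorted_logits = [logits[i] for i in sorted_indices]

# Stand-in for the filtering step: mask out the least likely token.
sorted_logits[-1] = float("-inf")

# Scatter the filtered values back into their original order.
restored = scatter_row(logits, sorted_indices, sorted_logits)

# Starting from the sorted row instead gives the same result in this sketch,
# because the index is a full permutation and every position is overwritten.
restored_from_sorted = scatter_row(sorted_logits, sorted_indices, sorted_logits)
```

In this sketch the two starting rows give identical results, so the choice between them affects readability rather than output; whether that carries over to the linked code depends on the tensors involved there.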