RRHF
This loss seems to consume a lot of memory.
The idea of this paper is really great and much easier to understand than PPO. However, if there are six candidate responses, the effective batch size has to be at least 6 for a single loss computation. With a large model, it seems hard for a single GPU to fit that forward pass. I also notice that the generation length in the paper is cut to 192 tokens, which is far below the 2048 typically configured in ordinary LLM training. Is memory the reason for that as well? Is there any optimization strategy to address this? For example, could each step compute only the rank loss of one pair of responses plus the SFT loss of the better response in that pair? I'm not sure whether that would be feasible; a rough sketch of what I mean is below.
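Something like the following sketch (not the paper's implementation; the function names are made up, and it assumes a Hugging Face-style causal LM whose forward returns `.logits`, with prompt/padding positions in `labels` set to `pad_id` so only response tokens are scored):

```python
import torch
import torch.nn.functional as F


def length_normalized_logprob(logits, labels, pad_id):
    # Average log-probability of the response tokens, ignoring pad/prompt positions.
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = torch.gather(logprobs, -1, labels.unsqueeze(-1)).squeeze(-1)
    mask = (labels != pad_id).float()
    return (token_logprobs * mask).sum(-1) / mask.sum(-1).clamp(min=1)


def pairwise_rrhf_step(model, input_ids_a, labels_a, input_ids_b, labels_b,
                       reward_a, reward_b, pad_id):
    # Forward only two candidates instead of all k, so peak memory is roughly
    # that of a batch of 2 rather than a batch of k. Assumes one pair per step.
    logits_a = model(input_ids_a).logits
    logits_b = model(input_ids_b).logits

    # Shift for next-token prediction, then score each response.
    p_a = length_normalized_logprob(logits_a[:, :-1], labels_a[:, 1:], pad_id)
    p_b = length_normalized_logprob(logits_b[:, :-1], labels_b[:, 1:], pad_id)

    # Rank loss for this single pair: hinge penalty when the lower-reward
    # response is scored higher than the higher-reward one.
    if reward_a >= reward_b:
        rank_loss = torch.relu(p_b - p_a).mean()
        best_logits, best_labels = logits_a, labels_a
    else:
        rank_loss = torch.relu(p_a - p_b).mean()
        best_logits, best_labels = logits_b, labels_b

    # SFT (cross-entropy) loss on the better response of the pair.
    sft_loss = F.cross_entropy(
        best_logits[:, :-1].reshape(-1, best_logits.size(-1)),
        best_labels[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
    return rank_loss + sft_loss
```

If I understand the loss correctly, the rank term in the paper is a sum of hinge terms over all response pairs, so accumulating gradients over the pairs of a group across micro-steps should recover the same rank loss; the SFT term would just need to be counted once per group rather than once per pair. Is there anything wrong with this approach?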