RRHF
This loss seems to consume a lot of memory.
The idea of this paper is really great and much easier to understand than PPO. However, if there are six candidate responses, the effective batch size has to be at least 6 for a single loss computation. With a large model, it seems hard for a single GPU to fit that forward pass. I also notice that the generation length in the paper is cut to 192 tokens, which is far below the 2048 typically configured in ordinary LLM training. Is memory the reason for that as well? Is there any optimization strategy to address this? For example, could each step compute only the rank loss of one pair of responses plus the SFT loss of the better response in that pair? I'm not sure whether that would be feasible; a rough sketch of what I mean is below.
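Something like the following sketch (not the paper's implementation; the function names are made up, and it assumes a Hugging Face-style causal LM whose forward returns `.logits`, with prompt/padding positions in `labels` set to `pad_id` so only response tokens are scored):

```python
import torch
import torch.nn.functional as F


def length_normalized_logprob(logits, labels, pad_id):
    # Average log-probability of the response tokens, ignoring pad/prompt positions.
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = torch.gather(logprobs, -1, labels.unsqueeze(-1)).squeeze(-1)
    mask = (labels != pad_id).float()
    return (token_logprobs * mask).sum(-1) / mask.sum(-1).clamp(min=1)


def pairwise_rrhf_step(model, input_ids_a, labels_a, input_ids_b, labels_b,
                       reward_a, reward_b, pad_id):
    # Forward only two candidates instead of all k, so peak memory is roughly
    # that of a batch of 2 rather than a batch of k. Assumes one pair per step.
    logits_a = model(input_ids_a).logits
    logits_b = model(input_ids_b).logits

    # Shift for next-token prediction, then score each response.
    p_a = length_normalized_logprob(logits_a[:, :-1], labels_a[:, 1:], pad_id)
    p_b = length_normalized_logprob(logits_b[:, :-1], labels_b[:, 1:], pad_id)

    # Rank loss for this single pair: hinge penalty when the lower-reward
    # response is scored higher than the higher-reward one.
    if reward_a >= reward_b:
        rank_loss = torch.relu(p_b - p_a).mean()
        best_logits, best_labels = logits_a, labels_a
    else:
        rank_loss = torch.relu(p_a - p_b).mean()
        best_logits, best_labels = logits_b, labels_b

    # SFT (cross-entropy) loss on the better response of the pair.
    sft_loss = F.cross_entropy(
        best_logits[:, :-1].reshape(-1, best_logits.size(-1)),
        best_labels[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
    return rank_loss + sft_loss
```

If I understand the loss correctly, the rank term in the paper is a sum of hinge terms over all response pairs, so accumulating gradients over the pairs of a group across micro-steps should recover the same rank loss; the SFT term would just need to be counted once per group rather than once per pair. Is there anything wrong with this approach?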