OpenRLHF
Forced EOS token in vllm generation?
I see in `RemoteExperienceMaker._generate_vllm()`, line 375, that for generations that don't finish, i.e. don't output the EOS token within the max token limit, we manually set the last token to be the EOS token, even though that is not what the model generated.
Isn't this the wrong thing to do? E.g. if the model generated an unfinished sentence like "This is an unfinished" when it ran into the token limit, shouldn't we train on that, rather than on "This is an" followed by a forced EOS? My understanding of the PPO algorithm is also that it doesn't do well with off-policy experiences, which we technically have if we manually swap in the EOS token. So I just wanted to check whether there's a specific reason to do this?
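For reference, this is roughly the behavior I mean (a simplified sketch in my own words, not the actual code at that line; the right-padding convention is my assumption):

```python
import torch

# If a sample runs into max_tokens without emitting EOS, its last generated
# token is overwritten with eos_token_id.
def force_trailing_eos(sequences: torch.Tensor, eos_token_id: int, pad_token_id: int) -> torch.Tensor:
    for row in sequences:
        non_pad = (row != pad_token_id).nonzero(as_tuple=True)[0]
        last = non_pad[-1].item() if non_pad.numel() > 0 else row.numel() - 1
        if row[last] != eos_token_id:
            # "... This is an unfinished" becomes "... This is an <eos>":
            # the token the model actually produced last is discarded
            row[last] = eos_token_id
    return sequences
```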
It also looks to me like the Hugging Face `model.generate()` method, and by extension `RemoteExperienceMaker._generate_local()` and `NaiveExperienceMaker`, do not do this.
Because the reward model uses the EOS token to predict the reward value. So we had to hack it.
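Concretely, the readout looks something like this (a simplified sketch of an assumed value-head design, not our exact reward model code):

```python
import torch
import torch.nn as nn

# The scalar reward is taken from the hidden state at the final EOS position,
# so a truncated sequence that never emits EOS has no well-defined readout
# position.
class RewardHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, sequences, eos_token_id):
        values = self.value_head(hidden_states).squeeze(-1)  # (batch, seq_len)
        is_eos = (sequences == eos_token_id).long()
        # argmax returns the first position where the cumulative count reaches
        # its maximum, i.e. the last EOS; it degenerates to index 0 when a row
        # contains no EOS at all
        eos_indices = is_eos.cumsum(dim=1).argmax(dim=1)
        return values.gather(1, eos_indices.unsqueeze(1)).squeeze(1)  # (batch,)
```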
Ahhhh, got it, that makes sense. I think that's probably broken with local generation then! I just verified that local generation doesn't produce an EOS token when `max_tokens` is reached.
Also, would taking the RM output on the last non-masked token instead of EOS be a better way around this? Or alternatively, forcing the EOS token only when feeding the experience to the RM?
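Something like this is what I have in mind (just a sketch of my suggestion, not current OpenRLHF behavior; the `values` tensor and right-padding assumption are mine):

```python
import torch

# Read the per-token value at the last non-masked position, so truncated
# generations don't need any token rewriting. Assumes right padding with a
# contiguous attention_mask.
def reward_at_last_token(values: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    last_indices = attention_mask.sum(dim=1) - 1  # last attended index per row
    return values.gather(1, last_indices.unsqueeze(1)).squeeze(1)
```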
I am not sure which approach to take at the moment, but our current implementation is heavily dependent on EOS tokens.
You mean specifically for the RM? Or more broadly than that?
Oh, for local generation, does `actor.process_sequences()` do the same thing? https://github.com/OpenLLMAI/OpenRLHF/blob/bed10e115523a9eca419cb058ede8e531d23c182/openrlhf/models/actor.py#L159 If so, then doing this in `RemoteExperienceMaker` seems unnecessary, since that also calls `actor.process_sequences()` later anyway, i.e. this is being done twice, I think?
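For reference, here is my paraphrase of the linked function (simplified, may not match the real code exactly): build the mask by excluding EOS and pad tokens, locate the first position after the last real token, then force both the token and the mask there to EOS / True.

```python
import torch

def process_sequences(sequences: torch.Tensor, eos_token_id: int, pad_token_id: int):
    attention_mask = (sequences.ne(eos_token_id) & sequences.ne(pad_token_id)).long()
    seq_length = attention_mask.size(1)
    # trailing masked positions counted from the right, clamped to >= 1, which
    # is exactly what forces EOS onto the last real token when a row was
    # truncated and nothing at its end was masked
    trailing = attention_mask.fliplr().argmax(dim=1, keepdim=True).clamp(min=1)
    eos_indices = seq_length - trailing
    sequences.scatter_(dim=1, index=eos_indices, value=eos_token_id)  # force EOS
    attention_mask.scatter_(dim=1, index=eos_indices, value=1)        # unmask it
    return sequences, attention_mask
```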
@hijkzzz Could I ask a quick related question: in `actor.process_sequences()` I also see that `attention_mask` is set to False on all EOS tokens except the final EOS token in each sequence. In the datasets used in the examples, it seems that there is never an EOS token in the input, but if there were one (e.g. from a previous turn in a conversation), shouldn't the attention mask be True there?
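A tiny repro of what I mean (hypothetical token ids):

```python
import torch

eos, pad = 2, 0  # hypothetical token ids
seq = torch.tensor([[5, 6, eos, 7, 8, eos, pad, pad]])  # turn 1 <eos> turn 2 <eos> <pad>...
mask = (seq.ne(eos) & seq.ne(pad)).long()
print(mask)  # tensor([[1, 1, 0, 1, 1, 0, 0, 0]])
# Only the final EOS (index 5) gets unmasked again by the scatter step;
# the mid-conversation EOS at index 2 stays masked out.
```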
You would need to change to a new special token in that case.
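I.e. something along these lines (a sketch of that suggestion, not a tested recipe; the model name and the `<|sep|>` string are placeholders):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Register a dedicated turn separator so that <eos> only ever appears once,
# at the very end of a sequence.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_special_tokens({"additional_special_tokens": ["<|sep|>"]})
model.resize_token_embeddings(len(tokenizer))

# Earlier turns end with <|sep|>; only the final response ends with <eos>.
text = "User: hi<|sep|>Assistant: hello<|sep|>User: thanks"
```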