Open-Assistant
Add reward model scoring during SFT evaluation
Currently we still need to manually run sampling_score.py on our sampling reports after training. To simplify the evaluation process and to get a score from our RM earlier during training, this evaluation should be integrated directly into the SFT training process as an additional evaluation metric that is reported to wandb. At the moment only classic evaluation is done, which computes the loss and accuracy scores on an evaluation set.
For RL we trained several reward models which can be found on our Huggingface page:
- 1.4B pythia-based: oasst-rm-2.1-pythia-1.4b-epoch-2.5
- 6.9B pythia-based: OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1

The HF model pages include a short snippet showing how to load the RM and how to compute a reward score for samples.
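As a rough sketch of what scoring with one of these RMs could look like (following the pattern from the HF model cards; `build_rm_input` and `score_sample` are helper names invented here, and the exact chat format should be verified against the model card):

```python
from typing import List


def build_rm_input(turns: List[str]) -> str:
    # The OASST pythia-based RMs expect the oasst chat format, e.g.
    # <|prompter|>question<|endoftext|><|assistant|>answer<|endoftext|>
    # (assumption based on the model cards; verify before relying on it).
    roles = ["<|prompter|>", "<|assistant|>"]
    return "".join(f"{roles[i % 2]}{t}<|endoftext|>" for i, t in enumerate(turns))


def score_sample(
    turns: List[str],
    model_name: str = "OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1",
) -> float:
    # Heavy imports kept local: loading the 6.9B RM downloads several GB.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    inputs = tokenizer(build_rm_input(turns), return_tensors="pt")
    with torch.no_grad():
        # The RM is a sequence classifier with a single logit = reward score.
        return model(**inputs).logits[0].item()
```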
Several things should be considered:
- Memory usage: the reward model does not need to stay on the GPU the whole time; it could either be kept entirely in CPU memory or be loaded onto the GPU only for the eval step.
- The RM uses the tokenizer configuration of Pythia, while we also train models with other tokenizers that have different EOS token representations (e.g. pythia's `<|endoftext|>` vs. llama's `</s>`). For a meaningful evaluation, the special tokens of the trained model need to be replaced with the pythia ones before tokenization. In general it is not possible to directly forward the token ids: a token-to-text conversion with the trained model's tokenizer needs to happen first, followed by tokenizing that text again to feed it into the RM.
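The decode-remap-retokenize step above could be sketched roughly like this (a minimal illustration; `reencode_for_rm` is a hypothetical helper, and the token mapping shown covers only the llama-to-pythia EOS example from the list):

```python
from typing import Dict, List, Optional


def reencode_for_rm(
    token_ids: List[int],
    trained_tokenizer,
    rm_tokenizer,
    token_map: Optional[Dict[str, str]] = None,
):
    """Convert token ids from the trained model into RM inputs.

    Decodes with the trained model's tokenizer (keeping special tokens),
    maps its special tokens to the pythia equivalents, then re-tokenizes
    the text with the RM's tokenizer.
    """
    # Example mapping only: llama-style EOS -> pythia EOS.
    token_map = token_map or {"</s>": "<|endoftext|>"}
    text = trained_tokenizer.decode(token_ids, skip_special_tokens=False)
    for src, dst in token_map.items():
        text = text.replace(src, dst)
    return rm_tokenizer(text, return_tensors="pt")
```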
If you are new to the codebase, looking at trainer_sft.py and checking how to add a custom step into the HF Trainer derived class would probably be a good first step.
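A minimal sketch of what such a hook could look like. The `Trainer` base here is a stub standing in for the real HF `Trainer` (or the project's SFT trainer class); `score_fn` is a hypothetical callable that samples completions and scores them with the RM. It also illustrates the memory point above: the RM lives on CPU and visits the accelerator only for the eval step.

```python
import torch


class Trainer:  # stub standing in for transformers.Trainer, illustration only
    def evaluate(self, *args, **kwargs):
        return {"eval_loss": 0.0}


class RMScoringTrainer(Trainer):
    """Trainer subclass that adds a reward-model score to each evaluation."""

    def __init__(self, reward_model: torch.nn.Module, score_fn):
        self.reward_model = reward_model
        self.score_fn = score_fn  # callable(reward_model) -> mean reward (float)

    def evaluate(self, *args, **kwargs):
        metrics = super().evaluate(*args, **kwargs)
        device = "cuda" if torch.cuda.is_available() else "cpu"
        # Keep the RM on CPU between evals; move it to the accelerator only
        # for this step to limit GPU memory pressure during training.
        self.reward_model.to(device)
        with torch.no_grad():
            metrics["eval_rm_score"] = self.score_fn(self.reward_model)
        self.reward_model.to("cpu")
        # Metrics returned from evaluate() are picked up by the wandb logger.
        return metrics
```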
Working on this!
@ash3n Haven't heard back for a while. Are you still working on this, or should we hand the issue over to someone else?
waiting for updates of this IMPORTANT WORK
Sorry guys, I placed this in a separate PR by accident. Was so tired that I was not thinking. #3260 https://github.com/LAION-AI/Open-Assistant/pull/3260