
Train a reward model based on Instructor

Open · andreaskoepf opened this issue 2 years ago · 5 comments

Add a scalar last-token reward head to Instructor and train it on human-feedback pairs (good vs. bad) from the openai/summarize-from-feedback dataset (see the "Learning to summarize from human feedback" paper for details about the objective).

  • place your training code in a new model/reward/instructor folder
  • please use wandb for experiment tracking; measure at least loss and accuracy (a pair counts as correct when the good example scores higher than the bad one)
  • try to avoid modifying the original model; if possible, compose with the existing model (i.e. add it as a member of the new model class); a minimal sketch follows this list
  • compare with results from #78
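
Roughly, the composed reward model and the pairwise objective could look like the sketch below. This is only a minimal illustration, not the repo's implementation: the class and function names are made up, and it assumes a backbone whose forward pass returns `last_hidden_state` directly (decoder-only models such as bloomz do; for the T5-based Instructor you would call its encoder instead).

```python
# Sketch: wrap an existing pretrained model and add a scalar last-token reward head.
import torch
import torch.nn as nn
from transformers import AutoModel


class RewardModel(nn.Module):
    def __init__(self, base_model_name: str):
        super().__init__()
        # Keep the pretrained model as a member instead of modifying it.
        self.backbone = AutoModel.from_pretrained(base_model_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                   # (batch, seq, hidden)
        # Take the hidden state of the last non-padding token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1               # (batch,)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]              # (batch, hidden)
        return self.reward_head(last_hidden).squeeze(-1)       # scalar reward per example


def pairwise_loss_and_accuracy(reward_good, reward_bad):
    # Objective from "Learning to summarize from human feedback":
    # minimize -log(sigmoid(r_good - r_bad)); accuracy is the fraction of
    # pairs where the preferred example gets the higher score.
    loss = -torch.nn.functional.logsigmoid(reward_good - reward_bad).mean()
    accuracy = (reward_good > reward_bad).float().mean()
    return loss, accuracy
```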

Background: We want to implement the RLHF stack for Open-Assistant in parallel with our data collection effort. As a temporary stand-in we use existing RLHF datasets such as OpenAI's summarize-from-feedback data for model development. Instructor was proposed as a promising base-model candidate for a reward model.

You could use bits of the reward-model training code I wrote a couple of weeks ago, which contains data loading code for the summarize-from-feedback data, as inspiration. If you like, you can of course use a framework like pytorch_lightning.
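
For the data side, forming (good, bad) text pairs from the comparisons could look roughly like this. The dataset id, config name, and field names (`info`, `summaries`, `choice`) assume the schema of the Hugging Face Hub copy of the dataset; double-check them against whatever loader you actually use.

```python
# Sketch: build (good, bad) text pairs from the summarize-from-feedback comparisons.
from datasets import load_dataset


def build_pairs(split="train"):
    ds = load_dataset("openai/summarize_from_feedback", "comparisons", split=split)
    pairs = []
    for example in ds:
        post = example["info"]["post"] or ""               # source text being summarized
        summaries = example["summaries"]                    # two candidate summaries
        good = summaries[example["choice"]]["text"]         # human-preferred summary
        bad = summaries[1 - example["choice"]]["text"]
        pairs.append((post + "\nTL;DR: " + good, post + "\nTL;DR: " + bad))
    return pairs
```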

andreaskoepf · Dec 27 '22 07:12

I had one here, would be glad to contribute

theblackcat102 · Dec 27 '22 08:12

> I had one here, would be glad to contribute

Nice, I assigned the issue to you.

Since you already trained an RM based on bigscience/bloomz-560m: do you think you could add loading code (e.g. see my linked code above) for the OpenAI summaries and train it on them? That would give us a datapoint for another RM. Did you record training metrics with wandb? Could you make it public?
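
For reference, a bare-bones training loop that logs loss and pairwise accuracy to wandb could look like the sketch below. It reuses the hypothetical `RewardModel`, `pairwise_loss_and_accuracy`, and `build_pairs` pieces from the earlier sketches in this thread, so it is not self-contained; the base-model id, hyperparameters, and project name are placeholders.

```python
# Sketch: train the reward head on (good, bad) pairs and track metrics with wandb.
import torch
import wandb
from transformers import AutoTokenizer

model_name = "bigscience/bloomz-560m"           # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = RewardModel(model_name).cuda()           # from the earlier sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

wandb.init(project="oa-reward-model", config={"base_model": model_name})

pairs = build_pairs("train")                     # from the earlier sketch
batch_size = 8
for start in range(0, len(pairs), batch_size):
    good_texts, bad_texts = zip(*pairs[start : start + batch_size])

    def encode(texts):
        enc = tokenizer(list(texts), padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        return enc["input_ids"].cuda(), enc["attention_mask"].cuda()

    reward_good = model(*encode(good_texts))
    reward_bad = model(*encode(bad_texts))
    loss, accuracy = pairwise_loss_and_accuracy(reward_good, reward_bad)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    wandb.log({"loss": loss.item(), "accuracy": accuracy.item()})
```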

andreaskoepf · Dec 27 '22 08:12

Sure! I will train a variety of models and push them to Hugging Face (if that's fine). I already have a WebGPT RM model on Hugging Face.

theblackcat102 · Dec 27 '22 08:12

We currently have webgpt, anthropic, summarization, xp3 and unnatural instructions as possible datasets for reward model evaluation. I suggest that we get some data for summarization first on all models. If you or someone else has time to test the other datasets, that would be great too.

andreaskoepf · Dec 27 '22 10:12

> I will train a variety of models and push them to Hugging Face (if that's fine).

Of course! Could you submit a PR with your training code so that we have it in the repo?

andreaskoepf · Dec 28 '22 20:12