
ML Overview [temporary coordination issue, will be split up]

Open andreaskoepf opened this issue 2 years ago • 2 comments

Action Plan for ML-Team

1. Data mixes

  • [ ] create a list of all datasets under consideration for OA SFT, identify datasets that need further processing (e.g. multi-turn and need to be converted to OA jsonl format), list will be created as OA SFT Dataset Quality & Data Mix sheet.
  • [ ] write loaders, make sure all datasets can be loaded
  • [ ] generate dataset statistics (number of messages, number of turns in conversations)
  • [ ] manually assess the quality (subjective opinion) of sampled subsets of the datasets
  • [ ] determine fraction of each dataset to be used for SFT (e.g. which language, how many messages), goal: balanced dataset
  • [ ] prepare two-stage training configuration: stage 1: wide dataset mix (including potentially lower quality data); stage 2: fine-tuning on (smaller) high quality dataset (i.e. best data from OIG & OA only)
  • [ ] test sampling and inspect batches
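The statistics step above could look roughly like the following sketch. The exact OA jsonl schema is an assumption here: each line is taken to hold one conversation as `{"thread": [{"role": ..., "text": ...}, ...]}`, which may differ from the final v2 format.

```python
import io
import json
from collections import Counter

def dataset_stats(jsonl_file):
    """Count messages and the turn-length distribution for one dataset.

    Assumes (hypothetically) that each jsonl line is a conversation of the
    form {"thread": [{"role": "prompter"|"assistant", "text": ...}, ...]}.
    """
    n_messages = 0
    turn_counts = Counter()
    for line in jsonl_file:
        thread = json.loads(line)["thread"]
        n_messages += len(thread)
        turn_counts[len(thread)] += 1
    return {"messages": n_messages, "turns": dict(turn_counts)}

sample = io.StringIO(
    '{"thread": [{"role": "prompter", "text": "hi"}, {"role": "assistant", "text": "hello"}]}\n'
    '{"thread": [{"role": "prompter", "text": "a"}, {"role": "assistant", "text": "b"}, '
    '{"role": "prompter", "text": "c"}, {"role": "assistant", "text": "d"}]}\n'
)
print(dataset_stats(sample))  # {'messages': 6, 'turns': {2: 1, 4: 1}}
```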

2. Tokenization

  • [ ] end all assistant messages with special <User> token (v2 format)
  • [ ] check options to tokenize numbers as single tokens per digit and run experiment to assess "math skills" (standard tokenization vs. per-digit tokenization)
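A real per-digit experiment would change the tokenizer's vocabulary or merge rules; as a minimal illustration of the idea, numbers can be pre-split in the text so that each digit becomes its own token candidate:

```python
import re

def split_digits(text):
    """Insert a space between adjacent digits so a standard tokenizer
    sees each digit separately (a pre-processing sketch of per-digit
    tokenization, not a tokenizer modification)."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(split_digits("123+456"))  # "1 2 3+4 5 6"
```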

3. Evaluation

  • [ ] create list of most useful benchmarks to run for our model (e.g. to measure "fine-tuning tax", loss of academic benchmark scores)
  • [ ] create a script to run automatic benchmarks for our model
  • [ ] implement one useful benchmark, e.g. 'wizard of wikipedia' or 'wizard of the internet'. (see https://github.com/LAION-AI/Open-Assistant/issues/1908)
  • [ ] create a RM manual evaluation script as sanity check, similar to sampling-report.py
  • [ ] adapt sampling-report.py script to stop sampling on <user> tokens
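For the last item, stopping on the `<user>` token during generation is, applied post-hoc, equivalent to truncating the sampled token sequence at its first occurrence. A sketch (the stop token id is a hypothetical value, not the real vocabulary id):

```python
def truncate_at_stop_token(token_ids, stop_id):
    """Cut a sampled sequence at the first occurrence of the stop token.

    In sampling-report.py one would ideally stop decoding as soon as the
    <user> special token is emitted; truncating afterwards gives the same
    output text.
    """
    try:
        return token_ids[: token_ids.index(stop_id)]
    except ValueError:  # stop token never sampled
        return token_ids

print(truncate_at_stop_token([5, 9, 3, 42, 7], stop_id=42))  # [5, 9, 3]
```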

4. Reward model

  • [ ] review our reward model code, do we have an impl that matches the loss in the InstructGPT paper? What are possible improvements?
  • [ ] discuss options to incorporate additional OA data besides ranks (labels, emojis, deleted status) into the dataset for RM training
  • [ ] run evaluation, e.g. what is the accuracy of the RM on test-set?
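For reference, the InstructGPT paper trains the reward model with a pairwise ranking loss over chosen/rejected reply pairs, `-log(sigmoid(r_chosen - r_rejected))`. A scalar sketch of that loss:

```python
import math

def rm_pairwise_loss(r_chosen, r_rejected):
    """Pairwise ranking loss from InstructGPT:
    -log(sigmoid(r_chosen - r_rejected)).
    At a zero margin the loss is log(2); it shrinks as the reward
    model scores the chosen reply increasingly above the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(rm_pairwise_loss(1.0, 1.0), 4))  # log(2) ~ 0.6931
```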

5. Synthetic assistant replies

  • [ ] generate multiple diverse synthetic replies per prompt (to be filtered with the reward model)
  • [ ] filter reply candidates with reward model, load them in DB for ranking by humans
  • [ ] write script to compare actual user rankings and RM predictions
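The comparison script could report, for example, pairwise agreement: the fraction of reply pairs that the RM orders the same way the human rankings do. A minimal sketch (the data structures are assumptions, not the DB schema):

```python
from itertools import combinations

def pairwise_agreement(human_rank, rm_score):
    """Fraction of reply pairs where the RM agrees with the human ordering.

    human_rank: dict reply_id -> rank (0 = best, i.e. lower is better)
    rm_score:   dict reply_id -> RM score (higher is better)
    """
    agree = total = 0
    for a, b in combinations(sorted(human_rank), 2):
        total += 1
        human_prefers_a = human_rank[a] < human_rank[b]
        rm_prefers_a = rm_score[a] > rm_score[b]
        agree += human_prefers_a == rm_prefers_a
    return agree / total

ranks = {"r1": 0, "r2": 1, "r3": 2}
scores = {"r1": 0.9, "r2": 0.2, "r3": 0.5}
print(pairwise_agreement(ranks, scores))  # 2 of 3 pairs agree
```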

6. RL finetuning

  • [ ] first successful fine-tuning of OA SFT model with trlx that shows significant reward improvements
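In RLHF setups of this kind (InstructGPT-style, as implemented in trlx), the quantity being maximized is typically not the raw RM score but the RM score minus a KL penalty that keeps the policy close to the SFT model. A scalar sketch of that shaped reward; `beta` is a hypothetical coefficient:

```python
def shaped_reward(rm_reward, logprob_policy, logprob_sft, beta=0.1):
    """RM score minus a per-token KL penalty against the SFT model.

    log p_policy - log p_sft is the standard per-token KL estimate;
    beta trades off reward against staying near the SFT distribution.
    """
    kl = logprob_policy - logprob_sft
    return rm_reward - beta * kl

print(shaped_reward(1.0, -2.0, -2.0))           # no divergence: 1.0
print(shaped_reward(1.0, -1.0, -2.0, beta=0.5))  # penalized: 0.5
```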

7. Storage

  • [ ] determine where to (temporarily) store trained models for evaluation

andreaskoepf avatar Mar 05 '23 18:03 andreaskoepf

> check options to tokenize numbers as single tokens per digit and run experiment to assess "math skills" (standard tokenization vs. per-digit tokenization)

I'd like to take this please

johnflux avatar Mar 05 '23 23:03 johnflux

For 3. Evaluation: the repo provides a benchmark with 23 datasets and has been tested against ChatGPT. It looks like a good framework and baseline for us to start with, and we can enrich the datasets and tasks.

kenhktsui avatar Mar 06 '23 04:03 kenhktsui

I'd be glad to take on or join the RL finetuning part. I guess there must be a leader for this task?

I actually can participate in most of the above tasks, though I prefer to contribute to the RL part.

totuta avatar Mar 08 '23 03:03 totuta

When will these be split up and assigned? @andreaskoepf ?

totuta avatar Mar 09 '23 21:03 totuta

I am closing this to reduce confusion since we are effectively following a very different - much simpler plan.

andreaskoepf avatar Mar 12 '23 21:03 andreaskoepf