Open-Assistant
ML Overview [temporary coordination issue, will be split up]
Action Plan for ML-Team
1. Data mixes
- [ ] create a list of all datasets under consideration for OA SFT; identify datasets that need further processing (e.g. multi-turn data that must be converted to the OA jsonl format); the list will be maintained as the OA SFT Dataset Quality & Data Mix sheet
- [ ] write loaders, make sure all datasets can be loaded
- [ ] generate dataset statistics (number of messages, number of turns per conversation)
- [ ] manually assess the quality (subjective opinion) of sampled subsets of the datasets
- [ ] determine the fraction of each dataset to be used for SFT (e.g. which languages, how many messages); goal: a balanced dataset
- [ ] prepare a two-stage training configuration: stage 1: wide dataset mix (including potentially lower-quality data); stage 2: fine-tuning on a smaller, high-quality dataset (i.e. the best data from OIG & OA only)
- [ ] test sampling and inspect batches
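The dataset-statistics step above can be sketched as a small helper. The `"conversation"` key and the per-message schema used here are hypothetical placeholders, since the exact OA jsonl layout is still being defined in the data-mix sheet:

```python
import json
from collections import Counter


def dataset_stats(jsonl_path):
    """Count messages and conversation lengths in an OA-style jsonl file.

    Assumes (hypothetically) that each line is a JSON object with a
    "conversation" list of {"role": ..., "text": ...} messages.
    Returns the total message count and a histogram of turns per conversation.
    """
    n_messages = 0
    turns_hist = Counter()
    with open(jsonl_path) as f:
        for line in f:
            conv = json.loads(line)["conversation"]
            n_messages += len(conv)
            turns_hist[len(conv)] += 1
    return n_messages, turns_hist
```

Running this over every candidate dataset would produce the message/turn numbers the quality sheet asks for.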
2. Tokenization
- [ ] end all assistant messages with the special `<User>` token (v2 format)
- [ ] check options to tokenize numbers as single tokens per digit and run an experiment to assess "math skills" (standard tokenization vs. per-digit tokenization)
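One cheap way to run the per-digit tokenization experiment is a preprocessing pass that separates adjacent digits before the standard tokenizer sees them, so every digit becomes its own token. This is only a sketch of the preprocessing idea, not the actual tokenizer configuration:

```python
import re


def split_digits(text):
    """Insert a space between adjacent digits so that a standard
    subword tokenizer emits one token per digit (per-digit sketch)."""
    return re.sub(r"(\d)(?=\d)", r"\1 ", text)
```

Training one model on raw text and one on `split_digits`-preprocessed text would give the standard vs. per-digit comparison the task describes.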
3. Evaluation
- [ ] create list of most useful benchmarks to run for our model (e.g. to measure "fine-tuning tax", loss of academic benchmark scores)
- [ ] create a script to run automatic benchmarks for our model
- [ ] implement one useful benchmark, e.g. 'wizard of wikipedia' or 'wizard of the internet'. (see https://github.com/LAION-AI/Open-Assistant/issues/1908)
- [ ] create an RM manual-evaluation script as a sanity check, similar to `sampling-report.py`
- [ ] adapt the `sampling-report.py` script to stop sampling on `<user>` tokens
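For the stop-on-`<user>` adaptation, the simplest string-level fallback is to truncate each sampled continuation at the first occurrence of the stop token before it goes into the report (a token-level stopping criterion in the generation loop would be the more efficient variant; this sketch only shows the post-hoc cut):

```python
def truncate_at_token(text, stop_token="<user>"):
    """Cut a sampled continuation at the first occurrence of the stop
    token so the report only contains the assistant's own reply."""
    idx = text.find(stop_token)
    return text if idx == -1 else text[:idx]
```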
4. Reward model
- [ ] review our reward model code: do we have an implementation that matches the loss in the InstructGPT paper? What are possible improvements?
- [ ] discuss options to incorporate additional OA data besides ranks (labels, emojis, deleted status) into the dataset for RM training
- [ ] run evaluation, e.g. what is the accuracy of the RM on the test set?
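For the InstructGPT-loss review above: the paper's ranking loss for a prompt with K ranked responses is the negative log-sigmoid of the reward margin, averaged over all K-choose-2 (better, worse) pairs. A minimal plain-Python reference to check our implementation against:

```python
import math
from itertools import combinations


def rm_pairwise_loss(rewards):
    """InstructGPT-style ranking loss for one prompt.

    `rewards` is a list of scalar RM outputs ordered best-to-worst.
    For every pair where response i is ranked above response j, the
    loss term is -log(sigmoid(r_i - r_j)); terms are averaged over
    all K-choose-2 pairs.
    """
    pairs = list(combinations(rewards, 2))
    total = 0.0
    for better, worse in pairs:
        total += -math.log(1.0 / (1.0 + math.exp(-(better - worse))))
    return total / len(pairs)
```

A well-separated pair gives a loss near zero; identical rewards give log 2 per pair, which is a handy sanity check for the test-set accuracy task.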
5. Synthetic assistant replies
- [ ] generate multiple diverse synthetic replies per prompt (e.g. for filtering with the reward model)
- [ ] filter reply candidates with the reward model and load them into the DB for ranking by humans
- [ ] write a script to compare actual user rankings with RM predictions
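The ranking-comparison script could report a Kendall-tau-style agreement between the human ordering and the RM scores: concordant pairs minus discordant pairs over all pairs, giving +1 for perfect agreement and -1 for a fully reversed ranking. A dependency-free sketch (the reply-id/score representation is an assumption):

```python
from itertools import combinations


def ranking_agreement(human_rank, rm_scores):
    """Kendall-tau-style agreement between a human ranking and RM scores.

    `human_rank` lists reply ids best-first; `rm_scores` maps reply id
    to the RM's scalar score. Returns a value in [-1, 1].
    """
    pairs = list(combinations(human_rank, 2))
    concordant = sum(1 for a, b in pairs if rm_scores[a] > rm_scores[b])
    discordant = sum(1 for a, b in pairs if rm_scores[a] < rm_scores[b])
    return (concordant - discordant) / len(pairs)
```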
6. RL finetuning
- [ ] first successful fine-tuning of the OA SFT model with trlx that shows significant reward improvements
7. Storage
- [ ] determine where to (temporarily) store trained models for evaluation
> check options to tokenize numbers as single tokens per digit and run an experiment to assess "math skills" (standard tokenization vs. per-digit tokenization)

I'd like to take this, please.
For 3. Evaluation
The repo provides a benchmark with 23 datasets and has already been tested against ChatGPT.
It looks like a good framework and baseline for us to start with, and we can enrich the datasets and tasks.
I'd be glad to take or join the RL fine-tuning part. I assume there is a leader for this task?
I can actually participate in most of the tasks above, though I'd prefer to contribute to the RL part.
When will these be split up and assigned? @andreaskoepf ?
I am closing this to reduce confusion since we are effectively following a very different - much simpler plan.