Open-Assistant
ML Overview [temporary coordination issue, will be split up]
Action Plan for ML-Team
1. Data mixes
- [ ] create a list of all datasets under consideration for OA SFT; identify datasets that need further processing (e.g. multi-turn data that must be converted to the OA jsonl format); the list will be maintained as the OA SFT Dataset Quality & Data Mix sheet
- [ ] write loaders, make sure all datasets can be loaded
- [ ] generate dataset statistics (number of messages, number of turns per conversation)
- [ ] manually assess the quality (subjective opinion) of sampled subsets of the datasets
- [ ] determine the fraction of each dataset to be used for SFT (e.g. which languages, how many messages); goal: a balanced dataset
- [ ] prepare a two-stage training configuration: stage 1: wide dataset mix (including potentially lower-quality data); stage 2: fine-tuning on a smaller, high-quality dataset (i.e. the best data from OIG & OA only)
- [ ] test sampling and inspect batches
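The dataset-statistics step above can be sketched as a small helper. The `"conversation"` key and the per-message schema used here are hypothetical placeholders, since the exact OA jsonl layout is still being defined in the data-mix sheet:

```python
import json
from collections import Counter


def dataset_stats(jsonl_path):
    """Count messages and conversation lengths in an OA-style jsonl file.

    Assumes (hypothetically) that each line is a JSON object with a
    "conversation" list of {"role": ..., "text": ...} messages.
    Returns the total message count and a histogram of turns per conversation.
    """
    n_messages = 0
    turns_hist = Counter()
    with open(jsonl_path) as f:
        for line in f:
            conv = json.loads(line)["conversation"]
            n_messages += len(conv)
            turns_hist[len(conv)] += 1
    return n_messages, turns_hist
```

Running this over every candidate dataset would produce the message/turn numbers the quality sheet asks for.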
2. Tokenization
- [ ] end all assistant messages with the special `<User>` token (v2 format)
- [ ] check options to tokenize numbers as single tokens per digit and run an experiment to assess "math skills" (standard tokenization vs. per-digit tokenization)
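One cheap way to run the per-digit tokenization experiment is a preprocessing pass that separates adjacent digits before the standard tokenizer sees them, so every digit becomes its own token. This is only a sketch of the preprocessing idea, not the actual tokenizer configuration:

```python
import re


def split_digits(text):
    """Insert a space between adjacent digits so that a standard
    subword tokenizer emits one token per digit (per-digit sketch)."""
    return re.sub(r"(\d)(?=\d)", r"\1 ", text)
```

Training one model on raw text and one on `split_digits`-preprocessed text would give the standard vs. per-digit comparison the task describes.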
3. Evaluation
- [ ] create list of most useful benchmarks to run for our model (e.g. to measure "fine-tuning tax", loss of academic benchmark scores)
- [ ] create a script to run automatic benchmarks for our model
- [ ] implement one useful benchmark, e.g. 'wizard of wikipedia' or 'wizard of the internet'. (see https://github.com/LAION-AI/Open-Assistant/issues/1908)
- [ ] create an RM manual-evaluation script as a sanity check, similar to `sampling-report.py`
- [ ] adapt the `sampling-report.py` script to stop sampling on `<user>` tokens
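For the stop-on-`<user>` adaptation, the simplest string-level fallback is to truncate each sampled continuation at the first occurrence of the stop token before it goes into the report (a token-level stopping criterion in the generation loop would be the more efficient variant; this sketch only shows the post-hoc cut):

```python
def truncate_at_token(text, stop_token="<user>"):
    """Cut a sampled continuation at the first occurrence of the stop
    token so the report only contains the assistant's own reply."""
    idx = text.find(stop_token)
    return text if idx == -1 else text[:idx]
```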
4. Reward model
- [ ] review our reward model code: do we have an implementation that matches the loss in the InstructGPT paper? What are possible improvements?
- [ ] discuss options to incorporate additional OA data besides ranks (labels, emojis, deleted status) into the dataset for RM training
- [ ] run evaluation, e.g. what is the accuracy of the RM on the test set?
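For the InstructGPT-loss review above: the paper's ranking loss for a prompt with K ranked responses is the negative log-sigmoid of the reward margin, averaged over all K-choose-2 (better, worse) pairs. A minimal plain-Python reference to check our implementation against:

```python
import math
from itertools import combinations


def rm_pairwise_loss(rewards):
    """InstructGPT-style ranking loss for one prompt.

    `rewards` is a list of scalar RM outputs ordered best-to-worst.
    For every pair where response i is ranked above response j, the
    loss term is -log(sigmoid(r_i - r_j)); terms are averaged over
    all K-choose-2 pairs.
    """
    pairs = list(combinations(rewards, 2))
    total = 0.0
    for better, worse in pairs:
        total += -math.log(1.0 / (1.0 + math.exp(-(better - worse))))
    return total / len(pairs)
```

A well-separated pair gives a loss near zero; identical rewards give log 2 per pair, which is a handy sanity check for the test-set accuracy task.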
5. Synthetic assistant replies
- [ ] generate multiple diverse synthetic replies per prompt (e.g. for filtering with the reward model)
- [ ] filter reply candidates with the reward model and load them into the DB for ranking by humans
- [ ] write a script to compare actual user rankings with RM predictions
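The ranking-comparison script could report a Kendall-tau-style agreement between the human ordering and the RM scores: concordant pairs minus discordant pairs over all pairs, giving +1 for perfect agreement and -1 for a fully reversed ranking. A dependency-free sketch (the reply-id/score representation is an assumption):

```python
from itertools import combinations


def ranking_agreement(human_rank, rm_scores):
    """Kendall-tau-style agreement between a human ranking and RM scores.

    `human_rank` lists reply ids best-first; `rm_scores` maps reply id
    to the RM's scalar score. Returns a value in [-1, 1].
    """
    pairs = list(combinations(human_rank, 2))
    concordant = sum(1 for a, b in pairs if rm_scores[a] > rm_scores[b])
    discordant = sum(1 for a, b in pairs if rm_scores[a] < rm_scores[b])
    return (concordant - discordant) / len(pairs)
```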
6. RL finetuning
- [ ] first successful fine-tuning of the OA SFT model with trlx that shows significant reward improvements
7. Storage
- [ ] determine where to (temporarily) store trained models for evaluation
> check options to tokenize numbers as single tokens per digit and run an experiment to assess "math skills" (standard tokenization vs. per-digit tokenization)

I'd like to take this, please.
For 3. Evaluation
The repo provides a benchmark with 23 datasets and has already been tested against ChatGPT.
It looks like a good framework and baseline for us to start with, and we can enrich the datasets and tasks.
I'd be glad to take or join the RL fine-tuning part. I assume there is a leader for this task?
I can actually participate in most of the tasks above, though I'd prefer to contribute to the RL part.
When will these be split up and assigned? @andreaskoepf ?
I am closing this to reduce confusion since we are effectively following a very different - much simpler plan.