rlhf-book
Chapter Plans
Here is a rough outline of what I would like to see in the book, and who will be writing it.
Introductions & History
- Introduction
- Economics, psychology, and philosophy of preference: VNM theory, Bradley-Terry, impossibility theorems, social choice, etc.
- Optimal control, deep RL, ML, etc.
- RLHF for LLMs literature (pre-ChatGPT work), maybe summarize InstructGPT
Links:
- https://arxiv.org/abs/2310.13595
Problem Specification
- Definitions, basic stuff, math
- Preference data collection
- Preference model training
- KL constraints and other penalties
Policy Optimization
- IFT / SFT / Chat Templates
- Rejection Sampling / Best of N
- PPO, REINFORCE, Policy Gradient
- DPO (Eric, Archit, Rafael)
- Other variants (short)
Advanced (optional)
- CAI (Constitutional AI)
- Synthetic vs human data
- Evaluation
Open Questions (TBD / optional)
- Reward model over-optimization