nebuly
Give indication on the size of the dataset needed for fine-tuning the model
Description
One of the biggest difficulties when selecting and cleaning data for training is estimating the correct amount of data needed to train the model.
ChatLLaMA training, and RLHF in general, are still early technologies that have not been deeply studied in the literature. We should implement a function that serves as a rule of thumb for estimating the amount of data needed, given the model size.
We can derive such a rule from the scaling-law papers combined with OpenAI’s InstructGPT paper.
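A minimal sketch of what such a rule of thumb could look like. The function name, the ~20 tokens-per-parameter ratio (taken from the Chinchilla compute-optimal heuristic for pretraining), and the fine-tuning scaling fraction are all assumptions for illustration, not values from this project; the constants are exactly what the validation task below should check.

```python
def estimate_finetuning_tokens(
    n_params: int,
    tokens_per_param: float = 20.0,   # assumption: Chinchilla-style pretraining ratio
    finetune_fraction: float = 0.01,  # assumption: fine-tuning needs ~1% of that
) -> int:
    """Rule-of-thumb estimate of fine-tuning tokens for a model of n_params parameters.

    Scales the compute-optimal pretraining token count down by finetune_fraction,
    on the (unvalidated) assumption that fine-tuning/RLHF needs far less data
    than pretraining, as suggested by InstructGPT's comparatively small datasets.
    """
    return round(n_params * tokens_per_param * finetune_fraction)


def tokens_to_examples(n_tokens: int, avg_tokens_per_example: int = 512) -> int:
    """Convert a token budget into an approximate number of training examples."""
    return n_tokens // avg_tokens_per_example
```

Usage: for a hypothetical 7B-parameter model, `estimate_finetuning_tokens(7_000_000_000)` gives roughly 1.4B tokens under these placeholder constants; `tokens_to_examples` then turns that into an example count for dataset planning.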
TODO
- [ ] Implement a rule-of-thumb for estimating the data needed
- [ ] Validate the assumption on a small model