Give an indication of the dataset size needed for fine-tuning the model

Open diegofiori opened this issue 1 year ago • 0 comments

Description

One of the biggest difficulties when selecting and cleaning training data is estimating the correct amount of data needed to train the model.

ChatLLaMA training, and RLHF in general, are still early-stage techniques that the literature has not studied in depth. We should implement a function that serves as a “rule of thumb” for estimating the amount of data needed from the model size.

We can derive the rule by combining the scaling-law papers with OpenAI’s InstructGPT paper.
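As a starting point for discussion, here is a minimal sketch of what such a function could look like, assuming the rule takes the form of a power law anchored at one known reference point. The function name `estimate_num_samples`, its parameters, and the power-law form itself are all assumptions, not something taken from the papers or from the ChatLLaMA code. The reference point and the exponent would have to be fitted from the papers and checked on a small model.

```python
def estimate_num_samples(
    n_params: float,
    ref_params: float,
    ref_samples: float,
    exponent: float = 1.0,
) -> int:
    """Rule-of-thumb estimate of the number of training samples.

    Hypothetical sketch: scales a known reference point
    (ref_params, ref_samples) to a model with n_params parameters,
    assuming samples grow as (n_params / ref_params) ** exponent.
    The exponent is a placeholder to be fitted, not a published value.
    """
    if n_params <= 0 or ref_params <= 0 or ref_samples <= 0:
        raise ValueError("n_params, ref_params and ref_samples must be positive")
    scale = (n_params / ref_params) ** exponent
    return round(ref_samples * scale)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not taken from any paper.
    print(estimate_num_samples(n_params=2e9, ref_params=1e9, ref_samples=10_000))
```

With `exponent=1.0` the estimate grows linearly with model size, and with `exponent=0.0` it stays at the reference value. Which regime is closer to reality for RLHF is exactly what the validation step needs to determine.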

TODO

  • [ ] Implement a rule-of-thumb for estimating the data needed
  • [ ] Validate the assumption on a small model

diegofiori avatar Mar 08 '23 13:03 diegofiori