composable-sft
Language SFT training
For language SFT training for English, did you use the entire Wikipedia or just a subset of it?
We used the entire Wikipedia, but the length of training was less than a full epoch, so in a sense we used a randomly selected subset.
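For reference, a minimal sketch of that setup, assuming a plain Hugging Face `Trainer` MLM run rather than this repo's actual training script: the full English Wikipedia is loaded and shuffled, and `max_steps` caps training well short of one epoch, so the examples actually seen form a random subset. The dataset version, model, and hyperparameters below are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Full English Wikipedia; shuffling means the steps seen before max_steps
# amount to a random subset of the corpus.
wiki = load_dataset("wikipedia", "20220301.en", split="train").shuffle(seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="en-mlm-sft",
    per_device_train_batch_size=32,
    max_steps=100_000,  # illustrative cap: stops training before a full epoch
    save_steps=10_000,
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```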
Was there any rationale behind stopping early, such as an MLM accuracy criterion, or was the cutoff arbitrary? I am asking because I want to know how much training data is enough for MLM, especially for high-resource languages like English.