Group data by length to reduce wasted computation
Informed by this post, I implemented the group_by_length feature in the finetuning scripts and found that, on a single V100 GPU running finetune/lora.py with LLaMA-7B and the same hyperparameters:
- Without group_by_length: 8 hours, 37 minutes, 1.353 seconds
- With group_by_length: 6 hours, 15 minutes, 24.636 seconds
So the time saving is about 27%. I can submit a PR if anyone is interested.
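For anyone unfamiliar with the trick, here is a minimal sketch of the general idea (not the code from the PR): shuffle the sample indices, split them into large chunks, sort each chunk by sequence length, and cut the chunks into batches, so each batch pads to roughly the same length. The function name `length_grouped_batches` and the `megabatch_mult` parameter are just illustrative choices here.

```python
import random

def length_grouped_batches(lengths, batch_size, megabatch_mult=50, seed=0):
    """Yield batches of sample indices with similar lengths to reduce padding.

    Indices are shuffled, split into large "megabatches", each megabatch is
    sorted by length, and the result is cut into batches. Shuffling both the
    indices and the final batch order keeps some randomness so the training
    order is not fully determined by length.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)

    megabatch_size = batch_size * megabatch_mult
    batches = []
    for start in range(0, len(indices), megabatch_size):
        megabatch = indices[start:start + megabatch_size]
        # Sort within the megabatch so neighboring samples have similar lengths.
        megabatch.sort(key=lambda i: lengths[i])
        for b in range(0, len(megabatch), batch_size):
            batches.append(megabatch[b:b + batch_size])

    # Shuffle the batch order so lengths don't increase monotonically over an epoch.
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    # Toy example: short and long sequences end up batched together by length.
    lengths = [12, 300, 45, 290, 33, 310, 60, 15]
    for batch in length_grouped_batches(lengths, batch_size=2, megabatch_mult=2):
        print(batch, [lengths[i] for i in batch])
```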
That's pretty cool! What do you think @lantiga & @carmocca ?
It's a well-known trick for training with variable-length sequences. Sometimes it can affect the loss because, depending on your data, it can break the i.i.d. assumption of ML training. @nuance1979 Do you reach the same loss with both runs? How do the loss curves compare?
I think we could add this technique, but it should be done in a way that's easy to enable or disable. It would be useful to see your code, so feel free to open the PR! Thank you
I did not evaluate the fine-tuned models carefully, but the losses are in the same ballpark and the loss curves have essentially the same shape. In general, whether it makes a difference depends on the downstream tasks.
Sure. I made it optional, defaulting to off. See #398.