Group data by length to reduce wasted computation
Informed by this post, I implemented the group_by_length feature in the finetuning scripts and found that, on a single V100 GPU running finetune/lora.py with LLaMA-7B and the same hyperparameters:
- Without group_by_length: 8 hours, 37 minutes, 1.353 seconds
- With group_by_length: 6 hours, 15 minutes, 24.636 seconds
So the time saving is about 27%. I can submit a PR if anyone is interested.
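For anyone unfamiliar with the trick, here is a minimal sketch of the general idea (not the code from the PR): shuffle the sample indices, split them into large chunks, sort each chunk by sequence length, and cut the chunks into batches, so each batch pads to roughly the same length. The function name `length_grouped_batches` and the `megabatch_mult` parameter are just illustrative choices here.

```python
import random

def length_grouped_batches(lengths, batch_size, megabatch_mult=50, seed=0):
    """Yield batches of sample indices with similar lengths to reduce padding.

    Indices are shuffled, split into large "megabatches", each megabatch is
    sorted by length, and the result is cut into batches. Shuffling both the
    indices and the final batch order keeps some randomness so the training
    order is not fully determined by length.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)

    megabatch_size = batch_size * megabatch_mult
    batches = []
    for start in range(0, len(indices), megabatch_size):
        megabatch = indices[start:start + megabatch_size]
        # Sort within the megabatch so neighboring samples have similar lengths.
        megabatch.sort(key=lambda i: lengths[i])
        for b in range(0, len(megabatch), batch_size):
            batches.append(megabatch[b:b + batch_size])

    # Shuffle the batch order so lengths don't increase monotonically over an epoch.
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    # Toy example: short and long sequences end up batched together by length.
    lengths = [12, 300, 45, 290, 33, 310, 60, 15]
    for batch in length_grouped_batches(lengths, batch_size=2, megabatch_mult=2):
        print(batch, [lengths[i] for i in batch])
```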
That's pretty cool! What do you think @lantiga & @carmocca ?
It's a well-known trick for training with variable-length sequences. Sometimes it can affect the loss because, depending on your data, it can break the i.i.d. assumption of ML training. @nuance1979 Do you reach the same loss with both runs? How do the loss curves compare?
I think we could add this technique, but it should be done in a way that's easy to enable or disable. It would be useful to see your code, so feel free to open the PR! Thank you
I did not evaluate the fine-tuned models carefully, but the losses are in the same ballpark and the loss curves have essentially the same shape. In general, whether it makes a difference depends on the downstream tasks.
Sure. I made it optional, defaulting to off. See #398.