
Group data by length to reduce wasted computation

nuance1979 opened this issue 2 years ago · 3 comments

Informed by this post, I implemented a group_by_length feature in the finetuning scripts. Fine-tuning LLaMA-7B with finetune/lora.py on a single V100 GPU, using identical hyperparameters, I measured:

  • Without group_by_length: 8 hours, 37 minutes, 1.353 seconds
  • With group_by_length: 6 hours, 15 minutes, 24.636 seconds

So the time saving is about 27%. I can submit a PR if anyone is interested.

nuance1979 commented Jun 18 '23 03:06
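For context, the trick is to put samples of similar token length into the same batch so that little compute is spent on padding. Below is a minimal sketch of one common way to build such an ordering; the helper name, the `mega_batch_mult` parameter, and the shuffling scheme are illustrative assumptions, not the code from the PR.

```python
import random
from typing import List, Sequence


def length_grouped_indices(
    lengths: Sequence[int], batch_size: int, mega_batch_mult: int = 50
) -> List[int]:
    """Return dataset indices ordered so consecutive batches hold similar-length samples."""
    indices = list(range(len(lengths)))
    random.shuffle(indices)  # keep epoch-level randomness
    mega = batch_size * mega_batch_mult
    grouped: List[int] = []
    for start in range(0, len(indices), mega):
        chunk = indices[start : start + mega]
        # Sort each shuffled "mega-batch" by length so every batch drawn from it
        # is nearly uniform in length, minimizing padding to the longest sample.
        chunk.sort(key=lambda i: lengths[i], reverse=True)
        grouped.extend(chunk)
    return grouped


# Example: draw batches from the grouped order instead of a fully random permutation.
lengths = [len(seq) for seq in [[1, 2], [1, 2, 3, 4, 5], [1], [1, 2, 3], [1, 2, 3, 4]]]
order = length_grouped_indices(lengths, batch_size=2, mega_batch_mult=2)
batches = [order[i : i + 2] for i in range(0, len(order), 2)]
```

An ordering like this keeps some randomness (batches still come from a shuffled pool) while bounding the length spread inside each batch, which is where the observed time saving comes from.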

That's pretty cool! What do you think @lantiga & @carmocca ?

rasbt commented Jun 18 '23 15:06

It's a well-known trick for training with variable-length sequences. Depending on your data, it can affect the loss because it weakens the i.i.d. assumption behind minibatch sampling. @nuance1979 Do you reach the same loss with both runs? How do the loss curves compare?

I think we could add this technique, but it should be done in a way that's easy to enable or disable. It would be useful to see your code, so feel free to open the PR! Thank you

carmocca commented Jun 19 '23 10:06

I did not evaluate the fine-tuned models carefully, but the losses are in the same ballpark and the loss curves have essentially the same shape. In general, whether it makes a difference depends on the downstream task.

Sure. I made it optional, with the default set to off. See #398.

nuance1979 commented Jun 19 '23 17:06