
[RFC] Experimental dynamic batching

Open SharkWipf opened this issue 1 year ago • 2 comments

I am a complete beginner to this, so my understanding may be completely wrong. This PR should enable dynamic batching, making CUTOFF_LEN a maximum rather than a hard limit. Enabling dynamic batching seems to significantly reduce training times, at the cost of losing input randomization (because the training data is grouped by length) and losing accurate ETAs. It also lets you run with, e.g., CUTOFF_LEN = 1024 (if you have enough VRAM for that; newer PyTorch seems to include memory optimizations), so you can fit the entire dataset without truncation and without the massive slowdowns you normally get from a high sequence length.

This PR does just the bare minimum to enable dynamic batching. I have yet to see the results of a fully trained model (it still takes hours to train, just fewer of them), and I don't even know whether the approach makes sense to begin with, but I figured I'd open a PR so others can share their thoughts.
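
Roughly, the change amounts to something like the sketch below (a sketch of the idea, not the exact diff; the tokenizer/model names and hyperparameters are assumed from the repo's defaults): tokenize without fixed-length padding, pad per batch in the collator, and let the Trainer group examples by length.

```python
# Sketch: CUTOFF_LEN becomes an upper bound, and padding happens per batch
# instead of every example being padded to the same fixed length.
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, TrainingArguments

CUTOFF_LEN = 1024  # maximum, no longer the padded length of every example

tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

def tokenize(prompt):
    # Truncate anything longer than CUTOFF_LEN, but do not pad here.
    result = tokenizer(prompt, truncation=True, max_length=CUTOFF_LEN, padding=False)
    result["labels"] = result["input_ids"].copy()
    return result

# Each batch is padded only to the longest sequence it contains.
data_collator = DataCollatorForSeq2Seq(tokenizer, padding=True, return_tensors="pt")

training_args = TrainingArguments(
    output_dir="lora-alpaca",
    group_by_length=True,  # batch similar-length sequences together
    # ...remaining hyperparameters as in finetune.py...
)
```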

SharkWipf avatar Mar 23 '23 18:03 SharkWipf

I noticed this change is explicitly incompatible with, and in fact pretty much the polar opposite of, #140. I don't understand the context of either change well enough to offer any meaningful insights beyond that.

All I can say is that the training currently running with this PR seems to be going about twice as fast as without it, while using a (much) larger sequence length. Comparing at the same sequence length I'm running (1024), the difference is roughly 35 hours versus 6 hours on my 3090. I don't yet know the impact on training quality, if any.

If both PRs add meaningful improvements that aren't at the cost of quality, it might be worth making them configurable, although that'd expand the complexity of this project quite a bit, given that it's currently a single static file for finetuning.

EDIT: Correction, #140 seems to be achieving largely the same goal, through different means. Probably better, too.

SharkWipf avatar Mar 23 '23 20:03 SharkWipf


I think both approaches are compatible.

If you have a batch of inputs with lengths [56, 34, 12, 58], my PR pads the batch to the smallest multiple of 8 that fits the longest input: 64. Instead of an input of 1024 tokens per sequence for this batch, you will have an input of size [64, 64, 64, 64], which is orders of magnitude faster.
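
A toy illustration of that padding arithmetic (my own sketch, not the code from #140):

```python
# Pad the batch to the smallest multiple of 8 that fits its longest sequence.
lengths = [56, 34, 12, 58]
padded_len = ((max(lengths) + 7) // 8) * 8  # ceil(58 / 8) * 8 == 64
print(padded_len)  # 64 -> a 4 x 64 batch tensor instead of 4 x 1024
```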

Dynamic batching will group the sequences in a batch by length. So instead of having very diverse lengths such as [56, 34, 12, 58], you will have very similar lengths, which reduces the number of PAD tokens required. We can combine both PRs. In any case, I will add dynamic batching as a feature that can be turned on and off, because it reduces the randomness of the input order.
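
A hedged sketch of how the two could be combined with the transformers Trainer (here `tokenizer`, `model`, and `train_data` are assumed to come from the existing finetune.py setup):

```python
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

# Pad each batch to a multiple of 8 (the #140 idea)...
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    padding=True,          # pad to the longest sequence in the batch...
    pad_to_multiple_of=8,  # ...rounded up to a multiple of 8
    return_tensors="pt",
)

trainer = Trainer(
    model=model,  # the LoRA-wrapped model from finetune.py (assumed)
    args=TrainingArguments(
        output_dir="lora-alpaca",
        group_by_length=True,  # ...plus length-grouped batches (this PR)
        # ...other hyperparameters unchanged...
    ),
    train_dataset=train_data,
    data_collator=data_collator,
)
```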

ikergarcia1996 avatar Mar 23 '23 22:03 ikergarcia1996

Merged in #146

tloen avatar Mar 24 '23 19:03 tloen