alpaca-lora
Benchmarking and optimization tips
Not exactly an issue, but I've been trying to run one epoch of finetuning with llama-13b. On a 4090, it looks like it will take roughly 4 hours with the setting `MICRO_BATCH_SIZE = 2`.
However, the loss already converged to ~1 around epoch 0.12 (roughly 30 minutes into training), so it may not make sense to use epochs=3; a larger micro batch size could also be worth trying.
I could be wrong here. Happy to hear some feedback on how to better tune the parameters.
Also, it would be great if we could have 4-bit support by incorporating GPTQ #2
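For context on why a larger micro batch size speeds things up: in the training script, the effective batch size per optimizer step is held fixed, and the micro batch size only controls how many gradient-accumulation passes are needed to reach it. A minimal sketch of that relationship (the constant names follow the script's style; the specific values here are assumptions for illustration):

```python
BATCH_SIZE = 128        # effective batch size per optimizer step (assumed)
MICRO_BATCH_SIZE = 2    # examples per forward/backward pass; limited by VRAM

# The optimizer steps once every GRADIENT_ACCUMULATION_STEPS micro-batches,
# so raising MICRO_BATCH_SIZE reduces the number of passes per step (faster
# wall-clock training) without changing the effective batch size.
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
print(GRADIENT_ACCUMULATION_STEPS)  # → 64
```

So on a GPU with headroom, bumping `MICRO_BATCH_SIZE` from 2 to 4 would halve the accumulation steps while leaving the loss curve's dependence on batch size unchanged.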
With 256 tokens the loss slowly pulls further down to somewhere slightly above 0.8. You could maybe get away with using 2 epochs instead of 3, though.
Yeah, I definitely saw it drop below 0.75 somewhere between epochs 1 and 2. You could still achieve a pretty good loss with just one epoch, though. I was testing this in a hurry, so I'm just sharing this information here.
Did you get below 0.75 with the current hyperparams? I wasn't able to get under 0.8. Wondering what others are getting (I'm using an A100 40GB).
I probably wouldn't anchor too much on the specific loss numbers until we've refactored the training code to use validation sets.