LoRA: `zero_pad` speed improvements
While experimenting with GitHub Actions I deleted my fork (I know, I know), and thus all open PRs were automatically closed. This PR mirrors #630.
Hi there 👋
This PR is a result of #461.
In that issue I found out that the creation of a new tensor from `lora_ind` (which is stored as a Python list on the CPU) on each `zero_pad` call ...
https://github.com/Lightning-AI/lit-gpt/blob/807c7bc17413d53961f96dc668aa03c0b970a43f/lit_gpt/lora.py#L293-L295
... implicitly calls `cudaStreamSynchronize` every time, which slows down the forward pass a bit.
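To make the problem concrete, here is a minimal, hypothetical sketch of the pattern (names like `LoRAQKVSketch`, `zero_pad_slow`, and `zero_pad_fast` are made up for illustration and are not the exact code in `lora.py`): the slow variant rebuilds the index tensor from a Python list on every call, while the fast variant materializes it once, e.g. as a non-persistent buffer, so it already lives on the right device. The actual change in this PR may differ in the details.

```python
import torch
import torch.nn as nn


class LoRAQKVSketch(nn.Module):
    """Toy stand-in for the LoRA layer; names and shapes are illustrative only."""

    def __init__(self, out_features: int, enabled_indices: list) -> None:
        super().__init__()
        self.out_features = out_features
        # Slow path: keep the indices as a plain Python list on the CPU.
        self.lora_ind_list = list(enabled_indices)
        # Fast path: build the index tensor once; as a non-persistent buffer it
        # follows the module to the GPU and stays out of the state_dict.
        self.register_buffer(
            "lora_ind", torch.tensor(enabled_indices, dtype=torch.long), persistent=False
        )

    def zero_pad_slow(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., len(enabled_indices)) -> result: (..., out_features)
        result = x.new_zeros(*x.shape[:-1], self.out_features)
        # Creating a CUDA tensor from a CPU list on every call triggers an
        # implicit cudaStreamSynchronize.
        ind = torch.tensor(self.lora_ind_list, device=x.device)
        return result.index_copy_(x.dim() - 1, ind, x)

    def zero_pad_fast(self, x: torch.Tensor) -> torch.Tensor:
        result = x.new_zeros(*x.shape[:-1], self.out_features)
        # The cached index tensor is already on x's device: no copy, no sync.
        return result.index_copy_(x.dim() - 1, self.lora_ind, x)
```

Both variants produce the same output, e.g. `torch.equal(layer.zero_pad_slow(x), layer.zero_pad_fast(x))` holds for `layer = LoRAQKVSketch(8, [0, 1, 4, 5]).cuda()` and `x = torch.randn(2, 4, device="cuda")`; only the slow one pays for a host-to-device copy on every forward pass.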
### Traces
> [!NOTE]
> Numbers are provided for the Nvidia T4 GPU and `16-mixed` precision.
Let's take a look at the traces for `Pythia-410m`.
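For reference, a trace like the ones below can be captured with `torch.profiler`; this is only a sketch, where `model` and `batch` stand in for the actual lit-gpt model and an input batch already on the GPU (both are assumptions here).

```python
import torch
from torch.profiler import ProfilerActivity, profile


def capture_forward_trace(model: torch.nn.Module, batch: torch.Tensor, path: str = "trace.json") -> None:
    """Run one forward pass under the profiler and export a Chrome trace."""
    model.eval()
    with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        model(batch)
        torch.cuda.synchronize()
    # Open the exported file in chrome://tracing or Perfetto; that is where
    # calls such as zero_pad and cudaStreamSynchronize become visible.
    prof.export_chrome_trace(path)
```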
Currently, `zero_pad` takes a significant part of the time:
> [!NOTE]
> Compare the size of `cudaStreamSynchronize` in the screenshot above (CUDA 12.1) with the one from the "Performance Study" issue (CUDA 11.8): it's much smaller thanks to the newer CUDA version.
After the code is optimized, the trace shows that `zero_pad` takes a much smaller portion of the time:
In numbers, it's 830 μs vs. 126 μs.
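These numbers come from the traces above; as a rough cross-check, the two variants could also be compared with `torch.utils.benchmark`. Again, this is a hypothetical snippet that reuses the made-up `LoRAQKVSketch` class from the earlier sketch, with arbitrary shapes.

```python
import torch
from torch.utils.benchmark import Timer

# Assumes LoRAQKVSketch from the sketch above is in scope; sizes are arbitrary.
layer = LoRAQKVSketch(out_features=4096, enabled_indices=list(range(2048))).cuda()
x = torch.randn(8, 2048, device="cuda")

for name in ("zero_pad_slow", "zero_pad_fast"):
    timer = Timer(stmt=f"layer.{name}(x)", globals={"layer": layer, "x": x})
    # torch.utils.benchmark handles CUDA synchronization, so the implicit
    # cudaStreamSynchronize in the slow variant is included in its timing.
    print(name, timer.timeit(100))
```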
### LoRA fine-tuning
Comparing LoRA fine-tuning runs of `Pythia-70m` and `Pythia-410m` over 1k iterations, we have:
Model | Loss $_{control}$ | Loss $_{test}$ | Time $_{control}$ | Time $_{test}$ |
---|---|---|---|---|
Pythia-70m | 2.5835 | 2.5802 | 30.90 | 28.51 |
Pythia-410m | 1.7976 | 1.7976 | 124.63 | 114.51 |
Not a drastic difference, but still a nice optimization.