LoRA: `zero_pad` speed improvements
While experimenting with GitHub Actions I deleted my fork (I know, I know), and thus all open PRs were automatically closed. This PR mirrors #630.
Hi there 👋
This PR is a result of #461.
In that issue I found out that the creation of a new tensor from `lora_ind` (which is stored as a Python list on the CPU) on each `zero_pad` call ...
https://github.com/Lightning-AI/lit-gpt/blob/807c7bc17413d53961f96dc668aa03c0b970a43f/lit_gpt/lora.py#L293-L295
... implicitly calls `cudaStreamSynchronize` every time, which slows down the forward pass a bit.
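To make the problem concrete, here is a minimal, hypothetical sketch of the pattern (names like `LoRAQKVSketch`, `zero_pad_slow`, and `zero_pad_fast` are made up for illustration and are not the exact code in `lora.py`): the slow variant rebuilds the index tensor from a Python list on every call, while the fast variant materializes it once, e.g. as a non-persistent buffer, so it already lives on the right device. The actual change in this PR may differ in the details.

```python
import torch
import torch.nn as nn


class LoRAQKVSketch(nn.Module):
    """Toy stand-in for the LoRA layer; names and shapes are illustrative only."""

    def __init__(self, out_features: int, enabled_indices: list) -> None:
        super().__init__()
        self.out_features = out_features
        # Slow path: keep the indices as a plain Python list on the CPU.
        self.lora_ind_list = list(enabled_indices)
        # Fast path: build the index tensor once; as a non-persistent buffer it
        # follows the module to the GPU and stays out of the state_dict.
        self.register_buffer(
            "lora_ind", torch.tensor(enabled_indices, dtype=torch.long), persistent=False
        )

    def zero_pad_slow(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., len(enabled_indices)) -> result: (..., out_features)
        result = x.new_zeros(*x.shape[:-1], self.out_features)
        # Creating a CUDA tensor from a CPU list on every call triggers an
        # implicit cudaStreamSynchronize.
        ind = torch.tensor(self.lora_ind_list, device=x.device)
        return result.index_copy_(x.dim() - 1, ind, x)

    def zero_pad_fast(self, x: torch.Tensor) -> torch.Tensor:
        result = x.new_zeros(*x.shape[:-1], self.out_features)
        # The cached index tensor is already on x's device: no copy, no sync.
        return result.index_copy_(x.dim() - 1, self.lora_ind, x)
```

Both variants produce the same output, e.g. `torch.equal(layer.zero_pad_slow(x), layer.zero_pad_fast(x))` holds for `layer = LoRAQKVSketch(8, [0, 1, 4, 5]).cuda()` and `x = torch.randn(2, 4, device="cuda")`; only the slow one pays for a host-to-device copy on every forward pass.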
### Traces
> [!NOTE]
> Numbers are provided for the Nvidia T4 GPU and `16-mixed` precision.
Let's take a look at the traces for `Pythia-410m`.
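For reference, a trace like the ones below can be captured with `torch.profiler`; this is only a sketch, where `model` and `batch` stand in for the actual lit-gpt model and an input batch already on the GPU (both are assumptions here).

```python
import torch
from torch.profiler import ProfilerActivity, profile


def capture_forward_trace(model: torch.nn.Module, batch: torch.Tensor, path: str = "trace.json") -> None:
    """Run one forward pass under the profiler and export a Chrome trace."""
    model.eval()
    with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        model(batch)
        torch.cuda.synchronize()
    # Open the exported file in chrome://tracing or Perfetto; that is where
    # calls such as zero_pad and cudaStreamSynchronize become visible.
    prof.export_chrome_trace(path)
```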
Currently, `zero_pad` takes a significant part of the time:
> [!NOTE]
> Compare the size of `cudaStreamSynchronize` in the screenshot above (CUDA 12.1) with the one from the "Performance Study" issue (CUDA 11.8): it's much smaller thanks to the newer CUDA version.
After the code is optimized, the trace shows that `zero_pad` takes a much smaller portion of the time:
In numbers, it's 830 μs vs. 126 μs.
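These numbers come from the traces above; as a rough cross-check, the two variants could also be compared with `torch.utils.benchmark`. Again, this is a hypothetical snippet that reuses the made-up `LoRAQKVSketch` class from the earlier sketch, with arbitrary shapes.

```python
import torch
from torch.utils.benchmark import Timer

# Assumes LoRAQKVSketch from the sketch above is in scope; sizes are arbitrary.
layer = LoRAQKVSketch(out_features=4096, enabled_indices=list(range(2048))).cuda()
x = torch.randn(8, 2048, device="cuda")

for name in ("zero_pad_slow", "zero_pad_fast"):
    timer = Timer(stmt=f"layer.{name}(x)", globals={"layer": layer, "x": x})
    # torch.utils.benchmark handles CUDA synchronization, so the implicit
    # cudaStreamSynchronize in the slow variant is included in its timing.
    print(name, timer.timeit(100))
```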
### LoRA fine-tuning
Comparing LoRA fine-tuning runs of `Pythia-70m` and `Pythia-410m` over 1k iterations, we have:
Model | Loss $_{control}$ | Loss $_{test}$ | Time $_{control}$ | Time $_{test}$ |
---|---|---|---|---|
Pythia-70m | 2.5835 | 2.5802 | 30.90 | 28.51 |
Pythia-410m | 1.7976 | 1.7976 | 124.63 | 114.51 |
Not a drastic difference, but still a nice optimization.