peft
Seeking help: a more efficient way to use caching to train my model
When I was fine-tuning a Llama 2 model with LoRA, I came across a problem.
The instruction dataset goes something like this:
"Here's the background to the problem... (1000 identical words)... Now answer the questions in context... (Different question, about 100 words)..." .
Each piece of data has the same very long prefix.
As we know, during inference we can pre-compute and cache the KV cache, then pass past_key_values into generation to speed it up,
like this:
# Run everything except the last prompt token once to build the KV cache.
part0 = {}
for k, v in inputs.items():
    part0[k] = v[:, :-1]
output_part0 = model(**part0)

# Generation then only needs to process the remaining token(s).
outputs = model.generate(
    **inputs, past_key_values=output_part0.past_key_values, max_new_tokens=5
)
print(tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:]))
I figure there must be a more efficient way to use caching to train my model. Can anyone give me a suggestion? Thanks a lot.
Hi @whyiug, thanks for the issue! IIUC, caching is effective for inference, not for training: if you pre-compute the KV cache offline for training, how can you propagate the gradients into it?
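(A minimal PyTorch sketch of this point, not from the original thread: values produced outside the autograd graph, such as a KV cache pre-computed under no_grad, receive no gradients.)

import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    cached = x * 2              # "pre-computed offline": no graph is recorded here

loss = cached.sum() + x.sum()   # only the x.sum() term is differentiable w.r.t. x
loss.backward()
print(x.grad)                   # tensor([1., 1., 1.]): nothing flows back through `cached`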
@younesbelkada Thanks for your reply. My training method is LoRA, where all the linear layers in the base model are frozen; the shared prefix of my training inputs is not trainable, yet it is re-computed in every forward pass during training. A rough sketch of what I mean follows below.
Is this a misunderstanding of LoRA and backpropagation on my part? Or maybe people just don't have a need for this. @younesbelkada, thanks for your advice.
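(A rough sketch of the idea, not an existing PEFT feature. `shared_prefix` and `question_text` are placeholder variables, and it assumes the LoRA adapters are not attached to the k/v projections; otherwise the cached prefix would go stale after every optimizer step. Depending on the transformers version, the cache object may also be updated in place by the forward pass, so in practice it would need to be copied per step.)

import torch

# Pre-compute the KV cache of the identical 1000-word prefix once, with
# gradients disabled, then reuse it for samples that share that prefix.
prefix_ids = tokenizer(shared_prefix, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    prefix_out = model(input_ids=prefix_ids, use_cache=True)
prefix_cache = prefix_out.past_key_values   # frozen, carries no gradient

# One training step on the suffix (question + answer) that follows the prefix.
suffix = tokenizer(question_text, return_tensors="pt").to(model.device)
attention_mask = torch.ones(
    (1, prefix_ids.shape[1] + suffix.input_ids.shape[1]), device=model.device
)  # the mask must cover prefix + suffix positions

out = model(
    input_ids=suffix.input_ids,
    attention_mask=attention_mask,
    past_key_values=prefix_cache,
    labels=suffix.input_ids,
)
out.loss.backward()             # gradients reach the LoRA weights via the suffix only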
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.