
Seeking help for a more efficient way to use caching to train my model

Open whyiug opened this issue 1 year ago • 4 comments

When I was fine-tuning a llama2 model with LoRA, I ran into a problem. Each example in my instruction dataset looks something like this: "Here's the background to the problem... (1000 identical words)... Now answer the question in context... (a different question, about 100 words)...". Every example shares the same very long prefix.

As we know, during inference we can pre-compute and cache the KV cache and then pass past_key_values to speed up generation, like this:


# model, tokenizer and the tokenized inputs dict are assumed to be defined above.
import torch

# Forward pass over everything except the last token to build the KV cache.
part0 = {k: v[:, :-1] for k, v in inputs.items()}
with torch.no_grad():
    output_part0 = model(**part0, use_cache=True)

# Generation then reuses the cached keys/values instead of recomputing the prefix.
outputs = model.generate(
    **inputs, past_key_values=output_part0.past_key_values, max_new_tokens=5
)
print(tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:]))

I figure there must be a more efficient way to use caching to train my model. Can anyone give me a suggestion? Thanks a lot.
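
Concretely, here is a rough sketch of what I have in mind (prefix_ids / suffix_ids are made-up names for the tokenized shared prefix and the per-example remainder, and whether the gradients still work out correctly under LoRA is exactly what I am unsure about):

import torch

# Compute the shared 1000-word prefix once, without gradients, and keep its KV cache.
with torch.no_grad():
    prefix_out = model(input_ids=prefix_ids, use_cache=True)
prefix_cache = prefix_out.past_key_values

# Training step on the per-example question/answer tokens only, reusing the cache.
# (Recent transformers versions may mutate the cache object in place, so it may
# need to be copied or rebuilt for every batch.)
out = model(
    input_ids=suffix_ids,
    past_key_values=prefix_cache,
    labels=suffix_ids,
)
out.loss.backward()  # gradients only flow through the suffix computation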

whyiug avatar Apr 07 '24 16:04 whyiug

hi @whyiug thanks for the issue! IIUC caching is effective for inference, not for training: if you pre-compute the KV cache offline for training, how can you propagate gradients into it?

younesbelkada avatar Apr 08 '24 09:04 younesbelkada

@younesbelkada Thanks for your reply. My training method is LoRA, so all linear layers in the base model are frozen; for my training inputs they are not trainable, yet their computation over the shared prefix is redundantly repeated on every forward pass.
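
For what it's worth, this is how I checked that only the adapter weights are trainable (model here is the LoRA-wrapped PeftModel; all base weights have requires_grad=False):

# Standard PEFT / PyTorch calls to inspect trainable parameters.
model.print_trainable_parameters()
for name, p in model.named_parameters():
    if p.requires_grad:
        print(name)  # only the lora_A / lora_B adapter weights show up here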

whyiug avatar Apr 08 '24 10:04 whyiug

Is this a misunderstanding of LoRA and backpropagation on my part? Or maybe people simply don't have a need for this. @younesbelkada thanks for your advice.

whyiug avatar Apr 10 '24 04:04 whyiug

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar May 08 '24 15:05 github-actions[bot]