transformers
Fix bug: Gemma2 — pass the new "sliding_window" parameter to past_key_value.update() so that the _sliding_update function works correctly.
What does this PR do?
System info: transformers 4.42.3

When the Gemma2 model generates text long enough to exceed the sliding window size (>4096 tokens), it raises a CUDA error, which appears to be caused by the _sliding_update function failing in HybridCache. A minimal reproduction:
```python
# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning." * 800
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=1150)
print(tokenizer.decode(outputs[0]))
```
```
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
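For context, the sliding-window cache behavior that the `sliding_window` parameter enables can be sketched roughly as follows. This is a minimal, standalone illustration of the idea, not the actual HybridCache implementation: the `sliding_update` function, tensor layout, and shift-by-roll strategy here are all assumptions chosen to show why a missing `sliding_window` argument leads to out-of-bounds writes once generation passes the window size.

```python
import torch

def sliding_update(cache: torch.Tensor, new_states: torch.Tensor,
                   cache_position: torch.Tensor, sliding_window: int) -> torch.Tensor:
    """Hypothetical sketch of a sliding-window KV-cache update.

    cache:          (batch, heads, sliding_window, head_dim) rolling buffer
    new_states:     (batch, heads, 1, head_dim) states for the newest token
    cache_position: 1-element tensor with the token's absolute position
    """
    pos = int(cache_position[0])
    if pos < sliding_window:
        # Buffer not yet full: write the new states in place.
        cache[:, :, pos : pos + 1] = new_states
    else:
        # Buffer full: shift everything left by one slot and append at the
        # end. Without this branch, indexing with the absolute position
        # would write past sliding_window - 1, which surfaces on CUDA as a
        # device-side assert like the one in the traceback above.
        cache = torch.roll(cache, shifts=-1, dims=2)
        cache[:, :, -1:] = new_states
    return cache
```

A quick usage check: with `sliding_window=4`, positions 0–3 fill the buffer, and every later position evicts the oldest entry, so writes never go out of bounds no matter how long generation runs.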
Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@ArthurZucker
Could you take a look when you have a minute @sanchit-gandhi ?