sun1092469590

Results 28 comments of sun1092469590

> Will download the model and try and reproduce this, but I'm noticing that `trust_remote_code=True` is not added in the `AutoModelForCausalLM.from_pretrained`, which means that the model should not be loaded...

Thank you very much. My current transformers version is also 4.34.0, and I can run Qwen-14B normally when attention_sinks is not added.
![7bb2533b13469abc3e4d8da3e71316a](https://github.com/tomaarsen/attention_sinks/assets/19388387/26391c44-f789-4f2e-be39-fdf27d2da664)
![958d65a94de3a3210ba7bd9d2b9bd43](https://github.com/tomaarsen/attention_sinks/assets/19388387/548c9c8b-1d5e-43c1-8cb3-8beccd7305fb)
![142019338f1b62a5869b09b3de2a6a4](https://github.com/tomaarsen/attention_sinks/assets/19388387/53128241-4115-4679-a0a1-e56d5d35aa31)
![101ec095115660499df401fac840ddc](https://github.com/tomaarsen/attention_sinks/assets/19388387/32cb85cb-e65f-4be8-a338-67c47ffc0d16)

Thank you very much for your detailed answer. I will first try your method, and if it does not work I will stop using Flash Attention and test again.

1) I stopped using Flash Attention by adding the parameter `use_flash_attn=False` to `AutoModelForCausalLM.from_pretrained()`, and the result is normal, as you showed me. That is: `model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16, attention_sink_size=4,...`
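For reference, a minimal sketch of what that loading call might look like in full, assuming the `attention_sinks` drop-in `AutoModelForCausalLM` and its default window size; the model id and `attention_sink_window_size` are illustrative assumptions, not the exact values from the truncated snippet above:

```python
# Minimal sketch (not the exact original call): load Qwen-14B through the
# attention_sinks drop-in with Flash Attention disabled.
import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B"  # assumed model id; use the local path if needed

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,           # required for Qwen's custom modeling code
    use_flash_attn=False,             # disable Flash Attention, as described above
    attention_sink_size=4,            # number of initial "sink" tokens to keep
    attention_sink_window_size=1020,  # recent-token window (assumed default)
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```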

Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sinks, and how to use it with a chat model.
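A hedged sketch of how one might try this with a chat model, assuming the attention_sinks drop-in returns the Qwen model instance with its remote-code `.chat()` helper intact; the model id and prompt are illustrative, not confirmed by the thread:

```python
# Hypothetical sketch: attention_sinks with Qwen-14B-Chat via Qwen's .chat() helper.
import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    use_flash_attn=False,
    attention_sink_size=4,
)

# Qwen-Chat exposes a .chat() helper through its remote code; whether the
# sink cache is exercised through this path is an assumption to verify.
response, history = model.chat(tokenizer, "Hello, who are you?", history=None)
print(response)
```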

OK, thank you very much, I will try it. I tried another method and it produces output. This is my code; is this method right or wrong? `import torch from transformers...`

I see. I will try your method; thank you for the quick reply.

I use Qwen-14B-Chat with the script in demo/streaming.py to get results, but it very easily runs OOM, even though max_new_tokens=256 here is not very large. My GPUs are 4×80 GB....
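A hedged diagnostic sketch (not from the thread) for narrowing down the OOM: run one generation and print the per-GPU peak memory so it is clear which of the four devices fills up. The helper name is hypothetical, and `model`/`inputs` are assumed to come from whatever loading code is in use:

```python
import torch

def generate_and_report_peak_memory(model, inputs, max_new_tokens=256):
    """Run one generation and print per-GPU peak memory (hypothetical helper)."""
    # Reset the peak-memory counters on every visible GPU.
    for i in range(torch.cuda.device_count()):
        torch.cuda.reset_peak_memory_stats(i)
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Report which device came closest to its 80 GB limit.
    for i in range(torch.cuda.device_count()):
        peak_gib = torch.cuda.max_memory_allocated(i) / 1024**3
        print(f"GPU {i}: peak allocated {peak_gib:.1f} GiB")
```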