sun1092469590
> Will download the model and try and reproduce this, but I'm noticing that `trust_remote_code=True` is not added in the `AutoModelForCausalLM.from_pretrained`, which means that the model should not be loaded...
Thank you very much. My current transformers version is also 4.34.0, and I can run Qwen-14B normally when attention_sink is not added.
Thank you very much for your detailed answer. I will first try your method, and if it does not work I will stop using Flash Attention and test again.
1) I stopped using Flash Attention by adding the parameter `use_flash_attn=False` in `AutoModelForCausalLM.from_pretrained()`, and the result is normal, as you showed me (a fuller hedged sketch follows my question below). That is: `model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16, attention_sink_size=4, ...`
Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sink, and how to use it with a chat model.
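For reference, a minimal sketch of what a sink-enabled load of the Chat model might look like, assuming the `attention_sinks` drop-in `AutoModelForCausalLM`, that Qwen-14B-Chat accepts the same kwargs as the base model, and an illustrative `attention_sink_window_size`; the single-turn generation at the end is only an assumption of how it could be run, not a confirmed recipe:

```python
import torch
from attention_sinks import AutoModelForCausalLM  # sink-enabled drop-in for the transformers class
from transformers import AutoTokenizer

model_id = "Qwen/Qwen-14B-Chat"  # assumption: the Chat variant takes the same kwargs as Qwen-14B

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,           # needed for Qwen's custom modeling code
    use_flash_attn=False,             # disable Flash Attention, as in point 1)
    attention_sink_size=4,            # number of initial "sink" tokens kept in the cache
    attention_sink_window_size=1020,  # assumption: illustrative sliding-window size
)

# Illustrative single-turn generation on a raw prompt; whether Qwen-Chat's own
# model.chat() helper also works unchanged with attention sinks is an assumption to verify.
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0]))
```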
OK, thank you very much, I will try it. I tried another method and it does produce output; this is my code, is this method right or wrong? `import torch from transformers...`
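As a point of comparison, whether the sinks are actually applied usually comes down to which module `AutoModelForCausalLM` is imported from; a minimal sketch of the distinction, with the model id assumed from the earlier comments:

```python
import torch

# Stock transformers import: the model loads and generates output,
# but no attention-sink cache is injected.
# from transformers import AutoModelForCausalLM

# attention_sinks drop-in: same from_pretrained signature, but a
# sink-aware KV cache is patched into the loaded model.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B",          # assumption: the base model discussed above
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,   # required for Qwen's custom modeling code
    attention_sink_size=4,
)
```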
I see. I will try your method, thank you for the quick reply.
I use Qwen-14B-Chat and the script in demo/streaming.py to get results, but it very easily runs OOM. Here max_new_tokens=256, which is not very large, and my GPUs are 4*80G...