[LLM] Add attention_sinks for CodeShell example to optimize multi-turn chat
Description
This PR applies attention_sinks to optimize multi-turn chat, enabling multi-turn conversations without requiring the full input history.
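As background, the attention-sink cache policy keeps the first few "sink" tokens plus a sliding window of the most recent tokens, so the KV cache stays bounded no matter how many turns accumulate. The sketch below illustrates that eviction rule only; the function name and default sizes are illustrative, not the PR's code.

```python
# Illustrative only: keep the first `sink_size` cached tokens (the "sinks")
# plus a sliding window of the most recent `window_size` tokens, so the
# KV cache stays bounded across chat turns.
from typing import List


def evict_kv_positions(cache_len: int, sink_size: int = 4,
                       window_size: int = 1020) -> List[int]:
    """Return the cache positions to keep after a generation step."""
    if cache_len <= sink_size + window_size:
        return list(range(cache_len))           # nothing to evict yet
    sinks = list(range(sink_size))              # always keep the first tokens
    recent = list(range(cache_len - window_size, cache_len))
    return sinks + recent                       # bounded, regardless of history
```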
1. Why the change?
The previous multi-turn chat implementation required keeping the full input history, which could lead to OOM issues after many turns of chat.
2. User API changes
Add the --attention-sink flag to enable this optimization, for example:
python server.py --checkpoint-path --device 'xpu' --cpu-embedding --attention-sink --multi-turn --max-new-tokens 512
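A sketch of how the new flag might be wired into the example's server script; only the flag names in the command above come from the PR, the parser layout itself is hypothetical.

```python
# Hypothetical CLI wiring for the example server; flag names match the
# command above, everything else is illustrative.
import argparse

parser = argparse.ArgumentParser(description="CodeShell multi-turn chat example")
parser.add_argument("--checkpoint-path", type=str, required=True)
parser.add_argument("--device", type=str, default="xpu")
parser.add_argument("--cpu-embedding", action="store_true")
parser.add_argument("--attention-sink", action="store_true",
                    help="Use attention sinks so multi-turn chat does not "
                         "need to resend the input history")
parser.add_argument("--multi-turn", action="store_true")
parser.add_argument("--max-new-tokens", type=int, default=512)
args = parser.parse_args()
```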
3. Summary of the change
- [x] Add CodeShell model support in attention_sinks.
- [x] Support attention_sinks with the transformers-int4 format (see the sketch after this list).
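A minimal sketch of loading CodeShell in the transformers-int4 format, assuming the bigdl-llm `AutoModelForCausalLM` loader used by the other examples; the model id is illustrative, and the attention-sink hook itself is left as a placeholder since its exact entry point lives in the example code added by this PR.

```python
# Sketch only: assumes the bigdl-llm transformers-int4 loading path; the
# attention-sink cache is attached by the example code in this PR and is
# represented here by a placeholder comment.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

checkpoint_path = "WisdomShell/CodeShell-7B-Chat"   # illustrative model id

tokenizer = AutoTokenizer.from_pretrained(checkpoint_path,
                                          trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path,
                                             load_in_4bit=True,   # transformers-int4
                                             trust_remote_code=True)
model = model.to("xpu")
# ... the example then installs the attention-sink KV-cache policy
#     (sink tokens + sliding window) before starting the chat loop.
```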
4. How to test?
- [x] Unit test
Attention sink is added as an example rather than as part of BigDL at this moment; a lightweight check of its eviction rule is sketched below.
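Since the attention sink lives in the example rather than the library, one lightweight check is a pytest-style test of the eviction rule sketched in the description above (the function name is the hypothetical one used there, not the PR's code).

```python
# Hypothetical unit test for the eviction sketch above: the kept cache must
# stay bounded and always contain the sink tokens plus the most recent ones.
def test_cache_stays_bounded():
    sink, window = 4, 1020
    for cache_len in (8, 1024, 5000, 50_000):
        kept = evict_kv_positions(cache_len, sink_size=sink, window_size=window)
        assert len(kept) <= sink + window
        assert kept[:min(sink, cache_len)] == list(range(min(sink, cache_len)))
        if cache_len > sink + window:
            assert kept[-window:] == list(range(cache_len - window, cache_len))
```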