[LLM] Add attention_sinks for CodeShell example to optimize multi-turn chat
Description
This PR applies attention_sinks to optimize multi-turn chat, enabling multi-turn conversations without requiring the full input history.
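As background, the attention-sink cache policy keeps the first few "sink" tokens plus a sliding window of the most recent tokens, so the KV cache stays bounded no matter how many turns accumulate. The sketch below illustrates that eviction rule only; the function name and default sizes are illustrative, not the PR's code.

```python
# Illustrative only: keep the first `sink_size` cached tokens (the "sinks")
# plus a sliding window of the most recent `window_size` tokens, so the
# KV cache stays bounded across chat turns.
from typing import List


def evict_kv_positions(cache_len: int, sink_size: int = 4,
                       window_size: int = 1020) -> List[int]:
    """Return the cache positions to keep after a generation step."""
    if cache_len <= sink_size + window_size:
        return list(range(cache_len))           # nothing to evict yet
    sinks = list(range(sink_size))              # always keep the first tokens
    recent = list(range(cache_len - window_size, cache_len))
    return sinks + recent                       # bounded, regardless of history
```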
1. Why the change?
The previous multi-turn chat implementation required keeping the full input history, which could lead to OOM issues after many turns of chat.
2. User API changes
Add the --attention-sink flag to enable this optimization, for example:
python server.py --checkpoint-path --device 'xpu' --cpu-embedding --attention-sink --multi-turn --max-new-tokens 512
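A sketch of how the new flag might be wired into the example's server script; only the flag names in the command above come from the PR, the parser layout itself is hypothetical.

```python
# Hypothetical CLI wiring for the example server; flag names match the
# command above, everything else is illustrative.
import argparse

parser = argparse.ArgumentParser(description="CodeShell multi-turn chat example")
parser.add_argument("--checkpoint-path", type=str, required=True)
parser.add_argument("--device", type=str, default="xpu")
parser.add_argument("--cpu-embedding", action="store_true")
parser.add_argument("--attention-sink", action="store_true",
                    help="Use attention sinks so multi-turn chat does not "
                         "need to resend the input history")
parser.add_argument("--multi-turn", action="store_true")
parser.add_argument("--max-new-tokens", type=int, default=512)
args = parser.parse_args()
```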
3. Summary of the change
- [x] Add CodeShell model support in attention_sinks.
- [x] Support attention_sinks with the transformers-int4 format (see the sketch after this list).
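A minimal sketch of loading CodeShell in the transformers-int4 format, assuming the bigdl-llm `AutoModelForCausalLM` loader used by the other examples; the model id is illustrative, and the attention-sink hook itself is left as a placeholder since its exact entry point lives in the example code added by this PR.

```python
# Sketch only: assumes the bigdl-llm transformers-int4 loading path; the
# attention-sink cache is attached by the example code in this PR and is
# represented here by a placeholder comment.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

checkpoint_path = "WisdomShell/CodeShell-7B-Chat"   # illustrative model id

tokenizer = AutoTokenizer.from_pretrained(checkpoint_path,
                                          trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path,
                                             load_in_4bit=True,   # transformers-int4
                                             trust_remote_code=True)
model = model.to("xpu")
# ... the example then installs the attention-sink KV-cache policy
#     (sink tokens + sliding window) before starting the chat loop.
```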
4. How to test?
- [x] Unit test
Attention sink is added as an example rather than as part of BigDL at this moment; a lightweight check of its eviction rule is sketched below.
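Since the attention sink lives in the example rather than the library, one lightweight check is a pytest-style test of the eviction rule sketched in the description above (the function name is the hypothetical one used there, not the PR's code).

```python
# Hypothetical unit test for the eviction sketch above: the kept cache must
# stay bounded and always contain the sink tokens plus the most recent ones.
def test_cache_stays_bounded():
    sink, window = 4, 1020
    for cache_len in (8, 1024, 5000, 50_000):
        kept = evict_kv_positions(cache_len, sink_size=sink, window_size=window)
        assert len(kept) <= sink + window
        assert kept[:min(sink, cache_len)] == list(range(min(sink, cache_len)))
        if cache_len > sink + window:
            assert kept[-window:] == list(range(cache_len - window, cache_len))
```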