
[LLM] Add attention_sinks for CodeShell example to optimize multi-turn chat

Open sgwhat opened this issue 1 year ago • 1 comment

Description

This PR applies attention_sinks to optimize multi-turn chat, so the model can carry on a multi-turn conversation without re-feeding the full input history.

1. Why the change?

Previously, multi-turn chat required keeping the full input history, which could lead to OOM issues after many turns.
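
For context, here is a minimal sketch of the attention-sink idea (in the StreamingLLM-style attention_sinks approach), assuming the usual `[batch, heads, seq, head_dim]` KV-cache layout: keep the first few "sink" tokens plus a sliding window of the most recent tokens, so cache memory stays bounded no matter how many turns accumulate. `trim_kv_cache`, `sink_size`, and `window_size` are illustrative names, not this PR's actual code.

```python
from typing import Tuple
import torch

def trim_kv_cache(
    past_key_values: Tuple[Tuple[torch.Tensor, torch.Tensor], ...],
    sink_size: int = 4,
    window_size: int = 1020,
):
    """Keep the first `sink_size` and last `window_size` cached tokens per layer."""
    trimmed = []
    for k, v in past_key_values:
        seq_len = k.size(2)  # assumes [batch, heads, seq, head_dim]
        if seq_len <= sink_size + window_size:
            trimmed.append((k, v))
            continue
        # Concatenate the sink tokens with the sliding window of recent tokens.
        k = torch.cat([k[:, :, :sink_size], k[:, :, -window_size:]], dim=2)
        v = torch.cat([v[:, :, :sink_size], v[:, :, -window_size:]], dim=2)
        trimmed.append((k, v))
    return tuple(trimmed)
```

The real attention_sinks implementation also re-indexes cached positions so rotary embeddings stay consistent after trimming; the sketch above omits that detail.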

2. User API changes

Add --attention-sink to enable this optimization.

python server.py --checkpoint-path --device 'xpu' --cpu-embedding --attention-sink --multi-turn --max-new-tokens 512
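
For reference, a hypothetical sketch of how the flag could be wired into server.py's argument parsing; the flag names mirror the command above, while the surrounding code is illustrative and not taken from this PR.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint-path", type=str, required=True)
parser.add_argument("--device", type=str, default="xpu")
parser.add_argument("--cpu-embedding", action="store_true")
parser.add_argument("--attention-sink", action="store_true",
                    help="Cache only sink + recent tokens instead of the full history")
parser.add_argument("--multi-turn", action="store_true")
parser.add_argument("--max-new-tokens", type=int, default=512)
args = parser.parse_args()

# if args.attention_sink:
#     enable the attention-sink KV-cache policy for multi-turn chat (illustrative)
```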

3. Summary of the change

  • [x] Add CodeShell model support in attention_sinks.
  • [x] Support attention_sinks with the transformers-int4 format (see the sketch after this list).
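
A rough sketch of how the two items above could fit together, assuming the bigdl-llm (now ipex-llm) transformers-style API (`from bigdl.llm.transformers import AutoModelForCausalLM` with `load_in_4bit=True`); the checkpoint id and the cache-trimming step are illustrative, not taken from this PR.

```python
import torch
import intel_extension_for_pytorch as ipex  # needed for the "xpu" device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM  # "ipex_llm" in newer releases

model_path = "WisdomShell/CodeShell-7B-Chat"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,          # transformers-int4 format
    trust_remote_code=True,
).to("xpu")

prompt = "Write a quicksort in Python."
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# With --attention-sink enabled, the server would trim the KV cache between turns
# (e.g. with a helper like trim_kv_cache above) instead of carrying the full chat
# history into the next turn.
```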

4. How to test?

  • [x] Unit test

sgwhat avatar Jan 03 '24 09:01 sgwhat

Add attention sink as an example, rather than as part of BigDL at this moment.

jason-dai avatar Jan 03 '24 09:01 jason-dai