ipex-llm
Memory utilization for 1k input is larger than for 3k input with Baichuan2-7B at INT4 precision
We cannot reproduce this issue. In our testing, the peak memory of W4A16 Baichuan2-7B grows with the input sequence length when the maximum output is 512 tokens:
| input length | peak mem (GB) |
|---|---|
| 1k | 5.341796875 |
| 2k | 5.798828125 |
| 3k | 6.72265625 |
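For context, peak figures like those above can be captured around a single `generate` call. Below is a minimal sketch, assuming bigdl-llm's `AutoModelForCausalLM` loader and IPEX's `torch.xpu` memory-statistics API; the model path and prompt construction are illustrative, not the benchmark's actual code:

```python
# Hedged sketch of measuring peak XPU memory for one input/output pair.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers torch.xpu)
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

MODEL_PATH = "/home/intel/LLM/baichuan-inc/Baichuan2-7B-Chat"  # illustrative

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, load_in_low_bit="sym_int4", trust_remote_code=True
).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Stand-in prompt; the real benchmark builds prompts of exactly 1024/2048/3072 tokens.
inputs = tokenizer("hello " * 1024, return_tensors="pt").to("xpu")

torch.xpu.reset_peak_memory_stats()  # clear any earlier peak
with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=512)
torch.xpu.synchronize()
print(f"peak mem: {torch.xpu.max_memory_allocated() / 1024 ** 3:.3f} GB")
```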
Steps to run:
- Go to the all-in-one benchmark folder and activate the conda environment:
```bash
cd ~/test/ipex-llm/python/llm/dev/benchmark/all-in-one
conda activate bigdl-llm
```
- Enable the quantized KV cache:
```bash
export BIGDL_QUANTIZE_KV_CACHE=1
```
- Replace the content of `/home/intel/test/ipex-llm/python/llm/dev/benchmark/all-in-one/config.yaml` with this setting (a sketch of how these fields drive the run follows the YAML):
```yaml
repo_id:
- 'baichuan-inc/Baichuan2-7B-Chat'
local_model_hub: '/home/intel/LLM'
warm_up: 1
num_trials: 3
num_beams: 1 # default to greedy search
low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
batch_size: 1 # default to 1
in_out_pairs:
- '1024-512'
- '2048-512'
- '3072-512'
test_api:
- "transformer_int4_gpu" # on Intel GPU
cpu_embedding: False # whether to put the embedding on the CPU (currently only available for the GPU Windows test_api)
streaming: False # whether to produce output in a streaming way (currently only available for the GPU Windows test_api)
```
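For reference, a minimal sketch of how such a config could be consumed; the field names match the YAML above, but this parsing helper is illustrative, not the benchmark's own code:

```python
# Illustrative reader for the config.yaml above.
import yaml

with open("config.yaml") as f:
    conf = yaml.safe_load(f)

# Each in_out_pair is a 'input-output' token-count string, e.g. '1024-512'.
for pair in conf["in_out_pairs"]:
    in_len, out_len = (int(x) for x in pair.split("-"))
    print(f"model={conf['repo_id'][0]}  input={in_len} tokens  "
          f"output={out_len} tokens  low_bit={conf['low_bit']}  "
          f"batch={conf['batch_size']}")
```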
- Install the correct bigdl-llm version:
```bash
pip install --pre --upgrade bigdl-llm[xpu]==2.5.0b20240322 -f https://developer.intel.com/ipex-whl-stable-xpu
```
- Run the benchmark:
```bash
bash run-arc.sh
```
- Check the result CSV file in the current folder (a quick way to inspect it is sketched below).
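A short sketch for inspecting the newest result file; the filename pattern and exact column layout are assumptions, so check the file the run actually produces:

```python
# Hedged sketch: load and print the most recent benchmark CSV in the folder.
import glob
import os

import pandas as pd

csv_path = max(glob.glob("*.csv"), key=os.path.getmtime)  # newest CSV
df = pd.read_csv(csv_path)
print(df.to_string())  # expect one row per in/out pair with latency and memory figures
```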