ipex-llm
Memory utilization for 1k input is larger than for 3k input with Baichuan2-7B at INT4 precision
We cannot reproduce this issue. In our testing, the peak memory of W4A16 Baichuan2-7B grows with the input sequence length when the maximum output is 512 tokens:
| input length | peak mem (GB) |
|---|---|
| 1k | 5.341796875 |
| 2k | 5.798828125 |
| 3k | 6.72265625 |
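For context, peak figures like those above can be captured around a single `generate` call. Below is a minimal sketch, assuming bigdl-llm's `AutoModelForCausalLM` loader and IPEX's `torch.xpu` memory-statistics API; the model path and prompt construction are illustrative, not the benchmark's actual code:

```python
# Hedged sketch of measuring peak XPU memory for one input/output pair.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers torch.xpu)
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

MODEL_PATH = "/home/intel/LLM/baichuan-inc/Baichuan2-7B-Chat"  # illustrative

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, load_in_low_bit="sym_int4", trust_remote_code=True
).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Stand-in prompt; the real benchmark builds prompts of exactly 1024/2048/3072 tokens.
inputs = tokenizer("hello " * 1024, return_tensors="pt").to("xpu")

torch.xpu.reset_peak_memory_stats()  # clear any earlier peak
with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=512)
torch.xpu.synchronize()
print(f"peak mem: {torch.xpu.max_memory_allocated() / 1024 ** 3:.3f} GB")
```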
Steps to run:
- Go to the all-in-one benchmark folder and activate the conda environment:
```bash
cd ~/test/ipex-llm/python/llm/dev/benchmark/all-in-one
conda activate bigdl-llm
```
- Enable the quantized KV cache:
```bash
export BIGDL_QUANTIZE_KV_CACHE=1
```
- Replace the content of `/home/intel/test/ipex-llm/python/llm/dev/benchmark/all-in-one/config.yaml` with this setting (a sketch of how these fields drive the run follows the YAML):
```yaml
repo_id:
- 'baichuan-inc/Baichuan2-7B-Chat'
local_model_hub: '/home/intel/LLM'
warm_up: 1
num_trials: 3
num_beams: 1 # default to greedy search
low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
batch_size: 1 # default to 1
in_out_pairs:
- '1024-512'
- '2048-512'
- '3072-512'
test_api:
- "transformer_int4_gpu" # on Intel GPU
cpu_embedding: False # whether to put the embedding on the CPU (currently only available for the GPU Windows test_api)
streaming: False # whether to produce output in a streaming way (currently only available for the GPU Windows test_api)
```
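For reference, a minimal sketch of how such a config could be consumed; the field names match the YAML above, but this parsing helper is illustrative, not the benchmark's own code:

```python
# Illustrative reader for the config.yaml above.
import yaml

with open("config.yaml") as f:
    conf = yaml.safe_load(f)

# Each in_out_pair is a 'input-output' token-count string, e.g. '1024-512'.
for pair in conf["in_out_pairs"]:
    in_len, out_len = (int(x) for x in pair.split("-"))
    print(f"model={conf['repo_id'][0]}  input={in_len} tokens  "
          f"output={out_len} tokens  low_bit={conf['low_bit']}  "
          f"batch={conf['batch_size']}")
```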
- Install the correct bigdl-llm version:
```bash
pip install --pre --upgrade bigdl-llm[xpu]==2.5.0b20240322 -f https://developer.intel.com/ipex-whl-stable-xpu
```
- Run the benchmark:
```bash
bash run-arc.sh
```
- Check the result CSV file in the current folder (a quick way to inspect it is sketched below).
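A short sketch for inspecting the newest result file; the filename pattern and exact column layout are assumptions, so check the file the run actually produces:

```python
# Hedged sketch: load and print the most recent benchmark CSV in the folder.
import glob
import os

import pandas as pd

csv_path = max(glob.glob("*.csv"), key=os.path.getmtime)  # newest CSV
df = pd.read_csv(csv_path)
print(df.to_string())  # expect one row per in/out pair with latency and memory figures
```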