Failed to start AWQ-quantized model with LightLLM on Qwen2-7B-Instruct
AWQ config:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: learnable
    act:
        bit: 16
        symmetric: False
        granularity: per_token
        calib_algo: minmax
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-best/trans
Start with LightLLM:
python -m lightllm.server.api_server --model_dir /app/src/llmc/scripts/save/qwen2-7b-instruct-awq_w4a16-best/trans/transformed_model --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 4096
Test:
curl http://127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}}' \
    -H 'Content-Type: application/json'
Got this error from LightLLM:
INFO: Started server process [2195]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO 08-29 02:36:18 [manager.py:105] recieved req X-Request-Id: X-Session-Id: start_time:2024-08-29 02:36:18 lightllm_req_id:0
INFO 08-29 02:36:18 [manager.py:270] lightllm_req_id:0 prompt_cache_len:0 prompt_cache_ratio:0.0
DEBUG 08-29 02:36:18 [manager.py:275] Init Batch: batch_id=e345cbf06da74589808ac2e6a501704d, time:1724898978.7968152s req_ids:[0]
DEBUG 08-29 02:36:18 [manager.py:275]
DEBUG 08-29 02:36:20 [stats.py:37] Avg tokens(prompt+generate) throughput: 0.079 tokens/s
DEBUG 08-29 02:36:20 [stats.py:37] Avg prompt tokens throughput: 0.079 tokens/s
DEBUG 08-29 02:36:20 [stats.py:37] Avg generate tokens throughput: 0.000 tokens/s
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py:161> exception=RuntimeError('probability tensor contains either `inf`, `nan` or element < 0')>
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py", line 166, in loop_for_fwd
await self._step()
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py", line 240, in _step
await self._decode_batch(self.running_batch)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py", line 305, in _decode_batch
ans = await asyncio.gather(*rets)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/model_rpc.py", line 162, in decode_batch
ans = self._decode_batch(batch_id)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/model_rpc.py", line 68, in exposed_decode_batch
return self.backend.decode_batch(batch_id)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/utils/infer_utils.py", line 57, in inner_func
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/impl.py", line 22, in decode_batch
return self.forward(batch_id, is_prefill=False)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/impl.py", line 34, in forward
next_token_ids, next_token_probs = sample(logits, run_reqs, self.eos_id)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/post_process.py", line 44, in sample
sampled_index = torch.multinomial(probs_sort, num_samples=1, replacement=True)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
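For reference, the failing call is the sampling step in LightLLM's post_process.py; a minimal standalone reproduction of the error message (assuming only that the model produced NaN logits, which points at the weights rather than at the sampler) is:

import torch

# If the loaded weights produce NaN or inf logits, softmax keeps the NaNs and
# torch.multinomial rejects the probability tensor with exactly this error.
logits = torch.tensor([[1.0, float("nan"), 0.5]])
probs = torch.softmax(logits, dim=-1)    # tensor([[nan, nan, nan]])
torch.multinomial(probs, num_samples=1)  # RuntimeError: probability tensor contains
                                         # either `inf`, `nan` or element < 0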
PS: I get a similar error if I start LightLLM with the option "--mode triton_w4a16".
You need to use symmetric mode for AWQ if you want to run inference with LightLLM. Additionally, remove the act section.
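To make the difference concrete, here is a rough illustration (an assumption for explanation, not llmc's or LightLLM's actual code): symmetric quantization stores only a scale, while asymmetric additionally stores a zero-point.

import torch

def quant_sym_int4(w):
    # symmetric: zero maps to zero, only a scale is stored
    scale = w.abs().amax(dim=-1, keepdim=True) / 7            # int4 range [-8, 7]
    return torch.clamp(torch.round(w / scale), -8, 7), scale

def quant_asym_uint4(w):
    # asymmetric: a zero-point shifts the range, so [min, max] maps to [0, 15]
    lo = w.amin(dim=-1, keepdim=True)
    hi = w.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / 15
    zero_point = torch.round(-lo / scale)
    return torch.clamp(torch.round(w / scale) + zero_point, 0, 15), scale, zero_point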
I changed to the following AWQ config, but I still get the same error:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_channel
        group_size: -1
        calib_algo: learnable
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/trans
You must also use per-group quantization with a group size of 128 in llmc to fit the backend kernel.
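For context, a short sketch (assumed for illustration, not the actual kernel code) of what per-group quantization with group_size 128 means for the weight scales:

import torch

# Each output row is split into groups of 128 input channels, and every group
# gets its own scale; this is the layout the w4a16 backend kernel consumes.
def per_group_scales(w, group_size=128):
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    return groups.abs().amax(dim=-1) / 7     # shape: (out_f, in_f // group_size)

w = torch.randn(32, 512)
print(per_group_scales(w).shape)             # torch.Size([32, 4])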
After changing the config, I still get the same error:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_channel
        group_size: 128
        calib_algo: learnable
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/trans
Hi, the granularity should not be per_channel. You should change it to per_group.
I still get the same error after changing to per_group:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_group
        group_size: 128
        calib_algo: learnable
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/trans
Did you still get the error in LightLLM's quantization mode after fixing the llmc config? If you do not use the quantization kernel, the weight clipping in AWQ makes this error reasonable.
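To unpack that with a hypothetical illustration (llmc's actual clip search is more involved): AWQ's weight clipping deliberately narrows each weight range before quantization, so the clipped transformed weights only make sense once they are actually quantized by the kernel.

import torch

# Hypothetical example of the idea (not llmc's implementation): clamp each row's
# weights to a fraction of their original max before quantization. Serving these
# clipped fp16 weights without the w4a16 kernel runs a model whose weights were
# altered on purpose.
def clip_weights(w, clip_ratio=0.9):
    max_val = w.abs().amax(dim=-1, keepdim=True) * clip_ratio
    return torch.clamp(w, -max_val, max_val)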
I tried starting LightLLM both with and without "--mode triton_w4a16"; both give the same error.
We will try to reproduce the error later; please wait a bit. In the meantime, you can also try other algorithms.
I get the same error with QuaRot:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
eval:
    eval_pos: [fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    seq_len: 2048
quant:
    method: Quarot
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        qmax_to_tensor: True
        calib_algo: minmax
    act:
        bit: 16
        symmetric: False
        granularity: per_token
        qmax_to_tensor: True
    special:
        rotate_mode: hadamard
        fp32_had: True
        online_rotate: False
save:
    save_trans: True
    save_fake: False
    save_path: ./save/qwen2-7b-instruct-quarot_w4a16/trans
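As a side note on the rotate_mode: hadamard setting above, here is a toy sketch (an assumed illustration, not llmc's implementation) of why an orthogonal Hadamard rotation can be folded into the weights without changing the layer's output:

import torch

# QuaRot-style idea: multiply the weights by an orthonormal (Hadamard) matrix so
# that activation outliers get spread across channels before low-bit quantization.
# Because the rotation is orthogonal, W @ x == (W @ H) @ (H.T @ x).
def hadamard(n):
    h = torch.ones(1, 1)
    while h.shape[0] < n:                       # Sylvester construction, power-of-two n
        h = torch.cat([torch.cat([h,  h], dim=1),
                       torch.cat([h, -h], dim=1)], dim=0)
    return h / (h.shape[0] ** 0.5)              # orthonormal

H = hadamard(128)
W = torch.randn(64, 128)
x = torch.randn(128)
print(torch.allclose(W @ x, (W @ H) @ (H.T @ x), atol=1e-4))   # True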