Failed to start AWQ-quantized model with LightLLM on Qwen2-7B-Instruct
AWQ config:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: learnable
    act:
        bit: 16
        symmetric: False
        granularity: per_token
        calib_algo: minmax
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-best/trans
Start with LightLLM:
python -m lightllm.server.api_server --model_dir /app/src/llmc/scripts/save/qwen2-7b-instruct-awq_w4a16-best/trans/transformed_model --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 4096
Test:
curl http://127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}}' \
    -H 'Content-Type: application/json'
Got this error from LightLLM:
INFO: Started server process [2195]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO 08-29 02:36:18 [manager.py:105] recieved req X-Request-Id: X-Session-Id: start_time:2024-08-29 02:36:18 lightllm_req_id:0
INFO 08-29 02:36:18 [manager.py:270] lightllm_req_id:0 prompt_cache_len:0 prompt_cache_ratio:0.0
DEBUG 08-29 02:36:18 [manager.py:275] Init Batch: batch_id=e345cbf06da74589808ac2e6a501704d, time:1724898978.7968152s req_ids:[0]
DEBUG 08-29 02:36:18 [manager.py:275]
DEBUG 08-29 02:36:20 [stats.py:37] Avg tokens(prompt+generate) throughput: 0.079 tokens/s
DEBUG 08-29 02:36:20 [stats.py:37] Avg prompt tokens throughput: 0.079 tokens/s
DEBUG 08-29 02:36:20 [stats.py:37] Avg generate tokens throughput: 0.000 tokens/s
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py:161> exception=RuntimeError('probability tensor contains either `inf`, `nan` or element < 0')>
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py", line 166, in loop_for_fwd
await self._step()
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py", line 240, in _step
await self._decode_batch(self.running_batch)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/manager.py", line 305, in _decode_batch
ans = await asyncio.gather(*rets)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/model_rpc.py", line 162, in decode_batch
ans = self._decode_batch(batch_id)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/model_rpc.py", line 68, in exposed_decode_batch
return self.backend.decode_batch(batch_id)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/utils/infer_utils.py", line 57, in inner_func
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/impl.py", line 22, in decode_batch
return self.forward(batch_id, is_prefill=False)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/impl.py", line 34, in forward
next_token_ids, next_token_probs = sample(logits, run_reqs, self.eos_id)
File "/opt/conda/lib/python3.9/site-packages/lightllm-2.0.0-py3.9.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/post_process.py", line 44, in sample
sampled_index = torch.multinomial(probs_sort, num_samples=1, replacement=True)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
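For reference, the failing call is the sampling step in LightLLM's post_process.py; a minimal standalone reproduction of the error message (assuming only that the model produced NaN logits, which points at the weights rather than at the sampler) is:

import torch

# If the loaded weights produce NaN or inf logits, softmax keeps the NaNs and
# torch.multinomial rejects the probability tensor with exactly this error.
logits = torch.tensor([[1.0, float("nan"), 0.5]])
probs = torch.softmax(logits, dim=-1)    # tensor([[nan, nan, nan]])
torch.multinomial(probs, num_samples=1)  # RuntimeError: probability tensor contains
                                         # either `inf`, `nan` or element < 0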
PS: I get a similar error if I start LightLLM with the option "--mode triton_w4a16".
You need to use symmetric mode for AWQ if you want to run inference with LightLLM. Additionally, remove the act section.
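To make the difference concrete, here is a rough illustration (an assumption for explanation, not llmc's or LightLLM's actual code): symmetric quantization stores only a scale, while asymmetric additionally stores a zero-point.

import torch

def quant_sym_int4(w):
    # symmetric: zero maps to zero, only a scale is stored
    scale = w.abs().amax(dim=-1, keepdim=True) / 7            # int4 range [-8, 7]
    return torch.clamp(torch.round(w / scale), -8, 7), scale

def quant_asym_uint4(w):
    # asymmetric: a zero-point shifts the range, so [min, max] maps to [0, 15]
    lo = w.amin(dim=-1, keepdim=True)
    hi = w.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / 15
    zero_point = torch.round(-lo / scale)
    return torch.clamp(torch.round(w / scale) + zero_point, 0, 15), scale, zero_point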
I changed to the following AWQ config, but I still get the same error:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_channel
        group_size: -1
        calib_algo: learnable
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/trans
You must also use per-group quantization with a group size of 128 in llmc to fit the backend kernel.
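For context, a short sketch (assumed for illustration, not the actual kernel code) of what per-group quantization with group_size 128 means for the weight scales:

import torch

# Each output row is split into groups of 128 input channels, and every group
# gets its own scale; this is the layout the w4a16 backend kernel consumes.
def per_group_scales(w, group_size=128):
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    return groups.abs().amax(dim=-1) / 7     # shape: (out_f, in_f // group_size)

w = torch.randn(32, 512)
print(per_group_scales(w).shape)             # torch.Size([32, 4])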
After changing the config, I still get the same error:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_channel
        group_size: 128
        calib_algo: learnable
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/trans
Hi, the granularity should not be per_channel. You should change it to per_group.
I still get the same error after changing to per_group:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: /app/src/llmc/tools/data/calib/pileval
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    seq_len: 2048
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_group
        group_size: 128
        calib_algo: learnable
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_version: v2
        save_scale: True
        scale_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/scale
        save_clip: True
        clip_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/clip
save:
    save_trans: True
    save_quant: False
    save_lightllm: False
    save_path: ./save/qwen2-7b-instruct-awq_w4a16-lightllm-best/trans
Did you still get the error in LightLLM's quantization mode after fixing the llmc config? If you do not use the quantization kernel, the weight clipping in AWQ makes this error reasonable.
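To unpack that with a hypothetical illustration (llmc's actual clip search is more involved): AWQ's weight clipping deliberately narrows each weight range before quantization, so the clipped transformed weights only make sense once they are actually quantized by the kernel.

import torch

# Hypothetical example of the idea (not llmc's implementation): clamp each row's
# weights to a fraction of their original max before quantization. Serving these
# clipped fp16 weights without the w4a16 kernel runs a model whose weights were
# altered on purpose.
def clip_weights(w, clip_ratio=0.9):
    max_val = w.abs().amax(dim=-1, keepdim=True) * clip_ratio
    return torch.clamp(w, -max_val, max_val)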
I tried starting LightLLM both with and without "--mode triton_w4a16"; both give the same error.
We will try to reproduce the error later; please wait a bit. In the meantime, you can also try other algorithms.
I get the same error with QuaRot:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: /models/Qwen2-7B-Instruct
    tokenizer_mode: slow
    torch_dtype: auto
eval:
    eval_pos: [fake_quant]
    name: wikitext2
    download: False
    path: /app/src/llmc/tools/data/eval/wikitext2
    bs: 1
    inference_per_block: False
    seq_len: 2048
quant:
    method: Quarot
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        qmax_to_tensor: True
        calib_algo: minmax
    act:
        bit: 16
        symmetric: False
        granularity: per_token
        qmax_to_tensor: True
    special:
        rotate_mode: hadamard
        fp32_had: True
        online_rotate: False
save:
    save_trans: True
    save_fake: False
    save_path: ./save/qwen2-7b-instruct-quarot_w4a16/trans
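As a side note on the rotate_mode: hadamard setting above, here is a toy sketch (an assumed illustration, not llmc's implementation) of why an orthogonal Hadamard rotation can be folded into the weights without changing the layer's output:

import torch

# QuaRot-style idea: multiply the weights by an orthonormal (Hadamard) matrix so
# that activation outliers get spread across channels before low-bit quantization.
# Because the rotation is orthogonal, W @ x == (W @ H) @ (H.T @ x).
def hadamard(n):
    h = torch.ones(1, 1)
    while h.shape[0] < n:                       # Sylvester construction, power-of-two n
        h = torch.cat([torch.cat([h,  h], dim=1),
                       torch.cat([h, -h], dim=1)], dim=0)
    return h / (h.shape[0] ** 0.5)              # orthonormal

H = hadamard(128)
W = torch.randn(64, 128)
x = torch.randn(128)
print(torch.allclose(W @ x, (W @ H) @ (H.T @ x), atol=1e-4))   # True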