Error when trying to load quantized llava-v1.6-34b
Here's what I've done:
- Quantized llava-v1.6-34b with the following code:
```python
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path='/root/ssd/llava-v1.6-34b',
    model_base=None,
    model_name='llava-v1.6-34b',
    load_8bit=False,
    load_4bit=True,
    device_map='auto',
    device='cuda',
    use_flash_attn=False
)
tokenizer.save_pretrained('/root/ssd/llava-v1.6-34b-int4')
model.save_pretrained('/root/ssd/llava-v1.6-34b-int4')
```
- Modified line 86 of `config.json`: `"model_type": "llava_llama"` -> `"model_type": "llava"` (a scripted version of this edit is sketched after this list).
- Successfully loaded the quantized model using both:
```python
from llava.model import LlavaLlamaForCausalLM

model = LlavaLlamaForCausalLM.from_pretrained('/root/ssd/llava-v1.6-34b-int4')
```
and
```python
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path='/root/ssd/llava-v1.6-34b-int4',
    model_base=None,
    model_name='llava-v1.6-34b',
    load_8bit=False,
    load_4bit=True,
    device_map='auto',
    device='cuda',
    use_flash_attn=False
)
```
- Successfully loaded the vanilla model with:
```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    model_name='liuhaotian/llava-v1.6-34b',
    model_path='/root/ssd/llava-v1.6-34b',
    backend_config=TurbomindEngineConfig(
        tp=4,
        model_format='hf',
        session_len=8192,
        cache_max_entry_count=0.1,
    ),
    log_level='INFO'
)
```
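As referenced above, a minimal sketch of the `config.json` edit, assuming the file edited is the one in the quantized output directory and is plain JSON (neither is stated explicitly in this thread):

```python
import json

# Hypothetical path: the config saved alongside the 4-bit checkpoint above.
cfg_path = '/root/ssd/llava-v1.6-34b-int4/config.json'
with open(cfg_path) as f:
    cfg = json.load(f)

# The checkpoint is saved with model_type 'llava_llama'; rewrite it to
# 'llava' as described in the second step, then save the file back.
cfg['model_type'] = 'llava'
with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=2)
```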
However, when I attempt to load the quantized model as follows, I encounter an error:
```python
pipe = pipeline(
    model_name='liuhaotian/llava-v1.6-34b',
    model_path='/root/ssd/llava-v1.6-34b-int4',  # -int4 here
    backend_config=TurbomindEngineConfig(
        tp=4,
        model_format='hf',
        session_len=8192,
        cache_max_entry_count=0.1,
    ),
    log_level='INFO'
)
```
Here's the error message:

```
2024-04-10 08:09:02,731 - lmdeploy - INFO - Using turbomind engine
2024-04-10 08:09:02,731 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name=None, model_format='hf', tp=4, session_len=8192, max_batch_size=128, cache_max_entry_count=0.1, cache_block_seq_len=64, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192)
2024-04-10 08:09:02,731 - lmdeploy - INFO - input chat_template_config=None
2024-04-10 08:09:02,731 - lmdeploy - WARNING - Could not find liuhaotian/llava-v1.6-34b-int4 in registered models. Register liuhaotian/llava-v1.6-34b-int4 using the BaseChatTemplate.
2024-04-10 08:09:02,731 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='liuhaotian/llava-v1.6-34b-int4', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-04-10 08:09:02,781 - lmdeploy - WARNING - model_source: hf_model
2024-04-10 08:09:05,017 - lmdeploy - WARNING - model_config:
[llama]
model_name = base
tensor_para_size = 4
head_num = 56
kv_head_num = 8
vocab_size = 64000
num_layer = 60
inter_size = 73400320
norm_eps = 1e-05
attn_bias = 0
start_id = 64000
end_id = 7
session_len = 8192
weight_type = fp16
rotary_embedding = 128
rope_theta = 5000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.1
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 8192
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 4096
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
[TM][INFO] Set logger level by INFO
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 8192.
Exception in thread Thread-6 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-7 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-4 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-5 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-8 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-9 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-10 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
Exception in thread self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
Thread-11 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run() out = model_comm.get_params(device_id, rank)
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
out = model_comm.get_params(device_id, rank)self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
out = model_comm.get_params(device_id, rank)
RuntimeError
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
out = model_comm.get_params(device_id, rank)
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
```
Despite the error, GPU memory usage appears to be low (286MiB/22GiB). And this is my `pip list`:

```
Package Version
------------------------- -----------
accelerate 0.21.0
addict 2.4.0
aiofiles 23.2.1
altair 5.3.0
annotated-types 0.6.0
anyio 4.3.0
attrs 23.2.0
bitsandbytes 0.43.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
contourpy 1.2.1
cycler 0.12.1
einops 0.6.1
einops-exts 0.0.4
exceptiongroup 1.2.0
fastapi 0.110.1
ffmpy 0.3.2
filelock 3.13.3
fire 0.6.0
fonttools 4.50.0
fsspec 2024.3.1
gradio 4.16.0
gradio_client 0.8.1
h11 0.14.0
httpcore 0.17.3
httpx 0.24.0
huggingface-hub 0.22.2
idna 3.6
importlib_metadata 7.1.0
importlib_resources 6.4.0
Jinja2 3.1.3
joblib 1.3.2
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
llava 1.2.2.post1
lmdeploy 0.3.0
markdown-it-py 3.0.0
markdown2 2.4.13
MarkupSafe 2.1.5
matplotlib 3.8.4
mdurl 0.1.2
mmengine-lite 0.10.3
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
orjson 3.10.0
packaging 24.0
pandas 2.2.1
peft 0.9.0
pillow 10.3.0
pip 23.3.1
platformdirs 4.2.0
protobuf 5.26.1
psutil 5.9.8
pydantic 2.6.4
pydantic_core 2.16.3
pydub 0.25.1
Pygments 2.17.2
pynvml 11.5.0
pyparsing 3.1.2
python-dateutil 2.9.0.post0
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
rich 13.7.1
rpds-py 0.18.0
ruff 0.3.5
safetensors 0.4.2
scikit-learn 1.2.2
scipy 1.13.0
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 68.2.2
shellingham 1.5.4
shortuuid 1.0.13
six 1.16.0
sniffio 1.3.1
starlette 0.37.2
svgwrite 1.4.3
sympy 1.12
termcolor 2.4.0
threadpoolctl 3.4.0
tiktoken 0.6.0
timm 0.6.13
tokenizers 0.15.1
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
torch 2.1.2
torchvision 0.16.2
tqdm 4.66.2
transformers 4.37.2
triton 2.1.0
typer 0.12.0
typer-cli 0.12.0
typer-slim 0.12.0
typing_extensions 4.10.0
tzdata 2024.1
urllib3 2.2.1
uvicorn 0.29.0
wavedrom 2.0.3.post3
websockets 11.0.3
wheel 0.41.2
yapf 0.40.2
zipp 3.18.1
```
Thanks a lot for your help!
Currently, VL models only support the turbomind backend, which only accepts the AWQ quantization format. Since llava shares its format with llama, you can use our quantization tools to quantize the model.
Here is the guide: https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/w4a16.md
To quantize a llava model, you have to modify the code according to this diff: https://github.com/InternLM/lmdeploy/commit/0b40aecc5877cd97a0e0622f9cb3fa57298b1d83
By the way, `load_4bit` uses bitsandbytes, which quantizes dynamically at runtime. It is not very efficient, and in my earlier tests it was slower than the fp16/bf16 formats.
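Once an AWQ checkpoint has been produced with `lmdeploy lite auto_awq` per that guide, loading it should look roughly like the working pipeline above with `model_format='awq'`; a sketch, where the `-awq` path is an assumed `--work-dir` output, not one from this thread:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch: point model_path at the auto_awq work-dir and switch model_format
# from 'hf' to 'awq' so turbomind loads the 4-bit weights.
pipe = pipeline(
    model_path='/root/ssd/llava-v1.6-34b-awq',  # assumed auto_awq --work-dir
    backend_config=TurbomindEngineConfig(
        tp=4,
        model_format='awq',
        session_len=8192,
        cache_max_entry_count=0.1,
    ),
)
```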
Thank you very much! The quantization now runs, but it throws the following assertion error:

```
(lmdeploy) root@ubuntu:~/8h/LLaVA/models# CUDA_VISIBLE_DEVICES=6 lmdeploy lite auto_awq /root/ssd/llava-v1.6-34b --w-group-size 32 --work-dir /root/ssd/llava-v1.6-34b-awq
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 15/15 [00:14<00:00, 1.07it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.layers.28 to CPU.
Move model.layers.29 to CPU.
Move model.layers.30 to CPU.
Move model.layers.31 to CPU.
Move model.layers.32 to CPU.
Move model.layers.33 to CPU.
Move model.layers.34 to CPU.
Move model.layers.35 to CPU.
Move model.layers.36 to CPU.
Move model.layers.37 to CPU.
Move model.layers.38 to CPU.
Move model.layers.39 to CPU.
Move model.layers.40 to CPU.
Move model.layers.41 to CPU.
Move model.layers.42 to CPU.
Move model.layers.43 to CPU.
Move model.layers.44 to CPU.
Move model.layers.45 to CPU.
Move model.layers.46 to CPU.
Move model.layers.47 to CPU.
Move model.layers.48 to CPU.
Move model.layers.49 to CPU.
Move model.layers.50 to CPU.
Move model.layers.51 to CPU.
Move model.layers.52 to CPU.
Move model.layers.53 to CPU.
Move model.layers.54 to CPU.
Move model.layers.55 to CPU.
Move model.layers.56 to CPU.
Move model.layers.57 to CPU.
Move model.layers.58 to CPU.
Move model.layers.59 to CPU.
Move model.norm to GPU.
Move model.vision_tower to GPU.
Move model.mm_projector to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 128, max gpu memory: 13.07 GB
model.layers.1, samples: 128, max gpu memory: 16.57 GB
model.layers.2, samples: 128, max gpu memory: 16.57 GB
model.layers.3, samples: 128, max gpu memory: 16.57 GB
model.layers.4, samples: 128, max gpu memory: 16.57 GB
model.layers.5, samples: 128, max gpu memory: 16.57 GB
model.layers.6, samples: 128, max gpu memory: 16.57 GB
model.layers.7, samples: 128, max gpu memory: 16.57 GB
model.layers.8, samples: 128, max gpu memory: 16.57 GB
model.layers.9, samples: 128, max gpu memory: 16.57 GB
model.layers.10, samples: 128, max gpu memory: 16.57 GB
model.layers.11, samples: 128, max gpu memory: 16.57 GB
model.layers.12, samples: 128, max gpu memory: 16.57 GB
model.layers.13, samples: 128, max gpu memory: 16.57 GB
model.layers.14, samples: 128, max gpu memory: 16.57 GB
model.layers.15, samples: 128, max gpu memory: 16.57 GB
model.layers.16, samples: 128, max gpu memory: 16.57 GB
model.layers.17, samples: 128, max gpu memory: 16.57 GB
model.layers.18, samples: 128, max gpu memory: 16.57 GB
model.layers.19, samples: 128, max gpu memory: 16.57 GB
model.layers.20, samples: 128, max gpu memory: 16.57 GB
model.layers.21, samples: 128, max gpu memory: 16.57 GB
model.layers.22, samples: 128, max gpu memory: 16.57 GB
model.layers.23, samples: 128, max gpu memory: 16.57 GB
model.layers.24, samples: 128, max gpu memory: 16.57 GB
model.layers.25, samples: 128, max gpu memory: 16.57 GB
model.layers.26, samples: 128, max gpu memory: 16.57 GB
model.layers.27, samples: 128, max gpu memory: 16.57 GB
model.layers.28, samples: 128, max gpu memory: 16.57 GB
model.layers.29, samples: 128, max gpu memory: 16.57 GB
model.layers.30, samples: 128, max gpu memory: 16.57 GB
model.layers.31, samples: 128, max gpu memory: 16.57 GB
model.layers.32, samples: 128, max gpu memory: 16.57 GB
model.layers.33, samples: 128, max gpu memory: 16.57 GB
model.layers.34, samples: 128, max gpu memory: 16.57 GB
model.layers.35, samples: 128, max gpu memory: 16.57 GB
model.layers.36, samples: 128, max gpu memory: 16.57 GB
model.layers.37, samples: 128, max gpu memory: 16.57 GB
model.layers.38, samples: 128, max gpu memory: 16.57 GB
model.layers.39, samples: 128, max gpu memory: 16.57 GB
model.layers.40, samples: 128, max gpu memory: 16.57 GB
model.layers.41, samples: 128, max gpu memory: 16.57 GB
model.layers.42, samples: 128, max gpu memory: 16.57 GB
model.layers.43, samples: 128, max gpu memory: 16.57 GB
model.layers.44, samples: 128, max gpu memory: 16.57 GB
model.layers.45, samples: 128, max gpu memory: 16.57 GB
model.layers.46, samples: 128, max gpu memory: 16.57 GB
model.layers.47, samples: 128, max gpu memory: 16.57 GB
model.layers.48, samples: 128, max gpu memory: 16.57 GB
model.layers.49, samples: 128, max gpu memory: 16.57 GB
model.layers.50, samples: 128, max gpu memory: 16.57 GB
model.layers.51, samples: 128, max gpu memory: 16.57 GB
model.layers.52, samples: 128, max gpu memory: 16.57 GB
model.layers.53, samples: 128, max gpu memory: 16.57 GB
model.layers.54, samples: 128, max gpu memory: 16.57 GB
model.layers.55, samples: 128, max gpu memory: 16.57 GB
model.layers.56, samples: 128, max gpu memory: 16.57 GB
model.layers.57, samples: 128, max gpu memory: 16.57 GB
model.layers.58, samples: 128, max gpu memory: 16.57 GB
model.layers.59, samples: 128, max gpu memory: 16.57 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
Traceback (most recent call last):
File "/root/miniconda3/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
sys.exit(run())
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
args.run(args)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
auto_awq(**kwargs)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 69, in auto_awq
smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 233, in smooth_layers
smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 109, in smooth_ln_fcs
assert torch.isnan(p).sum() == 0
AssertionError
```
This seems like it might be related to this issue: https://github.com/InternLM/lmdeploy/issues/243. Is the message "Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors" relevant here? Perhaps I should switch to a different calibration dataset?
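If switching calibration data is worth trying, a sketch using the same `auto_awq` entry point that appears in the traceback above; the keyword names are assumed to mirror the CLI flags (`--calib-dataset`, `--calib-samples`, `--calib-seqlen`) and are not confirmed in this thread:

```python
from lmdeploy.lite.apis.auto_awq import auto_awq

# Sketch: recalibrate on a different dataset instead of the default ptb_text_only.
auto_awq(
    '/root/ssd/llava-v1.6-34b',
    work_dir='/root/ssd/llava-v1.6-34b-awq',
    calib_dataset='wikitext2',  # assumed alternative to ptb
    calib_samples=128,
    calib_seqlen=2048,
    w_group_size=128,  # turbomind only supports 128 (see the next reply)
)
```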
Don't change the `--w-group-size` argument; turbomind currently only supports 128. I tried 128 yesterday and it worked.
I tried 128, 64, and 32 locally, and all of them threw the exception at the same place. Could you share your successfully quantized model? Thank you!
Yesterday I tried llava-v1.5-7b and llava-v1.6-vicuna-7b.
I just tried llava-v1.6-34b and it also throws this error; it may be the same problem as the issue you mentioned. @pppppM is there any workaround for this at the moment?
The quantization broke down: calibration produced NaN values in the parameters. The calibration strategy probably needs to be adjusted.
ref https://github.com/InternLM/lmdeploy/issues/243#issuecomment-1770503299
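For anyone debugging this, a minimal sketch (not part of lmdeploy) that mirrors the failing `torch.isnan(p).sum() == 0` assertion and reports which parameters went NaN after smoothing:

```python
import torch

def report_nan_params(model: torch.nn.Module) -> None:
    # Mirrors the assertion in lmdeploy/lite/quantization/awq.py:
    # any parameter containing NaNs after smoothing would trip it.
    for name, p in model.named_parameters():
        nans = torch.isnan(p).sum().item()
        if nans:
            print(f'{name}: {nans} NaN values')
```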
While quantizing a model using lmdeploy, I am also getting an issue with `lmdeploy lite auto_awq ./llama2-chat-7b-w4 --work-dir ./llama2-chat-7b-4bit`:

```
Traceback (most recent call last):
  File "/home/userdata/.local/bin/lmdeploy", line 8, in
```
is "./llama2-chat-7b-w4 " already a quantized model?