Error when trying to load quantized llava-v1.6-34b
Here's what I've done:
- Quantized llava-v1.6-34b with the following code:
```python
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path='/root/ssd/llava-v1.6-34b',
    model_base=None,
    model_name='llava-v1.6-34b',
    load_8bit=False,
    load_4bit=True,
    device_map='auto',
    device='cuda',
    use_flash_attn=False
)
tokenizer.save_pretrained('/root/ssd/llava-v1.6-34b-int4')
model.save_pretrained('/root/ssd/llava-v1.6-34b-int4')
```
- Modified line 86 of `config.json`: `"model_type": "llava_llama"` -> `"model_type": "llava"` (a scripted version of this edit is sketched after this list).
- Successfully loaded the quantized model using both:
```python
from llava.model import LlavaLlamaForCausalLM

model = LlavaLlamaForCausalLM.from_pretrained('/root/ssd/llava-v1.6-34b-int4')
```
and
```python
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path='/root/ssd/llava-v1.6-34b-int4',
    model_base=None,
    model_name='llava-v1.6-34b',
    load_8bit=False,
    load_4bit=True,
    device_map='auto',
    device='cuda',
    use_flash_attn=False
)
```
- Successfully loaded the vanilla model with:
```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    model_name='liuhaotian/llava-v1.6-34b',
    model_path='/root/ssd/llava-v1.6-34b',
    backend_config=TurbomindEngineConfig(
        tp=4,
        model_format='hf',
        session_len=8192,
        cache_max_entry_count=0.1,
    ),
    log_level='INFO'
)
```
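As referenced above, a minimal sketch of the `config.json` edit, assuming the file edited is the one in the quantized output directory and is plain JSON (neither is stated explicitly in this thread):

```python
import json

# Hypothetical path: the config saved alongside the 4-bit checkpoint above.
cfg_path = '/root/ssd/llava-v1.6-34b-int4/config.json'
with open(cfg_path) as f:
    cfg = json.load(f)

# The checkpoint is saved with model_type 'llava_llama'; rewrite it to
# 'llava' as described in the second step, then save the file back.
cfg['model_type'] = 'llava'
with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=2)
```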
However, when I attempt to load the quantized model as follows, I encounter an error:
```python
pipe = pipeline(
    model_name='liuhaotian/llava-v1.6-34b',
    model_path='/root/ssd/llava-v1.6-34b-int4',  # -int4 here
    backend_config=TurbomindEngineConfig(
        tp=4,
        model_format='hf',
        session_len=8192,
        cache_max_entry_count=0.1,
    ),
    log_level='INFO'
)
```
Here's the error message:

```
2024-04-10 08:09:02,731 - lmdeploy - INFO - Using turbomind engine
2024-04-10 08:09:02,731 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name=None, model_format='hf', tp=4, session_len=8192, max_batch_size=128, cache_max_entry_count=0.1, cache_block_seq_len=64, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192)
2024-04-10 08:09:02,731 - lmdeploy - INFO - input chat_template_config=None
2024-04-10 08:09:02,731 - lmdeploy - WARNING - Could not find liuhaotian/llava-v1.6-34b-int4 in registered models. Register liuhaotian/llava-v1.6-34b-int4 using the BaseChatTemplate.
2024-04-10 08:09:02,731 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='liuhaotian/llava-v1.6-34b-int4', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-04-10 08:09:02,781 - lmdeploy - WARNING - model_source: hf_model
2024-04-10 08:09:05,017 - lmdeploy - WARNING - model_config:
[llama]
model_name = base
tensor_para_size = 4
head_num = 56
kv_head_num = 8
vocab_size = 64000
num_layer = 60
inter_size = 73400320
norm_eps = 1e-05
attn_bias = 0
start_id = 64000
end_id = 7
session_len = 8192
weight_type = fp16
rotary_embedding = 128
rope_theta = 5000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.1
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 8192
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 4096
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
[TM][INFO] Set logger level by INFO
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 8192.
Exception in thread Thread-6 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-7 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-4 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-5 (_create_weight_func):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Exception in thread Thread-8 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-9 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-10 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
Exception in thread self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
Thread-11 (_get_params):
Traceback (most recent call last):
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run() out = model_comm.get_params(device_id, rank)
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self.run()
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
out = model_comm.get_params(device_id, rank)self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
out = model_comm.get_params(device_id, rank)
RuntimeError
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
out = model_comm.get_params(device_id, rank)
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384
```
Despite the error, GPU memory usage appears to be low (286MiB/22GiB). And this is my `pip list`:

```
Package Version
------------------------- -----------
accelerate 0.21.0
addict 2.4.0
aiofiles 23.2.1
altair 5.3.0
annotated-types 0.6.0
anyio 4.3.0
attrs 23.2.0
bitsandbytes 0.43.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
contourpy 1.2.1
cycler 0.12.1
einops 0.6.1
einops-exts 0.0.4
exceptiongroup 1.2.0
fastapi 0.110.1
ffmpy 0.3.2
filelock 3.13.3
fire 0.6.0
fonttools 4.50.0
fsspec 2024.3.1
gradio 4.16.0
gradio_client 0.8.1
h11 0.14.0
httpcore 0.17.3
httpx 0.24.0
huggingface-hub 0.22.2
idna 3.6
importlib_metadata 7.1.0
importlib_resources 6.4.0
Jinja2 3.1.3
joblib 1.3.2
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
llava 1.2.2.post1
lmdeploy 0.3.0
markdown-it-py 3.0.0
markdown2 2.4.13
MarkupSafe 2.1.5
matplotlib 3.8.4
mdurl 0.1.2
mmengine-lite 0.10.3
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
orjson 3.10.0
packaging 24.0
pandas 2.2.1
peft 0.9.0
pillow 10.3.0
pip 23.3.1
platformdirs 4.2.0
protobuf 5.26.1
psutil 5.9.8
pydantic 2.6.4
pydantic_core 2.16.3
pydub 0.25.1
Pygments 2.17.2
pynvml 11.5.0
pyparsing 3.1.2
python-dateutil 2.9.0.post0
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
rich 13.7.1
rpds-py 0.18.0
ruff 0.3.5
safetensors 0.4.2
scikit-learn 1.2.2
scipy 1.13.0
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 68.2.2
shellingham 1.5.4
shortuuid 1.0.13
six 1.16.0
sniffio 1.3.1
starlette 0.37.2
svgwrite 1.4.3
sympy 1.12
termcolor 2.4.0
threadpoolctl 3.4.0
tiktoken 0.6.0
timm 0.6.13
tokenizers 0.15.1
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
torch 2.1.2
torchvision 0.16.2
tqdm 4.66.2
transformers 4.37.2
triton 2.1.0
typer 0.12.0
typer-cli 0.12.0
typer-slim 0.12.0
typing_extensions 4.10.0
tzdata 2024.1
urllib3 2.2.1
uvicorn 0.29.0
wavedrom 2.0.3.post3
websockets 11.0.3
wheel 0.41.2
yapf 0.40.2
zipp 3.18.1
```
Thanks a lot for your help!
Currently, VL models only support the turbomind backend, which only accepts the AWQ quantization format. Since llava shares its format with llama, you can use our quantization tools to quantize the model.
Here is the guide: https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/w4a16.md
To quantize a llava model, you have to modify the code according to this diff: https://github.com/InternLM/lmdeploy/commit/0b40aecc5877cd97a0e0622f9cb3fa57298b1d83
By the way, `load_4bit` uses bitsandbytes, which quantizes dynamically at runtime. It is not very efficient, and in my earlier tests it was slower than the fp16/bf16 formats.
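Once an AWQ checkpoint has been produced with `lmdeploy lite auto_awq` per that guide, loading it should look roughly like the working pipeline above with `model_format='awq'`; a sketch, where the `-awq` path is an assumed `--work-dir` output, not one from this thread:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch: point model_path at the auto_awq work-dir and switch model_format
# from 'hf' to 'awq' so turbomind loads the 4-bit weights.
pipe = pipeline(
    model_path='/root/ssd/llava-v1.6-34b-awq',  # assumed auto_awq --work-dir
    backend_config=TurbomindEngineConfig(
        tp=4,
        model_format='awq',
        session_len=8192,
        cache_max_entry_count=0.1,
    ),
)
```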
Thank you very much! The quantization now runs, but it throws the following assertion error:

```
(lmdeploy) root@ubuntu:~/8h/LLaVA/models# CUDA_VISIBLE_DEVICES=6 lmdeploy lite auto_awq /root/ssd/llava-v1.6-34b --w-group-size 32 --work-dir /root/ssd/llava-v1.6-34b-awq
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 15/15 [00:14<00:00, 1.07it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.layers.28 to CPU.
Move model.layers.29 to CPU.
Move model.layers.30 to CPU.
Move model.layers.31 to CPU.
Move model.layers.32 to CPU.
Move model.layers.33 to CPU.
Move model.layers.34 to CPU.
Move model.layers.35 to CPU.
Move model.layers.36 to CPU.
Move model.layers.37 to CPU.
Move model.layers.38 to CPU.
Move model.layers.39 to CPU.
Move model.layers.40 to CPU.
Move model.layers.41 to CPU.
Move model.layers.42 to CPU.
Move model.layers.43 to CPU.
Move model.layers.44 to CPU.
Move model.layers.45 to CPU.
Move model.layers.46 to CPU.
Move model.layers.47 to CPU.
Move model.layers.48 to CPU.
Move model.layers.49 to CPU.
Move model.layers.50 to CPU.
Move model.layers.51 to CPU.
Move model.layers.52 to CPU.
Move model.layers.53 to CPU.
Move model.layers.54 to CPU.
Move model.layers.55 to CPU.
Move model.layers.56 to CPU.
Move model.layers.57 to CPU.
Move model.layers.58 to CPU.
Move model.layers.59 to CPU.
Move model.norm to GPU.
Move model.vision_tower to GPU.
Move model.mm_projector to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 128, max gpu memory: 13.07 GB
model.layers.1, samples: 128, max gpu memory: 16.57 GB
model.layers.2, samples: 128, max gpu memory: 16.57 GB
model.layers.3, samples: 128, max gpu memory: 16.57 GB
model.layers.4, samples: 128, max gpu memory: 16.57 GB
model.layers.5, samples: 128, max gpu memory: 16.57 GB
model.layers.6, samples: 128, max gpu memory: 16.57 GB
model.layers.7, samples: 128, max gpu memory: 16.57 GB
model.layers.8, samples: 128, max gpu memory: 16.57 GB
model.layers.9, samples: 128, max gpu memory: 16.57 GB
model.layers.10, samples: 128, max gpu memory: 16.57 GB
model.layers.11, samples: 128, max gpu memory: 16.57 GB
model.layers.12, samples: 128, max gpu memory: 16.57 GB
model.layers.13, samples: 128, max gpu memory: 16.57 GB
model.layers.14, samples: 128, max gpu memory: 16.57 GB
model.layers.15, samples: 128, max gpu memory: 16.57 GB
model.layers.16, samples: 128, max gpu memory: 16.57 GB
model.layers.17, samples: 128, max gpu memory: 16.57 GB
model.layers.18, samples: 128, max gpu memory: 16.57 GB
model.layers.19, samples: 128, max gpu memory: 16.57 GB
model.layers.20, samples: 128, max gpu memory: 16.57 GB
model.layers.21, samples: 128, max gpu memory: 16.57 GB
model.layers.22, samples: 128, max gpu memory: 16.57 GB
model.layers.23, samples: 128, max gpu memory: 16.57 GB
model.layers.24, samples: 128, max gpu memory: 16.57 GB
model.layers.25, samples: 128, max gpu memory: 16.57 GB
model.layers.26, samples: 128, max gpu memory: 16.57 GB
model.layers.27, samples: 128, max gpu memory: 16.57 GB
model.layers.28, samples: 128, max gpu memory: 16.57 GB
model.layers.29, samples: 128, max gpu memory: 16.57 GB
model.layers.30, samples: 128, max gpu memory: 16.57 GB
model.layers.31, samples: 128, max gpu memory: 16.57 GB
model.layers.32, samples: 128, max gpu memory: 16.57 GB
model.layers.33, samples: 128, max gpu memory: 16.57 GB
model.layers.34, samples: 128, max gpu memory: 16.57 GB
model.layers.35, samples: 128, max gpu memory: 16.57 GB
model.layers.36, samples: 128, max gpu memory: 16.57 GB
model.layers.37, samples: 128, max gpu memory: 16.57 GB
model.layers.38, samples: 128, max gpu memory: 16.57 GB
model.layers.39, samples: 128, max gpu memory: 16.57 GB
model.layers.40, samples: 128, max gpu memory: 16.57 GB
model.layers.41, samples: 128, max gpu memory: 16.57 GB
model.layers.42, samples: 128, max gpu memory: 16.57 GB
model.layers.43, samples: 128, max gpu memory: 16.57 GB
model.layers.44, samples: 128, max gpu memory: 16.57 GB
model.layers.45, samples: 128, max gpu memory: 16.57 GB
model.layers.46, samples: 128, max gpu memory: 16.57 GB
model.layers.47, samples: 128, max gpu memory: 16.57 GB
model.layers.48, samples: 128, max gpu memory: 16.57 GB
model.layers.49, samples: 128, max gpu memory: 16.57 GB
model.layers.50, samples: 128, max gpu memory: 16.57 GB
model.layers.51, samples: 128, max gpu memory: 16.57 GB
model.layers.52, samples: 128, max gpu memory: 16.57 GB
model.layers.53, samples: 128, max gpu memory: 16.57 GB
model.layers.54, samples: 128, max gpu memory: 16.57 GB
model.layers.55, samples: 128, max gpu memory: 16.57 GB
model.layers.56, samples: 128, max gpu memory: 16.57 GB
model.layers.57, samples: 128, max gpu memory: 16.57 GB
model.layers.58, samples: 128, max gpu memory: 16.57 GB
model.layers.59, samples: 128, max gpu memory: 16.57 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
Traceback (most recent call last):
File "/root/miniconda3/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
sys.exit(run())
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
args.run(args)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
auto_awq(**kwargs)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 69, in auto_awq
smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 233, in smooth_layers
smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 109, in smooth_ln_fcs
assert torch.isnan(p).sum() == 0
AssertionError
```
This seems like it might be related to this issue: https://github.com/InternLM/lmdeploy/issues/243. Is the message "Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors" relevant here? Perhaps I should switch to a different calibration dataset?
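If switching calibration data is worth trying, a sketch using the same `auto_awq` entry point that appears in the traceback above; the keyword names are assumed to mirror the CLI flags (`--calib-dataset`, `--calib-samples`, `--calib-seqlen`) and are not confirmed in this thread:

```python
from lmdeploy.lite.apis.auto_awq import auto_awq

# Sketch: recalibrate on a different dataset instead of the default ptb_text_only.
auto_awq(
    '/root/ssd/llava-v1.6-34b',
    work_dir='/root/ssd/llava-v1.6-34b-awq',
    calib_dataset='wikitext2',  # assumed alternative to ptb
    calib_samples=128,
    calib_seqlen=2048,
    w_group_size=128,  # turbomind only supports 128 (see the next reply)
)
```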
Don't change the `--w-group-size` argument; turbomind currently only supports 128. I tried 128 yesterday and it worked.
I tried 128, 64, and 32 locally, and all of them threw the exception at the same place. Could you share your successfully quantized model? Thank you!
Yesterday I tried llava-v1.5-7b and llava-v1.6-vicuna-7b.
I just tried llava-v1.6-34b and it also throws this error; it may be the same problem as the issue you mentioned. @pppppM is there any workaround for this at the moment?
The quantization broke down: calibration produced NaN values in the parameters. The calibration strategy probably needs to be adjusted.
ref https://github.com/InternLM/lmdeploy/issues/243#issuecomment-1770503299
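For anyone debugging this, a minimal sketch (not part of lmdeploy) that mirrors the failing `torch.isnan(p).sum() == 0` assertion and reports which parameters went NaN after smoothing:

```python
import torch

def report_nan_params(model: torch.nn.Module) -> None:
    # Mirrors the assertion in lmdeploy/lite/quantization/awq.py:
    # any parameter containing NaNs after smoothing would trip it.
    for name, p in model.named_parameters():
        nans = torch.isnan(p).sum().item()
        if nans:
            print(f'{name}: {nans} NaN values')
```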
While quantizing a model using lmdeploy, I am also getting an issue with `lmdeploy lite auto_awq ./llama2-chat-7b-w4 --work-dir ./llama2-chat-7b-4bit`:

```
Traceback (most recent call last):
  File "/home/userdata/.local/bin/lmdeploy", line 8, in
```
is "./llama2-chat-7b-w4 " already a quantized model?