[Bug] Unrecognized configuration class when quantizing llava
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
Describe the bug
When running w4a16 quantization for llava models, transformers==4.42.0 raises an Unrecognized configuration class error, i.e., the llava model class is not registered in transformers and thus cannot be found. I know this is not really a bug in lmdeploy itself. I've seen the exact same issue reported in other repos (e.g. here), where the suggestion was to use transformers==4.31.0, but that didn't help.
I was also surprised that no one else has raised this issue, since plenty of people seem to have succeeded in quantizing llava models. By opening this issue I want to check whether there is anything wrong on my side.
Note that below I was trying to quantize lmms-lab/llama3-llava-next-8b, but the same error also occurs when switching to liuhaotian/llava-v1.5-7b.
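For reference, the same failure can be reproduced with transformers alone, outside lmdeploy; a minimal sketch mirroring the call shown in the traceback at the bottom of this issue:

from transformers import AutoModelForCausalLM

# Raises ValueError: Unrecognized configuration class LlavaConfig for
# AutoModelForCausalLM, because llava configs are not in its model mapping.
AutoModelForCausalLM.from_pretrained("lmms-lab/llama3-llava-next-8b")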
What I've tried
I tried switching the transformers version between 4.31.0, the latest 4.42.0, and the commit pinned by the llava authors (transformers @ git+https://github.com/huggingface/transformers.git@1c39974a4c4036fd641bc1191cc32799f85715a4); none of them worked. This is somewhat expected, because regardless of the transformers version I would expect some manual registration to be performed, like here? A sketch of what I mean follows below.
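The kind of registration I have in mind looks roughly like the following. This is only a sketch: the import path and the LlavaConfig/LlavaLlamaForCausalLM names follow the upstream LLaVA repo's convention and are assumptions, not something lmdeploy does today.

from transformers import AutoConfig, AutoModelForCausalLM
# Assumed import from the llava package (installed separately from transformers).
from llava.model.language_model.llava_llama import LlavaConfig, LlavaLlamaForCausalLM

# Teach the Auto classes about the custom "llava_llama" model type so that
# AutoModelForCausalLM.from_pretrained can resolve it.
AutoConfig.register("llava_llama", LlavaConfig)
AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)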
Reproduction
export HF_MODEL=lmms-lab/llama3-llava-next-8b
export WORK_DIR=awq/llama3-llava-next-8b-4bit
lmdeploy lite auto_awq \
$HF_MODEL \
--calib-dataset 'c4' \
--calib-samples 512 \
--calib-seqlen 1024 \
--w-bits 4 \
--w-group-size 128 \
--work-dir $WORK_DIR
Environment
sys.platform: linux
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA L40S
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
LMDeploy: 0.4.1+
transformers: 4.40.2
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0
Error traceback
Traceback (most recent call last):
File "/home/jz288/anaconda3/envs/lmd/bin/lmdeploy", line 8, in <module>
sys.exit(run())
File "/home/jz288/anaconda3/envs/lmd/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
args.run(args)
File "/home/jz288/anaconda3/envs/lmd/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
auto_awq(**kwargs)
File "/home/jz288/anaconda3/envs/lmd/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 55, in auto_awq
model, tokenizer, work_dir = calibrate(model, calib_dataset, calib_samples,
File "/home/jz288/anaconda3/envs/lmd/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 152, in calibrate
model = load_hf_from_pretrained(model,
File "/home/jz288/anaconda3/envs/lmd/lib/python3.10/site-packages/lmdeploy/lite/utils/load.py", line 31, in load_hf_from_pretrained
model = AutoModelForCausalLM.from_pretrained(
File "/home/jz288/anaconda3/envs/lmd/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers.models.llava.configuration_llava.LlavaConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, LlamaConfig, MambaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.
Quantization of VL models is not supported until #1553 gets merged. You may try the PR directly if you are in a hurry.
@AllentDan Thanks for your reply. I have two follow-up questions and would appreciate further confirmation.
- I'm not sure whether I can build from source on my server to include the PR, so I'm wondering if there is a rough expected date for the new release that will support VLM quantization?
- I saw several issues mentioning that llava models (or other VLMs) were quantized successfully, see for example #1511 from about 3 weeks ago. I'm wondering whether an older pre-built version of lmdeploy might work?
- The next version of lmdeploy will be released in two weeks.
- Yes, you may try the steps in other issues. Alternatively, you can modify your locally installed lmdeploy package according to my PR. Since you only need llava, there should be a limited number of files to change. For example, the error log above indicates that you should modify the loading logic in
/home/jz288/anaconda3/envs/lmd/lib/python3.10/site-packages/lmdeploy/lite/utils/load.py
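In case it helps, here is a rough sketch of the kind of change meant, not the actual diff from #1553; the function signature and attribute names are assumptions. The idea is to detect a llava-style config and load the full vision-language model, then hand its language_model submodule to calibration instead of calling AutoModelForCausalLM on the wrapper config.

from transformers import AutoConfig, AutoModelForCausalLM

def load_hf_from_pretrained(pretrained_model_name_or_path, **kwargs):
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path,
                                        trust_remote_code=True)
    if getattr(config, 'model_type', None) == 'llava':
        # llava configs map to LlavaForConditionalGeneration rather than a
        # causal-LM class, so load the VLM and pick out its text decoder.
        from transformers import LlavaForConditionalGeneration
        vlm = LlavaForConditionalGeneration.from_pretrained(
            pretrained_model_name_or_path, **kwargs)
        # Calibrate/quantize only the language model; the vision tower is untouched.
        return vlm.language_model
    return AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path, **kwargs)

If the checkpoint's config resolves to a different model_type (e.g. a llava-next variant), the branch above would need to be adjusted accordingly.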
Thank you very much. I will try that. If it's OK, I'd like to keep this issue open for now.
Supported in the latest main.