OpenLLM
bug: can't load GPTQ quantized model
Describe the bug
I'm trying to run one of TheBloke's quantized models on an A100 40GB. It is not one of the most recent models.
To reproduce
openllm start llama --model-id TheBloke/WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ --quantize gptq
However, I get the following error:
2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] An exception occurred while instantiating runner 'llm-llama-runner', see details below:
2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
self._set_handle(LocalRunnerRef)
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
runner_handle = handle_class(self, *args, **kwargs)
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
self._runnable = runner.runnable_class(**runner.runnable_init_params) # type: ignore
File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
model = self.load_model(*self._model_decls, **self._model_attrs)
File "/home/be/.local/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
return fn(self, *decls, **attrs)
File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
File "/home/be/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/home/be/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2556, in from_pretrained
quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'
2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
File "/home/be/.local/lib/python3.9/site-packages/starlette/routing.py", line 677, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/usr/lib/python3.9/contextlib.py", line 175, in __aenter__
return await self.gen.__anext__()
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
on_startup()
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
raise e
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
self._set_handle(LocalRunnerRef)
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
runner_handle = handle_class(self, *args, **kwargs)
File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
self._runnable = runner.runnable_class(**runner.runnable_init_params) # type: ignore
File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
model = self.load_model(*self._model_decls, **self._model_attrs)
File "/home/be/.local/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
return fn(self, *decls, **attrs)
File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
File "/home/be/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/home/be/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2556, in from_pretrained
quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'
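The failing line in `modeling_utils.py` calls `.get()` on `config.quantization_config`, which only works when that attribute is a plain dict (as parsed from the model's `config.json`), not when it has already been wrapped into a `GPTQConfig` object. The sketch below reproduces the failure mode with a stand-in class (`GPTQConfigStandIn` is hypothetical, not the real transformers API):

```python
# Illustration of the AttributeError's mechanism using stand-in types;
# transformers 4.31's modeling_utils.py expects a dict here.
from dataclasses import dataclass

@dataclass
class GPTQConfigStandIn:
    """Stand-in for transformers.GPTQConfig: a config object, not a dict."""
    bits: int = 4
    quant_method: str = "gptq"

def read_quant_method(quantization_config):
    # Mirrors the failing line:
    #   quantization_method_from_config = config.quantization_config.get(...)
    return quantization_config.get("quant_method", None)

# Works when the config is a raw dict, as parsed from config.json:
print(read_quant_method({"bits": 4, "quant_method": "gptq"}))  # gptq

# Fails once it has been wrapped into a config object:
try:
    read_quant_method(GPTQConfigStandIn())
except AttributeError as e:
    print(e)  # 'GPTQConfigStandIn' object has no attribute 'get'
```

This suggests a version mismatch between how the checkpoint's quantization config was serialized and what the installed transformers release expects.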
Environment
System information
bentoml: 1.1.6
python: 3.9.2
platform: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31
uid_gid: 1000:1001
conda: 23.5.0
in_conda_env: True
name: base
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- boltons=23.0.0=py310h06a4308_0
- brotlipy=0.7.0=py310h7f8727e_1002
- bzip2=1.0.8=h7b6447c_0
- ca-certificates=2023.01.10=h06a4308_0
- certifi=2023.5.7=py310h06a4308_0
- cffi=1.15.1=py310h5eee18b_3
- charset-normalizer=2.0.4=pyhd3eb1b0_0
- conda=23.5.0=py310h06a4308_0
- conda-content-trust=0.1.3=py310h06a4308_0
- conda-package-handling=2.1.0=py310h06a4308_0
- conda-package-streaming=0.8.0=py310h06a4308_0
- cryptography=39.0.1=py310h9ce1e76_0
- jsonpatch=1.32=pyhd3eb1b0_0
- jsonpointer=2.1=pyhd3eb1b0_0
- ld_impl_linux-64=2.38=h1181459_1
- libffi=3.4.4=h6a678d5_0
- libgcc-ng=11.2.0=h1234567_1
- libgomp=11.2.0=h1234567_1
- libstdcxx-ng=11.2.0=h1234567_1
- libuuid=1.41.5=h5eee18b_0
- ncurses=6.4=h6a678d5_0
- openssl=1.1.1t=h7f8727e_0
- packaging=23.0=py310h06a4308_0
- pip=23.0.1=py310h06a4308_0
- pluggy=1.0.0=py310h06a4308_1
- pycosat=0.6.4=py310h5eee18b_0
- pycparser=2.21=pyhd3eb1b0_0
- pyopenssl=23.0.0=py310h06a4308_0
- pysocks=1.7.1=py310h06a4308_0
- python=3.10.10=h7a1cb2a_2
- readline=8.2=h5eee18b_0
- ruamel.yaml=0.17.21=py310h5eee18b_0
- ruamel.yaml.clib=0.2.6=py310h5eee18b_1
- setuptools=65.6.3=py310h06a4308_0
- six=1.16.0=pyhd3eb1b0_1
- sqlite=3.41.2=h5eee18b_0
- tk=8.6.12=h1ccaba5_0
- toolz=0.12.0=py310h06a4308_0
- tqdm=4.65.0=py310h2f386ee_0
- urllib3=1.26.15=py310h06a4308_0
- wheel=0.38.4=py310h06a4308_0
- xz=5.4.2=h5eee18b_0
- zlib=1.2.13=h5eee18b_0
- zstandard=0.19.0=py310h5eee18b_0
- pip:
- absl-py==1.4.0
- accelerate==0.21.0
- addict==2.4.0
- aenum==3.1.12
- aiofiles==23.1.0
- aiohttp==3.8.4
- aiosignal==1.3.1
- altair==5.0.1
- annotated-types==0.5.0
- antlr4-python3-runtime==4.9.3
- anyio==3.7.0
- appdirs==1.4.4
- asgiref==3.7.2
- async-timeout==4.0.2
- attrs==23.1.0
- basicsr==1.4.2
- beautifulsoup4==4.12.2
- bentoml==1.1.0
- blendmodes==2022
- build==0.10.0
- cachetools==5.3.1
- cattrs==23.1.2
- chardet==4.0.0
- circus==0.18.0
- clean-fid==0.1.29
- click==8.1.3
- click-option-group==0.5.6
- clip==1.0
- cloudpickle==2.2.1
- cmake==3.26.4
- compel==2.0.1
- contextlib2==21.6.0
- contourpy==1.0.7
- controlnet-aux==0.0.6
- cssselect2==0.7.0
- cycler==0.11.0
- deepmerge==1.1.0
- deprecated==1.2.14
- deprecation==2.1.0
- diffusers==0.19.3
- einops==0.4.1
- exceptiongroup==1.1.1
- facexlib==0.3.0
- fastapi==0.100.1
- ffmpy==0.3.0
- filelock==3.12.2
- filterpy==1.4.5
- flatbuffers==23.5.26
- font-roboto==0.0.1
- fonts==0.0.3
- fonttools==4.40.0
- frozenlist==1.3.3
- fs==2.4.16
- fsspec==2023.6.0
- ftfy==6.1.1
- future==0.18.3
- fvcore==0.1.5.post20221221
- gdown==4.7.1
- gfpgan==1.3.8
- gitdb==4.0.10
- gitpython==3.1.30
- google-auth==2.19.1
- google-auth-oauthlib==1.0.0
- gradio==3.28.1
- gradio-client==0.2.6
- grpcio==1.54.2
- h11==0.12.0
- httpcore==0.15.0
- httpx==0.24.1
- huggingface-hub==0.15.1
- idna==2.10
- imageio==2.31.1
- importlib-metadata==6.0.1
- inflection==0.5.1
- invisible-watermark==0.2.0
- iopath==0.1.9
- jinja2==3.1.2
- jsonmerge==1.8.0
- jsonschema==4.17.3
- kiwisolver==1.4.4
- kornia==0.6.7
- lark==1.1.2
- lazy-loader==0.2
- lightning-utilities==0.8.0
- linkify-it-py==2.0.2
- lit==16.0.5.post0
- llvmlite==0.40.1rc1
- lmdb==1.4.1
- lpips==0.1.4
- lxml==4.9.2
- markdown==3.4.3
- markdown-it-py==2.2.0
- markupsafe==2.1.3
- matplotlib==3.7.1
- mdit-py-plugins==0.3.3
- mdurl==0.1.2
- mediapipe==0.10.1
- mpmath==1.3.0
- multidict==6.0.4
- mypy-extensions==1.0.0
- networkx==3.1
- numba==0.57.1
- numpy==1.24.4
- nvidia-cublas-cu11==11.10.3.66
- nvidia-cuda-cupti-cu11==11.7.101
- nvidia-cuda-nvrtc-cu11==11.7.99
- nvidia-cuda-runtime-cu11==11.7.99
- nvidia-cudnn-cu11==8.5.0.96
- nvidia-cufft-cu11==10.9.0.58
- nvidia-curand-cu11==10.2.10.91
- nvidia-cusolver-cu11==11.4.0.1
- nvidia-cusparse-cu11==11.7.4.91
- nvidia-nccl-cu11==2.14.3
- nvidia-nvtx-cu11==11.7.91
- oauthlib==3.2.2
- omegaconf==2.2.3
- open-clip-torch==2.7.0
- opencv-contrib-python==4.7.0.72
- opencv-python==4.7.0.72
- opentelemetry-api==1.18.0
- opentelemetry-instrumentation==0.39b0
- opentelemetry-instrumentation-aiohttp-client==0.39b0
- opentelemetry-instrumentation-asgi==0.39b0
- opentelemetry-sdk==1.18.0
- opentelemetry-semantic-conventions==0.39b0
- opentelemetry-util-http==0.39b0
- orjson==3.9.1
- pandas==2.0.2
- pathspec==0.11.1
- piexif==1.1.3
- pillow==9.4.0
- pip-requirements-parser==32.0.1
- pip-tools==6.13.0
- platformdirs==3.5.3
- portalocker==2.7.0
- prometheus-client==0.17.0
- protobuf==3.20.0
- psutil==5.9.5
- pyasn1==0.5.0
- pyasn1-modules==0.3.0
- pydantic==2.1.1
- pydantic-core==2.4.0
- pydub==0.25.1
- pygments==2.15.1
- pynvml==11.5.0
- pyparsing==3.0.9
- pyproject-hooks==1.0.0
- pyre-extensions==0.0.29
- pyrsistent==0.19.3
- python-dateutil==2.8.2
- python-json-logger==2.0.7
- python-multipart==0.0.6
- pytorch-lightning==1.9.4
- pytz==2023.3
- pywavelets==1.4.1
- pyyaml==6.0
- pyzmq==25.1.0
- realesrgan==0.3.0
- regex==2023.6.3
- reportlab==4.0.4
- requests==2.25.1
- requests-oauthlib==1.3.1
- resize-right==0.0.2
- rich==13.4.2
- rsa==4.9
- safetensors==0.3.1
- schema==0.7.5
- scikit-image==0.19.2
- scipy==1.10.1
- semantic-version==2.10.0
- sentencepiece==0.1.99
- simple-di==0.1.5
- smmap==5.0.0
- sniffio==1.3.0
- sounddevice==0.4.6
- soupsieve==2.4.1
- starlette==0.27.0
- svglib==1.5.1
- sympy==1.12
- tabulate==0.9.0
- tb-nightly==2.14.0a20230613
- tensorboard-data-server==0.7.0
- termcolor==2.3.0
- tifffile==2023.4.12
- timm==0.6.7
- tinycss2==1.2.1
- tokenizers==0.13.3
- tomesd==0.1.3
- tomli==2.0.1
- torch==2.0.1
- torchdiffeq==0.2.3
- torchmetrics==0.11.4
- torchsde==0.2.5
- torchvision==0.15.2
- tornado==6.3.2
- trampoline==0.1.2
- transformers==4.31.0
- triton==2.0.0
- typing-extensions==4.6.3
- typing-inspect==0.9.0
- tzdata==2023.3
- uc-micro-py==1.0.2
- uvicorn==0.22.0
- watchfiles==0.19.0
- wcwidth==0.2.6
- webencodings==0.5.1
- websockets==11.0.3
- werkzeug==2.3.6
- wrapt==1.15.0
- xformers==0.0.20
- yacs==0.1.8
- yapf==0.40.0
- yarl==1.9.2
- zipp==3.15.0
I'm trying to run "TheBloke/Llama-2-13B-chat-GPTQ" with version 0.3.6 and I get the same error:
2023-10-13T09:36:44+0300 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/contextlib.py", line 181, in __aenter__
return await self.gen.__anext__()
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
on_startup()
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
raise e
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
self._set_handle(LocalRunnerRef)
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
runner_handle = handle_class(self, *args, **kwargs)
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
self._runnable = runner.runnable_class(**runner.runnable_init_params) # type: ignore
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
model = self.load_model(*self._model_decls, **self._model_attrs)
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
return fn(self, *decls, **attrs)
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2683, in from_pretrained
quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'
I wonder whether this is related to the models not being very recent (in light of the previous comment)?
It could be. On the TGI repo they mention it could have to do with an old quantization script from TheBloke (TGI shows a different error, but my guess is the cause might be similar).
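Until the version mismatch is resolved upstream, one possible workaround is to coerce the quantization config back into the dict form that this transformers release expects before loading. This is only a hedged sketch on a stand-in config object: `normalize_quantization_config` and `FakeGPTQConfig` are hypothetical names, and whether applying this to the real transformers config actually fixes the loading path here is an assumption (though `to_dict()` does exist on transformers' quantization config classes).

```python
# Hedged workaround sketch: convert an object-valued quantization_config
# back into a plain dict so that dict-style `.get()` calls succeed.
from types import SimpleNamespace

def normalize_quantization_config(config):
    """If config.quantization_config is a config object rather than a
    dict, replace it with its dict representation (hypothetical helper)."""
    qc = getattr(config, "quantization_config", None)
    if qc is not None and not isinstance(qc, dict) and hasattr(qc, "to_dict"):
        config.quantization_config = qc.to_dict()
    return config

class FakeGPTQConfig:
    """Stand-in for a GPTQConfig-like object."""
    def to_dict(self):
        return {"bits": 4, "quant_method": "gptq"}

# Usage on a stand-in model config:
cfg = SimpleNamespace(quantization_config=FakeGPTQConfig())
normalize_quantization_config(cfg)
print(cfg.quantization_config.get("quant_method"))  # gptq
```

Alternatively, pinning transformers and optimum to versions matching what the checkpoint was quantized with may avoid the mismatch entirely.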