[Bug]: Issue when trying to load an AWQ model with --load-in-4bit for Mixtral flavors
Your current environment
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 3090
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7900X 12-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU max MHz: 5650.0972
CPU min MHz: 3000.0000
BogoMIPS: 9382.48
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 12 MiB (12 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
That's the output of my host (I'm running the engine with the official Docker image).
🐛 Describe the bug
When I try to load an AWQ-quantized model with --load-in-4bit and the model is a Mixtral-style MoE, it throws the following stack trace:
(RayWorkerAphrodite pid=1521) INFO: Memory allocated for converted model: 6.04 GiB
(RayWorkerAphrodite pid=1521) INFO: Memory reserved for converted model: 6.08 GiB
(RayWorkerAphrodite pid=1521) INFO: Model weights loaded. Memory usage: 6.04 GiB x 2 = 12.08 GiB
INFO: Model weights loaded. Memory usage: 6.04 GiB x 2 = 12.08 GiB
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 118, in __init__
self._init_cache()
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 321, in _init_cache
num_blocks = self._run_workers(
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 136, in profile_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 758, in profile_run
self.execute_model(seqs, kv_caches)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 692, in execute_model
hidden_states = model_executable(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 413, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 381, in forward
hidden_states, residual = layer(positions, hidden_states,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 344, in forward
hidden_states = self.block_sparse_moe(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 172, in forward
current_hidden_states = expert_layer(hidden_states).mul_(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 105, in forward
w1_out, _ = self.w1(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/layers/linear.py", line 134, in forward
output = self.linear_method.apply_weights(self.linear_weights, x, bias)
File "/app/aphrodite-engine/aphrodite/modeling/layers/quantization/bitsandbytes.py", line 186, in apply_weights
scales_zeros = weights["scales_zeros"].data
KeyError: 'scales_zeros'
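The failure is an unguarded dict access: the bitsandbytes path assumes every linear layer was converted to its 4-bit layout, but the Mixtral expert layers apparently were not. For reference, a minimal sketch of a guard that would surface a clearer error (only the apply_weights name and the "scales_zeros" key come from the traceback above; the signature, fallback message, and everything else are assumptions, not the actual fix):

import torch

def apply_weights(weights: dict, x: torch.Tensor) -> torch.Tensor:
    if "scales_zeros" not in weights:
        # The MoE expert linears were never converted to the
        # bitsandbytes 4-bit layout, so the key is missing; raise a
        # clear error instead of a bare KeyError.
        raise ValueError(
            "load-in-4bit: layer has no 'scales_zeros' tensor; the "
            "layer was not converted (Mixtral-style MoE experts appear "
            "to be unsupported with AWQ + --load-in-4bit).")
    scales_zeros = weights["scales_zeros"].data
    ...  # dequantize with scales_zeros and matmul, as in the real code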
Entry point command executed inside the Docker container:
python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 3000 --download-dir /data/hub --model macadeliccc/laser-dolphin-mixtral-4x7b-dpo-AWQ --dtype float16 --kv-cache-dtype fp8_e5m2 --max-model-len 12000 --tensor-parallel-size 2 --gpu-memory-utilization .98 --enforce-eager --block-size 8 --max-paddings 512 --port 3000 --swap-space 10 --chat-template /home/workspace/chat_templates/chat_ml.jinja --served-model-name dolf --max-context-len-to-capture 512 --max-num-batched-tokens 32000 --max-num-seqs 62 --quantization awq --load-in-4bit
Please remove the --quantization awq part and try again.
Will try.
@AlpinDale nope, same stack trace. I also checked with an AWQ model that is not a Mixtral-style MoE, and it works like a charm, including with --quantization awq + --load-in-4bit. I noticed a significant increase in token generation speed compared to loading the AWQ model without the --load-in-4bit parameter.
How large is the speed increase when adding --load-in-4bit to AWQ models, and did you notice it on all models? Also, does it affect generation quality at all?
@SalomonKisters on a 4090, I would say the speed increase is noticeable to the naked eye (I haven't benchmarked it yet).
Okay, sounds nice. So you just use AWQ-quantized models with "--quantization awq --load-in-4bit"?
Yes, but it is not working for MoEs; for MoEs, GPTQ is the best option right now, I think...
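For example, something along these lines (a sketch; the model name is just a placeholder for a GPTQ-quantized Mixtral, and the other flags follow the command shown above):

python3 -m aphrodite.endpoints.openai.api_server --model <gptq-mixtral-model> --quantization gptq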
Ah, makes sense. But for GPTQ models it doesn't work combined with --load-in-4bit, right?
Right.
Any update? I have the same error message.
As of v0.6.0, the --load-in-{4bit,8bit,smooth} args have been removed. Please use -q fp8 instead.
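For example, a minimal invocation (the model name is a placeholder; -q fp8 is taken from the comment above):

python3 -m aphrodite.endpoints.openai.api_server --model <your-model> -q fp8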