[Bug]: Issue when trying to load an AWQ model with --load-in-4bit for Mixtral flavors
Your current environment
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 3090
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7900X 12-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU max MHz: 5650.0972
CPU min MHz: 3000.0000
BogoMIPS: 9382.48
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 12 MiB (12 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
That's the output of my host (I'm running the engine with the official Docker image).
🐛 Describe the bug
When I try to load an AWQ-quantized model with --load-in-4bit and the model is a Mixtral-style MoE, it throws the following stack trace:
(RayWorkerAphrodite pid=1521) INFO: Memory allocated for converted model: 6.04 GiB
(RayWorkerAphrodite pid=1521) INFO: Memory reserved for converted model: 6.08 GiB
(RayWorkerAphrodite pid=1521) INFO: Model weights loaded. Memory usage: 6.04 GiB x 2 = 12.08 GiB
INFO: Model weights loaded. Memory usage: 6.04 GiB x 2 = 12.08 GiB
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 118, in __init__
self._init_cache()
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 321, in _init_cache
num_blocks = self._run_workers(
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 136, in profile_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 758, in profile_run
self.execute_model(seqs, kv_caches)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 692, in execute_model
hidden_states = model_executable(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 413, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 381, in forward
hidden_states, residual = layer(positions, hidden_states,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 344, in forward
hidden_states = self.block_sparse_moe(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 172, in forward
current_hidden_states = expert_layer(hidden_states).mul_(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 105, in forward
w1_out, _ = self.w1(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/modeling/layers/linear.py", line 134, in forward
output = self.linear_method.apply_weights(self.linear_weights, x, bias)
File "/app/aphrodite-engine/aphrodite/modeling/layers/quantization/bitsandbytes.py", line 186, in apply_weights
scales_zeros = weights["scales_zeros"].data
KeyError: 'scales_zeros'
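The failure is an unguarded dict access: the bitsandbytes path assumes every linear layer was converted to its 4-bit layout, but the Mixtral expert layers apparently were not. For reference, a minimal sketch of a guard that would surface a clearer error (only the apply_weights name and the "scales_zeros" key come from the traceback above; the signature, fallback message, and everything else are assumptions, not the actual fix):

import torch

def apply_weights(weights: dict, x: torch.Tensor) -> torch.Tensor:
    if "scales_zeros" not in weights:
        # The MoE expert linears were never converted to the
        # bitsandbytes 4-bit layout, so the key is missing; raise a
        # clear error instead of a bare KeyError.
        raise ValueError(
            "load-in-4bit: layer has no 'scales_zeros' tensor; the "
            "layer was not converted (Mixtral-style MoE experts appear "
            "to be unsupported with AWQ + --load-in-4bit).")
    scales_zeros = weights["scales_zeros"].data
    ...  # dequantize with scales_zeros and matmul, as in the real code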
Entry point command executed inside the Docker container:
python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 3000 --download-dir /data/hub --model macadeliccc/laser-dolphin-mixtral-4x7b-dpo-AWQ --dtype float16 --kv-cache-dtype fp8_e5m2 --max-model-len 12000 --tensor-parallel-size 2 --gpu-memory-utilization .98 --enforce-eager --block-size 8 --max-paddings 512 --port 3000 --swap-space 10 --chat-template /home/workspace/chat_templates/chat_ml.jinja --served-model-name dolf --max-context-len-to-capture 512 --max-num-batched-tokens 32000 --max-num-seqs 62 --quantization awq --load-in-4bit
Please remove the --quantization awq part and try again.
Will try.
@AlpinDale nope, same stack trace. I also checked with an AWQ model that is not a Mixtral-style MoE, and it works like a charm, including with --quantization awq + --load-in-4bit. I noticed a significant increase in token generation speed compared to loading the AWQ model without the --load-in-4bit parameter.
How large is the speed increase when adding --load-in-4bit to AWQ models, and did you notice it on all models? Also, does it affect generation quality at all?
@SalomonKisters on a 4090, I would say the speed increase is noticeable to the naked eye (I haven't benchmarked it yet).
Okay, sounds nice. So you just use AWQ-quantized models with "--quantization awq --load-in-4bit"?
Yes, but it is not working for MoEs; for MoEs, GPTQ is the best option right now, I think...
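For example, something along these lines (a sketch; the model name is just a placeholder for a GPTQ-quantized Mixtral, and the other flags follow the command shown above):

python3 -m aphrodite.endpoints.openai.api_server --model <gptq-mixtral-model> --quantization gptq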
Ah, makes sense. But for GPTQ models it doesn't work combined with --load-in-4bit, right?
Right.
Any update? I have the same error message.
As of v0.6.0, the --load-in-{4bit,8bit,smooth} args have been removed. Please use -q fp8 instead.
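For example, a minimal invocation (the model name is a placeholder; -q fp8 is taken from the comment above):

python3 -m aphrodite.endpoints.openai.api_server --model <your-model> -q fp8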