[Bug]: Segfault with DeepSeek-V2
Your current environment
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35
Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1067-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4
Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (12 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] optree==0.12.1
[pip3] torch==2.4.0
[pip3] torcheval==0.0.7
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.6.0
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
🐛 Describe the bug
I am getting a SIGSEGV (segmentation fault) when trying to run any MoE model; the error originates from the fused_moe Triton kernel.
Code
!aphrodite run /local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/ --host 0.0.0.0 --port 1234 --served-model-name deepseek-v2-lite-chat-16b --trust-remote-code --max-model-len 8192 --enforce-eager --distributed-executor-backend=ray --enable-prefix-caching --tensor-parallel-size 8
Error trace
WARNING: Casting torch.bfloat16 to torch.float16.
2024-09-04 18:16:08,704 INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO:
--------------------------------------------------------------------------------
-----
INFO: Initializing Aphrodite Engine (v0.6.0 commit 75122b2) with the
following config:
INFO: Model = '/local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/'
INFO: DataType = torch.bfloat16
INFO: Tensor Parallel Size = 8
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Context Length = 8192
INFO: Enforce Eager Mode = True
INFO: Prefix Caching = True
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend =
DecodingConfig(guided_decoding_backend='outlines')
INFO:
--------------------------------------------------------------------------------
-----
INFO: use_ray_spmd_worker: False
INFO: driver_ip: 10.168.88.241
WARNING: Custom allreduce is disabled because it's not supported on more than
two PCIe-only GPUs. To silence this warning, specify
disable_custom_all_reduce=True explicitly.
INFO: Port 2242 is already in use, trying port 2243
INFO: Loading model /local_disk0/deepseek-ai/DeepSeek-V2-Lite-Chat/...
(RayWorkerWrapper pid=597580) WARNING: Custom allreduce is disabled because it's not supported on more than
(RayWorkerWrapper pid=597580) two PCIe-only GPUs. To silence this warning, specify
(RayWorkerWrapper pid=597580) disable_custom_all_reduce=True explicitly.
Cache shape torch.Size([163840, 64])
(RayWorkerWrapper pid=597580) Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 100%|██████| 4/4 [00:02<00:00, 1.93it/s]
INFO: Model loaded in 2.43 seconds.
INFO: Weights memory usage: 3.74 GiB x 8 = 29.90 GiB
INFO: Profiling peak memory usage...
INFO: Model profiling took 2.68 seconds.
INFO: KV Cache memory usage for 8192 tokens: 1.57 x 8 = 12.57 GB
INFO: # GPU blocks: 17678, # CPU blocks: 4854
INFO: Minimum concurrency: 34.53x
INFO: Maximum sequence length allowed in the cache: 282848
INFO: Automatic prefix caching is enabled.
WARNING: embedding_mode is False. Embedding API will not work.
INFO: Started server process [574684]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:1234/ (Press CTRL+C to quit)
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.
INFO: Received request chat-8199dd8619fc4eac871133c01b6cfe4a: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-8199dd8619fc4eac871133c01b6cfe4a.
INFO: Received request chat-a36271aa269e4a768d24ccac32dbbbf5: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Received request chat-2d93df927630475eb7b9ad161d829897: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Received request chat-cbc29d56184b4c1e98362eb197d1d56a: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-a36271aa269e4a768d24ccac32dbbbf5.
INFO: Added request chat-2d93df927630475eb7b9ad161d829897.
INFO: Added request chat-cbc29d56184b4c1e98362eb197d1d56a.
INFO: Received request chat-82a17f5c614e4c63a3f37fb53b435b39: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-82a17f5c614e4c63a3f37fb53b435b39.
INFO: Received request chat-a7faa6744e8d4f9892fd706dfe1c6a01: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-a7faa6744e8d4f9892fd706dfe1c6a01.
INFO: Received request chat-cf25f62dcb5e4f67be34bc73f5dc6541: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-cf25f62dcb5e4f67be34bc73f5dc6541.
INFO: Received request chat-11ee83b4350e4ce596a764af46cb2ec3: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-11ee83b4350e4ce596a764af46cb2ec3.
INFO: Received request chat-aa469a8ea0394ee19ad3a0161648448e: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-aa469a8ea0394ee19ad3a0161648448e.
INFO: Received request chat-c56834e203cc4d4bace31e659a823ed2: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-c56834e203cc4d4bace31e659a823ed2.
INFO: Received request chat-fde7d8207487449686fefc0ca8121d63: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-fde7d8207487449686fefc0ca8121d63.
INFO: Received request chat-aacb0eb61c6f44ed9090c8b17c97d71b: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-aacb0eb61c6f44ed9090c8b17c97d71b.
INFO: Received request chat-57428b1f3be5433785333275be27fcb8: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-57428b1f3be5433785333275be27fcb8.
INFO: Received request chat-17fd16747b37409a9214ec662d088206: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-17fd16747b37409a9214ec662d088206.
INFO: Received request chat-435f7988bfb14cc784823322b9c3568a: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-435f7988bfb14cc784823322b9c3568a.
INFO: Received request chat-f6533432f52c477292ec1870f0f5dc5d: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-f6533432f52c477292ec1870f0f5dc5d.
INFO: Received request chat-f17a6ee71614421b8871a58d271ab605: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-f17a6ee71614421b8871a58d271ab605.
INFO: Received request chat-68076ca30522449eb5fbd1024de8d594: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-68076ca30522449eb5fbd1024de8d594.
INFO: Received request chat-4219b8986d8742b7b6e9299444b69e2f: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-4219b8986d8742b7b6e9299444b69e2f.
INFO: Received request chat-de8a391b2a844df1b2e8dc4c830875bd: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-de8a391b2a844df1b2e8dc4c830875bd.
INFO: Received request chat-37ea5f3358fe48b0acf56ea592547155: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-37ea5f3358fe48b0acf56ea592547155.
INFO: Received request chat-67e047b887624afea9e26752d9cc578e: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-67e047b887624afea9e26752d9cc578e.
INFO: Received request chat-07df8b7e00de4d6db2f22466a26ea31b: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-07df8b7e00de4d6db2f22466a26ea31b.
INFO: Received request chat-548313d36f4b487ca84696c0fc609a49: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-548313d36f4b487ca84696c0fc609a49.
INFO: Received request chat-b0b53bea6aa64ece896d30d49e1d8c08: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-b0b53bea6aa64ece896d30d49e1d8c08.
INFO: Received request chat-4a022eab0bb74929ab0265a3ab36c8de: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-4a022eab0bb74929ab0265a3ab36c8de.
INFO: Received request chat-0ecf5e57293a4cb9b51a811e9e961680: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-0ecf5e57293a4cb9b51a811e9e961680.
INFO: Received request chat-1cd9e6d9d01442fa998d53638c73f080: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-1cd9e6d9d01442fa998d53638c73f080.
INFO: Received request chat-8013f2bca8104781ac72755a4b0bd1ab: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-8013f2bca8104781ac72755a4b0bd1ab.
INFO: Received request chat-ce78cd4b38b647c4911489e38c1ae445: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-ce78cd4b38b647c4911489e38c1ae445.
INFO: Received request chat-d7b73eafbe6a4a70bbbc39b8273b6d5f: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-d7b73eafbe6a4a70bbbc39b8273b6d5f.
INFO: Received request chat-dce7b6c1b1bd4ed8a49f4c360fed2710: prompt: ,
params: SamplingParams(temperature=0.0, max_tokens=8), prompt_token_ids: [],
lora_request: None, prompt_adapter_request: None.
INFO: Added request chat-dce7b6c1b1bd4ed8a49f4c360fed2710.
INFO: Avg prompt throughput: 89.5 tokens/s, Avg generation throughput: 0.1
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.3%, CPU KV cache usage: 0.0%.
*** SIGSEGV received at time=1725473892 on cpu 22 ***
PC: @ 0x5266a0 (unknown) (unknown)
@ 0x7f74bd980520 (unknown) (unknown)
@ 0x7f7333647cc0 (unknown) (unknown)
@ 0x95e040 (unknown) (unknown)
[2024-09-04 18:18:12,787 E 574765 603291] logging.cc:365: *** SIGSEGV received at time=1725473892 on cpu 22 ***
[2024-09-04 18:18:12,791 E 574765 603291] logging.cc:365: PC: @ 0x5266a0 (unknown) (unknown)
[2024-09-04 18:18:12,791 E 574765 603291] logging.cc:365: @ 0x7f74bd980520 (unknown) (unknown)
[2024-09-04 18:18:12,795 E 574765 603291] logging.cc:365: @ 0x7f7333647cc0 (unknown) (unknown)
[2024-09-04 18:18:12,802 E 574765 603291] logging.cc:365: @ 0x95e040 (unknown) (unknown)
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 223 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1069 in call_JitFunction
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1109 in visit_Call
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 496 in visit_Assign
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 351 in visit_compound_statement
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 443 in visit_FunctionDef
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/usr/lib/python3.11/ast.py", line 418 in generic_visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 359 in visit_Module
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1297 in ast_to_ttir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113 in make_ir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276 in compile
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/runtime/jit.py", line 662 in run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/triton/runtime/jit.py", line 345 in <lambda>
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/layers/fused_moe/fused_moe.py", line 511 in fused_experts
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/layers/fused_moe/fused_moe.py", line 610 in fused_moe
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/layers/fused_moe/layer.py", line 89 in forward_cuda
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/_custom_op.py", line 13 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/layers/fused_moe/layer.py", line 72 in apply
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/layers/fused_moe/layer.py", line 244 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/models/deepseek_v2.py", line 139 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/models/deepseek_v2.py", line 379 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/models/deepseek_v2.py", line 421 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/modeling/models/deepseek_v2.py", line 454 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/task_handler/model_runner.py", line 1397 in execute_model
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/task_handler/worker_base.py", line 270 in execute_model
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-171592f6-a687-4ba8-b972-b54f000a4b26/lib/python3.11/site-packages/aphrodite/task_handler/worker_base.py", line 375 in execute_method
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58 in run
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 83 in _worker
File "/usr/lib/python3.11/threading.py", line 975 in run
File "/usr/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
File "/usr/lib/python3.11/threading.py", line 995 in _bootstrap
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, tornado.speedups, _brotli, simplejson._speedups, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, scipy._lib._ccallback_c, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pvectorc, PIL._imaging, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, regex._regex, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, 
pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, ujson, grpc._cython.cygrpc, cuda_utils, __triton_launcher (total: 120)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Thanks for reporting. This seems to be an issue with the FusedMoE Triton kernels. I'll investigate and see what I can come up with.
@AlpinDale I've tried everything I could, and I get the same error across multiple LLM engines, since many of them use this Triton kernel. I'm not sure how to do it, but is it worth rewriting the code to not use these kernels? There is a Mixtral implementation here that I've tested and that works, but there is none for DeepSeek or other MoE models like DBRX.
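For context, the computation a fused MoE kernel performs can be expressed without Triton at all. A minimal pure-Python sketch of top-1 MoE routing (toy sizes and names, not the Aphrodite API) looks like this:

```python
# Toy top-1 MoE forward pass in pure Python, illustrating what a
# fused MoE kernel computes without any Triton involvement.
# All names and sizes here are illustrative, not Aphrodite internals.

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def moe_forward(tokens, gate_w, experts):
    """Route each token to its top-1 expert and apply that expert."""
    outputs = []
    for x in tokens:
        logits = matvec(gate_w, x)  # router scores, one per expert
        best = max(range(len(logits)), key=lambda i: logits[i])
        outputs.append(matvec(experts[best], x))  # apply the chosen expert
    return outputs

# Two tokens, two experts, hidden size 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
gate_w = [[1.0, 0.0],   # expert 0 scores the first feature
          [0.0, 1.0]]   # expert 1 scores the second feature
experts = [[[2.0, 0.0], [0.0, 2.0]],   # expert 0 doubles the input
           [[3.0, 0.0], [0.0, 3.0]]]   # expert 1 triples the input
print(moe_forward(tokens, gate_w, experts))  # [[2.0, 0.0], [0.0, 3.0]]
```

An un-fused path like this (one matmul per selected expert) is much slower than the fused Triton kernel, but it sidesteps the Triton compiler entirely, which is where the crash occurs in the trace above.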
https://github.com/PygmalionAI/aphrodite-engine/blob/main/aphrodite/modeling/models/mixtral_quant.py
This might end up being an issue that I'll need to discuss upstream with the vLLM team. The mixtral_quant modeling code there doesn't use the FusedMoE implementation; it does expert parallelism instead of tensor parallelism between experts, which is quite a bit slower.
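To illustrate the distinction: under expert parallelism each rank owns whole experts and only processes the tokens routed to them, whereas tensor parallelism shards every expert's weights across all ranks. A toy single-process simulation of the expert-parallel dispatch step (illustrative only, not Aphrodite internals):

```python
# Toy simulation of expert-parallel dispatch: each "rank" owns whole
# experts and receives only the tokens routed to one of its experts.
# Names and sizes are illustrative, not Aphrodite internals.

def expert_parallel_dispatch(token_expert_ids, num_ranks, experts_per_rank):
    """Map each token to the rank that owns its chosen expert."""
    assignments = {rank: [] for rank in range(num_ranks)}
    for token_idx, expert_id in enumerate(token_expert_ids):
        rank = expert_id // experts_per_rank  # contiguous expert-to-rank layout
        assignments[rank].append(token_idx)
    return assignments

# 6 tokens routed to experts 0..3, with 2 ranks owning 2 experts each.
routing = [0, 3, 1, 2, 0, 3]
print(expert_parallel_dispatch(routing, num_ranks=2, experts_per_rank=2))
# {0: [0, 2, 4], 1: [1, 3, 5]}
```

Because all tokens for a given expert land on a single rank, load can be uneven across ranks, which is part of why this path tends to be slower than a fused tensor-parallel kernel.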
Yeah, that makes sense. It also seems this issue is specific to this driver version, and I'm unable to update it on Databricks. I've spoken to the Databricks team, but they haven't been able to provide a fix yet.