[Bug]: --enforce-eager reduces performance significantly at -tp >1, far more than in vLLM
🐛 Describe the bug
Ran on 1x or 2x RTX 3060 12GB; the prompt was a single one-sentence coding instruction for a sample program.
While vLLM also shows a slowdown with --enforce-eager in -tp 2 mode, it is comparable to the -tp 1 slowdown, i.e. in the single-digit percent range.
I would like to use -q FP6 quantization, which currently forces eager mode according to https://github.com/aphrodite-engine/aphrodite-engine/issues/1087, and eager mode slashes performance for reasons unknown to me.
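For context, the load was a single request against the OpenAI-compatible endpoint. A minimal sketch of such a request is below; the prompt text, port, and client library are illustrative placeholders, not the exact ones used for the numbers that follow.

```python
# Illustrative single-request load against the OpenAI-compatible server.
# The prompt, port, and max_tokens are placeholders; adjust to your setup
# (use whatever host/port the server prints on startup).
import requests

resp = requests.post(
    "http://localhost:2242/v1/chat/completions",
    json={
        "model": "model",  # matches --served-model-name model
        "messages": [
            {
                "role": "user",
                "content": "Write a Python program that prints the first 20 Fibonacci numbers.",
            }
        ],
        "max_tokens": 1024,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```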
tp 2 + eager

```
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --enforce-eager --served-model-name model -tp 2
```

will result in:

```
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
```
tp 2 + no eager

```
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model -tp 2
```

will result in:

```
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
```
tp 1 + no eager

```
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model
```

will result in:

```
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
```
tp 1 + eager

```
aphrodite run Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --max-model-len 5000 --swap-space 0 --max-num-seqs 1 --disable-log-requests --served-model-name model --enforce-eager
```

will result in:

```
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
```
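To make the gap explicit, here is a quick calculation over the generation throughputs logged above (the percentage drops are computed here; they are not part of the logs):

```python
# Generation throughputs (tokens/s) copied from the INFO lines above.
throughput = {
    ("tp1", "no eager"): 52.4,
    ("tp1", "eager"): 49.4,
    ("tp2", "no eager"): 57.8,
    ("tp2", "eager"): 31.4,
}

for tp in ("tp1", "tp2"):
    baseline = throughput[(tp, "no eager")]
    eager = throughput[(tp, "eager")]
    drop = (1 - eager / baseline) * 100
    print(f"{tp}: {baseline:.1f} -> {eager:.1f} tokens/s "
          f"({drop:.0f}% slower with --enforce-eager)")

# Output:
# tp1: 52.4 -> 49.4 tokens/s (6% slower with --enforce-eager)
# tp2: 57.8 -> 31.4 tokens/s (46% slower with --enforce-eager)
```

So on a single GPU --enforce-eager costs roughly 6%, in line with vLLM, but with -tp 2 it costs roughly 46%.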
Your current environment
<details>
<summary>The output of `python env.py`</summary>

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3060
GPU 1: NVIDIA GeForce RTX 3060
GPU 2: NVIDIA GeForce RTX 3060
GPU 3: NVIDIA GeForce RTX 3060
GPU 4: NVIDIA GeForce RTX 3060
GPU 5: NVIDIA GeForce RTX 3060

Nvidia driver version: 565.57.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 2600X Six-Core Processor
CPU family: 23
Model: 8
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 73%
CPU max MHz: 3600.0000
CPU min MHz: 2200.0000
BogoMIPS: 7199.59
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization: AMD-V
L1d cache: 192 KiB (6 instances)
L1i cache: 384 KiB (6 instances)
L2 cache: 3 MiB (6 instances)
L3 cache: 16 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: 0.6.5
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PIX   PIX   PIX   PHB   PHB   0-11          0              N/A
GPU1  PIX    X    PIX   PIX   PHB   PHB   0-11          0              N/A
GPU2  PIX   PIX    X    PIX   PHB   PHB   0-11          0              N/A
GPU3  PIX   PIX   PIX    X    PHB   PHB   0-11          0              N/A
GPU4  PHB   PHB   PHB   PHB    X    PHB   0-11          0              N/A
GPU5  PHB   PHB   PHB   PHB   PHB    X    0-11          0              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
</details>