aphrodite-engine
aphrodite-engine copied to clipboard
[Bug]: Using SillyTavern slow down aphrodite generation speed.
Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 3090
Nvidia driver version: 531.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i5-13600K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 1
BogoMIPS: 6988.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 480 KiB (10 instances)
L1i cache: 320 KiB (10 instances)
L2 cache: 20 MiB (10 instances)
L3 cache: 24 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] rotary-embedding-torch==0.5.3
[pip3] torch==2.3.0+cu121
[pip3] torchaudio==2.3.0+cu121
[pip3] torchvision==0.18.0+cu121
[pip3] triton==2.3.0
[conda] Could not collect ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
🐛 Describe the bug
I have a 3090 and a 4090, aphrodite launched in a WSL CLI:
aphrodite run Qwen2-72B-Instruct-exl2 -tp 2 --gpu-memory-utilization 1 --kv-cache-dtype fp8 --max-context-len-to-capture 8192 --max-model-len 8192 -q exl2
First I used curl to test the server, the generation speed was 14.0 tok/s at average, then, I use SillyTarven to test, the generation speed decreased to 4.0 tok/s, and stay at 4.0 tok/s even swich back to curl again.
Log:
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.9 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.1%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.4 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.3%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.0 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.5%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.9 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.6%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.7 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.9%, CPU KV cache usage: 0.0%
INFO: Finished request cmpl-ffc35d72f97f4cc188284a294a2df830-0.
INFO: 127.0.0.1:45306 - "POST /v1/completions HTTP/1.1" 200
INFO: 127.0.0.1:59962 - "POST /v1/tokenize HTTP/1.1" 200
INFO: 127.0.0.1:59962 - "POST /v1/tokenize HTTP/1.1" 200
INFO: 127.0.0.1:59962 - "POST /v1/tokenize HTTP/1.1" 200
INFO: Received request cmpl-52ca7796d07a41c29cefb1b975e87005-0: prompt: '', sampling_params:
SamplingParams(repetition_penalty=1.1, temperature=0.5, top_p=0.9, mirostat_mode=2, mirostat_tau=5.0, mirostat_eta=0.1,
stop=['\nUser:'], max_tokens=2048), lora_request: None.
INFO: 127.0.0.1:59970 - "POST /v1/completions HTTP/1.1" 200
INFO: Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.7%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
INFO: Finished request cmpl-52ca7796d07a41c29cefb1b975e87005-0.
INFO: Received request cmpl-df21228374bf4f84b4ede0207c2fda19-0: prompt: '', sampling_params:
SamplingParams(mirostat_mode=2, mirostat_tau=6.5, mirostat_eta=0.2, max_tokens=1024), lora_request: None.
INFO: 127.0.0.1:36398 - "POST /v1/completions HTTP/1.1" 200
INFO: Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%