[Bug]: Can't use speculative decoding
Your current environment
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Gentoo Linux (x86_64)
GCC version: (Gentoo 14.2.1_p20241221 p7) 14.2.1 20241221
Clang version: 19.1.4
CMake version: version 3.30.6
Libc version: glibc-2.40

Python version: 3.12.8 (main, Dec 13 2024, 14:15:46) [GCC 14.2.1 20241116] (64-bit runtime)
Python platform: Linux-6.6.67-gentoo-dist-x86_64-AMD_Ryzen_9_7950X_16-Core_Processor-with-glibc2.40
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 565.77
cuDNN version: Probably one of the following:
/opt/cuda/targets/x86_64-linux/lib/libcudnn.so.8.8.0
/opt/cuda/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.8.0
/opt/cuda/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.8.0
/opt/cuda/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.8.0
/opt/cuda/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.8.0
/opt/cuda/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.8.0
/opt/cuda/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7950X 16-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 31%
CPU max MHz: 5881.0000
CPU min MHz: 545.0000
BogoMIPS: 9003.01
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
Virtualization: AMD-V
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 16 MiB (16 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: 0.6.5
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     0-31          0              N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
Model Input Dumps
No response
🐛 Describe the bug
I am attempting to use speculative decoding on a single RTX 3090, and engine startup fails with an AssertionError.
Command:
aphrodite run ~/weights/Qwen2.5-Coder-14B-Instruct -q fp4 --speculative-model ~/weights/Qwen2.5-Coder-0.5B-Instruct --use-v2-block-manager --num_speculative_tokens=5 --uvloop
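For reference, the same configuration can be reproduced offline (a minimal sketch, assuming Aphrodite exposes the vLLM-style `LLM` class and that the CLI flags above map onto engine-arg keyword names of the same spelling):

```python
# Minimal offline repro sketch. Assumes Aphrodite mirrors vLLM's LLM
# constructor and that these keyword arguments match the CLI flags above.
from aphrodite import LLM, SamplingParams

llm = LLM(
    model="/home/llm/weights/Qwen2.5-Coder-14B-Instruct",
    quantization="fp4",
    speculative_model="/home/llm/weights/Qwen2.5-Coder-0.5B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)  # fails during init_device() with the same AssertionError as below
outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
```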
Output:
(venv) llm@leo ~/aphrodite $ sh ./start.sh
INFO: Multiprocessing frontend to use ipc:///tmp/77175fa8-6bca-4113-b080-b8941f6611b7 for RPC Path.
INFO: Started engine process with PID 482688
WARNING: Casting torch.bfloat16 to torch.float16.
WARNING: To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.6.5 commit cbd51a2) with the following config:
INFO: Model = '/home/llm/weights/Qwen2.5-Coder-14B-Instruct'
INFO: Speculative Config = SpeculativeConfig(draft_model='/home/llm/weights/Qwen2.5-Coder-0.5B-Instruct', num_spec_tokens=5)
INFO: DataType = torch.float16
INFO: Tensor Parallel Size = 1
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = 'fp4'
INFO: Context Length = 32768
INFO: Enforce Eager Mode = True
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO: Scheduler Steps = 1
INFO: Async Output Processing = False
INFO: -------------------------------------------------------------------------------------
INFO: Configuring SpecDecodeWorker with proposer=<class 'aphrodite.spec_decode.multi_step_worker.MultiStepWorker'>
INFO: Configuring SpecDecodeWorker with sampler=<class 'aphrodite.modeling.layers.rejection_sampler.RejectionSampler'>
INFO: Loading model /home/llm/weights/Qwen2.5-Coder-14B-Instruct...
INFO: Loading model in FP4_E2M1 format.
⠙ Loading model weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 27.51/27.51 GiB 0:00:16
INFO: Model weights loaded in 16.32 seconds.
INFO: Total model weights memory usage: 9.13 GiB
INFO: Loading model /home/llm/weights/Qwen2.5-Coder-0.5B-Instruct...
⠹ Loading model weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 942.29/942.32 MiB 0:00:00
INFO: Model weights loaded in 0.22 seconds.
INFO: Total model weights memory usage: 0.93 GiB
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/endpoints/openai/rpc/server.py", line 229, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, rpc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/endpoints/openai/rpc/server.py", line 39, in __init__
self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 741, in from_engine_args
engine = cls(
^^^^
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 630, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 840, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/engine/async_aphrodite.py", line 263, in __init__
super().__init__(*args, **kwargs)
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/engine/aphrodite_engine.py", line 294, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/executor/executor_base.py", line 46, in __init__
self._init_executor()
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/executor/gpu_executor.py", line 38, in _init_executor
self.driver_worker.init_device()
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/spec_decode/spec_decode_worker.py", line 273, in init_device
vocab_size=self._vocab_size)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/functools.py", line 995, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/home/llm/aphrodite/venv/lib/python3.12/site-packages/aphrodite/spec_decode/spec_decode_worker.py", line 926, in _vocab_size
assert all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[rank0]:[W122 23:10:47.734096773 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
ERROR: RPCServer process died before responding to readiness probe
You are using models that have different tokenizer vocabulary sizes; that is exactly what the failing assertion in `spec_decode_worker.py` checks. Find a draft model with the same vocabulary size as the target model.
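A quick way to confirm the mismatch before launching is to compare the `vocab_size` fields of the two model configs. A minimal sketch using `transformers.AutoConfig`, with the paths from the command above (the exact values in the comments are from the published Qwen2.5 configs and are worth verifying locally):

```python
# Sanity check: speculative decoding requires the draft and target models
# to report the same vocabulary size. The published Qwen2.5 configs report
# vocab_size=152064 for 14B but 151936 for 0.5B, which trips the assertion.
from transformers import AutoConfig

target = AutoConfig.from_pretrained("/home/llm/weights/Qwen2.5-Coder-14B-Instruct")
draft = AutoConfig.from_pretrained("/home/llm/weights/Qwen2.5-Coder-0.5B-Instruct")

print(f"target vocab_size: {target.vocab_size}")  # e.g. 152064
print(f"draft  vocab_size: {draft.vocab_size}")   # e.g. 151936
assert target.vocab_size == draft.vocab_size, "vocab sizes differ; pick another draft model"
```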