[Bug]: Pooling request fails for classification task
Your current environment
Here is the output of the python3 collect_env.py script that I ran inside the Docker container:
root@83c981d8de30:/workspace/vllm# python3 collect_env.py
INFO 02-04 22:49:20 __init__.py:186] Automatically detected platform cpu.
Collecting environment information...
PyTorch version: 2.5.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.31.4
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 14
On-line CPU(s) list: 0-13
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) Ultra 7 165U
CPU family: 6
Model: 170
Thread(s) per core: 2
Core(s) per socket: 7
Socket(s): 1
Stepping: 4
BogoMIPS: 5375.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 336 KiB (7 instances)
L1i cache: 448 KiB (7 instances)
L2 cache: 14 MiB (7 instances)
L3 cache: 12 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.5.0
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1+cpu
[pip3] torchaudio==2.5.1+cpu
[pip3] torchvision==0.20.1+cpu
[pip3] transformers==4.48.2
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.2.dev36+g18016a5e
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
I'm trying to serve a classification model with vLLM on CPU. Here are the steps that I followed:
# Build vLLM docker container for CPU
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.cpu -t opea/vllm-cpu:latest --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
cd ..
rm -rf vllm
# Launch vLLM docker container
docker run -d --rm --name="vllm-service" -p 8000:8000 -e VLLM_CPU_KVCACHE_SPACE=400 opea/vllm-cpu:latest --model Intel/polite-guard --task classify --host 0.0.0.0 --port 8000 --uvicorn-log-level debug
# Server logs
docker logs -f vllm-service
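Before making requests, I confirm the server is ready by polling the /health route (the same route that shows up in the startup logs further below). A minimal sketch of that check, using requests (wait_for_server is just an illustrative helper, not part of vLLM):

import time
import requests

def wait_for_server(base_url: str, timeout: float = 300.0) -> None:
    # Poll the vLLM /health route until it returns 200 or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(base_url + "/health", timeout=5).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(2)
    raise TimeoutError(f"vLLM server at {base_url} did not become healthy in time")

# wait_for_server("http://localhost:8000")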
After confirming that the server is up and running, I used the following Python code to make a request:
import os
import requests
from transformers import AutoConfig


def get_class_labels(model_name: str):
    config = AutoConfig.from_pretrained(model_name)
    if hasattr(config, "id2label"):
        return list(config.id2label.values())
    elif hasattr(config, "label2id"):
        return list(config.label2id.keys())
    else:
        raise ValueError(f"For '{model_name}', can not find `id2label` or `label2id` attribute in config.")


llm_endpoint = "http://localhost:8000"
model_name = "Intel/polite-guard"
class_labels = get_class_labels(model_name)


def predict(input: str):
    prompt = {"model": model_name, "input": input}
    response = requests.post(llm_endpoint + "/pooling", json=prompt)
    print("Status Code:", response.status_code)
    return response


if __name__ == "__main__":
    input = "He is nice"
    response = predict(input)
    print(response)
# ----- Output -----
Status Code: 500
<Response [500]>
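For context, once the request succeeds I plan to map the pooled output back to class_labels roughly like this (a sketch; the data[0]["data"] path is an assumption that the /pooling response mirrors the embeddings response shape):

import numpy as np

def to_label(response, class_labels):
    # Assumption: /pooling returns {"data": [{"data": [...raw pooled scores...]}], ...}
    scores = np.array(response.json()["data"][0]["data"], dtype=np.float64)
    probs = np.exp(scores - scores.max())  # softmax over the raw scores
    probs /= probs.sum()
    return class_labels[int(np.argmax(probs))], probs

# label, probs = to_label(response, class_labels)  # only meaningful once the 500 is fixed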
And the server crashed with the following error message:
INFO 02-06 22:23:31 __init__.py:186] Automatically detected platform cpu.
INFO 02-06 22:23:32 api_server.py:840] vLLM API server version 0.7.2.dev36+g18016a5e
INFO 02-06 22:23:32 api_server.py:841] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='debug', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='Intel/polite-guard', task='classify', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-06 22:23:32 api_server.py:206] Started engine process with PID 23
INFO 02-06 22:23:34 config.py:2383] Downcasting torch.float32 to torch.float16.
INFO 02-06 22:23:36 __init__.py:186] Automatically detected platform cpu.
INFO 02-06 22:23:38 config.py:2383] Downcasting torch.float32 to torch.float16.
WARNING 02-06 22:23:39 config.py:678] Async output processing is not supported on the current platform type cpu.
WARNING 02-06 22:23:39 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 02-06 22:23:42 config.py:678] Async output processing is not supported on the current platform type cpu.
WARNING 02-06 22:23:42 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
INFO 02-06 22:23:42 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2.dev36+g18016a5e) with config: model='Intel/polite-guard', speculative_config=None, tokenizer='Intel/polite-guard', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Intel/polite-guard, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 02-06 22:23:43 cpu.py:39] Cannot use None backend on CPU.
INFO 02-06 22:23:43 cpu.py:40] Using Torch SDPA backend.
INFO 02-06 22:23:43 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 02-06 22:23:44 weight_utils.py:252] Using model weights format ['*.safetensors']
INFO 02-06 22:25:12 weight_utils.py:297] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.69it/s]
INFO 02-06 22:25:13 api_server.py:756] Using supplied chat template:
INFO 02-06 22:25:13 api_server.py:756] None
INFO 02-06 22:25:13 launcher.py:21] Available routes are:
INFO 02-06 22:25:13 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 02-06 22:25:13 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 02-06 22:25:13 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-06 22:25:13 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 02-06 22:25:13 launcher.py:29] Route: /health, Methods: GET
INFO 02-06 22:25:13 launcher.py:29] Route: /ping, Methods: GET, POST
INFO 02-06 22:25:13 launcher.py:29] Route: /tokenize, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /detokenize, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /v1/models, Methods: GET
INFO 02-06 22:25:13 launcher.py:29] Route: /version, Methods: GET
INFO 02-06 22:25:13 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /pooling, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /score, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /v1/score, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /rerank, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 02-06 22:25:13 launcher.py:29] Route: /invocations, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 02-06 22:26:35 logger.py:39] Received request pool-f18cb60a58df4019970e485cfb5f35f2-0: prompt: 'He is nice', params: PoolingParams(additional_metadata=None), prompt_token_ids: [101, 2002, 2003, 3835, 102], lora_request: None, prompt_adapter_request: None.
INFO 02-06 22:26:35 engine.py:275] Added request pool-f18cb60a58df4019970e485cfb5f35f2-0.
ERROR 02-06 22:26:43 client.py:300] RuntimeError('Engine process (pid 23) died.')
ERROR 02-06 22:26:43 client.py:300] NoneType: None
CRITICAL 02-06 22:26:51 launcher.py:101] MQLLMEngine is already dead, terminating server process
INFO: 172.17.0.1:48320 - "POST /pooling HTTP/1.1" 500 Internal Server Error
Unable to get JIT kernel for brgemm. Params: M=5, N=5, K=64, str_a=1, str_b=1, brgemm_type=1, beta=0, a_trans=0, unroll_hint=1, lda=2304, ldb=5, ldc=5, config=0, b_vnni=0
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1]
PS: I'm unable to run the server on CPU with --task classify; refer to this.
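For reference, this is the offline equivalent I would expect to work, assuming the LLM.classify API available in recent vLLM releases (untested on this CPU build; shown only to illustrate the intended classification flow):

from vllm import LLM

# Assumption: task="classify" and LLM.classify() behave as in the vLLM offline
# classification examples; this has not been verified on the CPU build above.
llm = LLM(model="Intel/polite-guard", task="classify")
(output,) = llm.classify(["He is nice"])
print(output.outputs.probs)  # per-class probabilities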
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.