[vllm] How to use fast thinking in vllm?
Start Date
No response
Implementation PR
No response
Reference Issues
No response
Summary
As per the title, I am wondering how to enable fast thinking mode when using this model.
Basic Example
fast thinking
Drawbacks
no drawbacks
Unresolved questions
No response
@Magmanat Received. We will add instructions to the README to make it easier to see how to use it.
@tc-mb Any update on this?
The reason I ask: I currently have a task using vLLM, where I have enabled thinking via the extra_body parameter:
const response = await this.client.chat.completions.create({
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  messages: messages as any,
  model: this.model,
  max_completion_tokens: 1500,
  top_p: 0.95,
  temperature: 0.6,
  // @ts-expect-error chat_template_kwargs is not yet public
  chat_template_kwargs: {"enable_thinking": true}
});
But several responses either have fast thinking, like:
Here is the analysis of the provided video:
{
"event": true,
"importance": "medium",
"event-type": "Vandalism",
"description": "A person is seen approaching a car and opening its door, which could potentially lead to damaging the vehicle."
}
</think>
{
"event": true,
"importance": "medium",
"event-type": "Vandalism",
"description": "A person is seen approaching a car and opening its door, which could potentially lead to damaging the vehicle."
}
or even no thinking at all:
Here is the analysis of the video:
{
"event": true,
"importance": "high",
"event-type": "Medical emergencies",
"description": "A person is seen walking towards a car, then suddenly bends over, appearing to be in distress. They then open the car door and remain inside, suggesting a possible medical emergency."
}
whereas what I want consistently is:
I need to analyze the video and identify the most important predefined event. First, I need to look at each frame carefully.
The video shows a parking area with a grey car and a white car. At the beginning, there's no movement. Then, a woman wearing a striped shirt and grey pants enters the scene from the bottom right. She walks towards the grey car, opens the driver's side door, and gets inside. The car's lights are on throughout.
Now, let's check the predefined event types. None of these are visible. The woman is simply entering the car, no damage, no medical emergencies.
Since no high or medium importance events are detected, the result should be event: false.
</think>
{
"event": false,
"importance": null,
"event-type": null,
"description": "A woman walks into the frame, approaches a parked grey car, opens the driver's side door, and enters the vehicle."
}
Also, for additional context, I am using bitsandbytes to quantize the model. I'm not sure whether this could be affecting the hybrid thinking process as well.
We have added instructions for enabling thinking mode to the documentation; you can check them out here: https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_5_vllm.md
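In short, thinking is toggled per request via chat_template_kwargs. A minimal sketch (the base URL, model path, and prompt below are placeholders for your own deployment):

from openai import OpenAI

# Toggle hybrid thinking per request via chat_template_kwargs.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="models/MiniCPM-V-4_5",
    messages=[{"role": "user", "content": "Describe the scene."}],
    # enable_thinking=True requests deep thinking; False requests fast thinking.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)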
Hi, thanks for the update. Actually, I am already using this parameter, but I am running into the problem of not being able to ensure it does deep thinking. For the difficult classification problems I am giving it, sometimes it doesn't think at all, and sometimes its thinking is very short, as shown above. Is there a way to force it to do deep thinking?
Otherwise, if this does not work, I will try using a chat_template that pre-completes part of the assistant's answer.
I believed there was a way to ensure deep thinking, because the README says there is controllable hybrid fast/deep thinking:
MiniCPM-V 4.5: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, this model outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B in vision-language capabilities, making it the most performant on-device multimodal model in the open-source community. This version brings new features including efficient high-FPS and long video understanding (up to 96x compression rate for video tokens), controllable hybrid fast/deep thinking, strong handwritten OCR and complex table/document parsing. It also advances MiniCPM-V's popular features such as trustworthy behavior, multilingual support and end-side deployability.
unless the control only refers to turning the thinking capability on/off.
@Magmanat This switch should enforce thinking mode. While short thinking may occur, skipping it entirely should not. If it does, please send us the bad case and we will troubleshoot the issue as soon as possible.
For example:
vLLM serving:
vllm serve models/MiniCPM-V-4_5 \
--dtype bfloat16 \
--gpu-memory-utilization 0.92 \
--quantization bitsandbytes \
--load-format bitsandbytes \
--max-model-len 6000 \
--max-num-seqs 10 \
--max-num-batched-tokens 8000 \
--trust-remote-code \
--limit-mm-per-prompt '{"image": 30, "video": 1}' \
--allowed-local-media-path "/home/prince5090/Desktop/arvas-upgrade-web" \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-logprobs 0 \
--swap-space 4 \
--cpu-offload-gb 0 \
--port 8000 \
--host 0.0.0.0 \
--scheduling-policy priority \
--media-io-kwargs '{"video":{"num_frames":64}}'
I get this output
=== Running 30 video inferences ===
Serving files at http://localhost:8080
Video will be served at: http://localhost:8080/external_vids/car_video.mp4
Running inference 1... 127.0.0.1 - - [03/Sep/2025 03:14:10] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (0.90s)
-> No </think> found in response 1
Running inference 2... 127.0.0.1 - - [03/Sep/2025 03:14:11] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.72s)
-> </think> found in response 2
Running inference 3... 127.0.0.1 - - [03/Sep/2025 03:14:15] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.26s)
-> </think> found in response 3
Running inference 4... 127.0.0.1 - - [03/Sep/2025 03:14:17] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.42s)
-> </think> found in response 4
Running inference 5... 127.0.0.1 - - [03/Sep/2025 03:14:18] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (4.65s)
-> </think> found in response 5
Running inference 6... 127.0.0.1 - - [03/Sep/2025 03:14:23] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.52s)
-> </think> found in response 6
Running inference 7... 127.0.0.1 - - [03/Sep/2025 03:14:26] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.65s)
-> </think> found in response 7
Running inference 8... 127.0.0.1 - - [03/Sep/2025 03:14:29] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.81s)
-> </think> found in response 8
Running inference 9... 127.0.0.1 - - [03/Sep/2025 03:14:32] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.19s)
-> </think> found in response 9
Running inference 10... 127.0.0.1 - - [03/Sep/2025 03:14:33] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (0.95s)
-> No </think> found in response 10
Running inference 11... 127.0.0.1 - - [03/Sep/2025 03:14:34] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (4.02s)
-> </think> found in response 11
Running inference 12... 127.0.0.1 - - [03/Sep/2025 03:14:38] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.18s)
-> </think> found in response 12
Running inference 13... 127.0.0.1 - - [03/Sep/2025 03:14:41] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.29s)
-> </think> found in response 13
Running inference 14... 127.0.0.1 - - [03/Sep/2025 03:14:43] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.72s)
-> </think> found in response 14
Running inference 15... 127.0.0.1 - - [03/Sep/2025 03:14:44] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (4.33s)
-> </think> found in response 15
Running inference 16... 127.0.0.1 - - [03/Sep/2025 03:14:49] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.89s)
-> </think> found in response 16
Running inference 17... 127.0.0.1 - - [03/Sep/2025 03:14:52] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.36s)
-> </think> found in response 17
Running inference 18... 127.0.0.1 - - [03/Sep/2025 03:14:54] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (0.79s)
-> No </think> found in response 18
Running inference 19... 127.0.0.1 - - [03/Sep/2025 03:14:55] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.94s)
-> </think> found in response 19
Running inference 20... 127.0.0.1 - - [03/Sep/2025 03:14:58] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.65s)
-> </think> found in response 20
Running inference 21... 127.0.0.1 - - [03/Sep/2025 03:15:00] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.78s)
-> </think> found in response 21
Running inference 22... 127.0.0.1 - - [03/Sep/2025 03:15:03] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.91s)
-> </think> found in response 22
Running inference 23... 127.0.0.1 - - [03/Sep/2025 03:15:06] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.98s)
-> </think> found in response 23
Running inference 24... 127.0.0.1 - - [03/Sep/2025 03:15:09] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.68s)
-> </think> found in response 24
Running inference 25... 127.0.0.1 - - [03/Sep/2025 03:15:11] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.24s)
-> </think> found in response 25
Running inference 26... 127.0.0.1 - - [03/Sep/2025 03:15:13] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (0.69s)
-> No </think> found in response 26
Running inference 27... 127.0.0.1 - - [03/Sep/2025 03:15:14] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.64s)
-> </think> found in response 27
Running inference 28... 127.0.0.1 - - [03/Sep/2025 03:15:16] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.23s)
-> </think> found in response 28
Running inference 29... 127.0.0.1 - - [03/Sep/2025 03:15:18] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.26s)
-> </think> found in response 29
Running inference 30... 127.0.0.1 - - [03/Sep/2025 03:15:21] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.71s)
-> </think> found in response 30
=== BATCH INFERENCE SUMMARY ===
Total inferences: 30
Successful inferences: 30
Missing </think> count: 4
Missing </think> percentage: 13.3%
Average inference time: 2.45s
Total batch time: 73.38s
=== RESPONSES WITHOUT THINKING ===
--- Response 1 (Missing </think>) ---
Here is the analysis of the provided video clip:
{
"event": true,
"importance": "high",
"event-type": "Medical emergencies",
"description": "A person walks towards a gray car, appears to stumble or collapse near the vehicle, and then is attended to by another individual."
}
--- Response 10 (Missing </think>) ---
Here is the analysis of the provided video:
{
"event": true,
"importance": "medium",
"event-type": "Vandalism",
"description": "A person is seen approaching the grey car and opening its door. The individual appears to be interacting with the vehicle in a manner that could be considered vandalism, as they are seen near the car door after opening it."
}
--- Response 18 (Missing </think>) ---
Here's the analysis based on the provided video:
{
"event": true,
"importance": "medium",
"event-type": "Vandalism",
"description": "A person is seen approaching a parked car and opening its door, potentially indicating an act of vandalism or unauthorized access to the vehicle."
}
--- Response 26 (Missing </think>) ---
Here's the analysis of the provided video:
{
"event": true,
"importance": "high",
"event-type": "Medical emergencies",
"description": "A person enters a vehicle and appears to be in distress, potentially indicating a medical emergency."
}
With this script:
from openai import OpenAI
import http.server
import socketserver
import threading
import socket
import time
import atexit
import argparse
# Configuration
USE_STREAMING = False
ENABLE_THINKING = True
VIDEO_FILE = "external_vids/car_video.mp4"
PROMPT = """You are a Video Analytics Service specializing in event detection and classification.
## YOUR TASK
Analyze the provided video and identify ONLY the predefined event types listed below. Report the single most important event if multiple events occur.
## CONTEXT
The video is captured by a surveillance camera in or around an office building.
## PREDEFINED EVENT TYPES BY IMPORTANCE
LOW IMPORTANCE:
- No event types for this level
MEDIUM IMPORTANCE:
- Vandalism: Someone damaging property or public infrastructure
HIGH IMPORTANCE:
- Medical emergencies: Someone collapses and needs urgent attention
- Fire/smoke: Visible flames or smoke in the area
- Fighting: People having a violent physical altercation
## ANALYSIS GUIDELINES
### Detection Rules
1. **Conservative Detection**: Only report events that are clearly visible and verifiable in the video frames
2. **Priority Selection**: When multiple events occur, report ONLY the highest importance event
3. **Strict Categorization**: Use ONLY the predefined event types listed above - never create new categories
4. **No Speculation**: Do not infer events based on:
- Actions occurring outside the video frame
- Partially visible subjects
- Assumptions about what might happen before/after the video
5. **Camera Angle**: Do not assume people are falling because they are at the bottom of the frame, it could be a camera mounted with a top down view.
### Video Quality Considerations
- **Corrupted Pixels**: Ignore artifacts or corrupted sections - such as color distortion or grey backgrounds.
- **Temporal Context**: Analyze the complete sequence of frames to understand event progression
### Response Requirements
Return ONLY a JSON object with this exact structure:
{
"event": boolean, // true if any predefined event detected, false otherwise
"importance": string|null, // "low"/"medium"/"high" if event=true, null if event=false
"event-type": string|null, // exact event type from predefined list if event=true, null if event=false
"description": string // brief, factual description of what is visible in the video
}
## EXAMPLES
Example 1 - Event Detected:
{
"event": true,
"importance": "medium",
"event-type": "Vandalism",
"description": "A man is seen spraying graffiti on the wall"
}
Example 2 - No Event:
{
"event": false,
"importance": null,
"event-type": null,
"description": "Normal scene with nothing unusual happening"
}
"""
MODEL_NAME = "MiniCPM-V-4_5"
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
# Global variable to store server reference
httpd_server = None
# Find available port
def find_free_port(start_port=8080):
    for port in range(start_port, start_port + 100):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(('localhost', port))
                return port
        except OSError:
            continue
    raise RuntimeError("No free ports available")
# Start a simple HTTP server to serve the video
def start_file_server(port):
    global httpd_server
    handler = http.server.SimpleHTTPRequestHandler
    socketserver.TCPServer.allow_reuse_address = True
    httpd_server = socketserver.TCPServer(("localhost", port), handler)
    print(f"Serving files at http://localhost:{port}")
    httpd_server.serve_forever()
# Cleanup function
def cleanup_server():
    global httpd_server
    if httpd_server:
        httpd_server.shutdown()
        httpd_server.server_close()
# Register cleanup function
atexit.register(cleanup_server)
def run_single_inference(video_url, iteration_num):
    """Run a single inference and return the response text and metrics"""
    messages = [
        {"role": "system",
         "content": [
             {"type": "text", "text": "You are a video analysis assistant."},
         ]},
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": PROMPT}
            ],
        }
    ]
    print(f"Running inference {iteration_num}...", end=" ", flush=True)

    # Start timing
    start_time = time.time()

    # Non-streaming mode for consistent results
    chat_response = client.chat.completions.create(
        model=f"models/{MODEL_NAME}",
        messages=messages,
        max_completion_tokens=2000,
        stream=False,
        temperature=0.6,
        extra_body={"chat_template_kwargs": {"enable_thinking": ENABLE_THINKING}}
    )
    end_time = time.time()
    response_text = chat_response.choices[0].message.content

    # Calculate metrics
    total_time = end_time - start_time
    usage = chat_response.usage

    print(f"Done ({total_time:.2f}s)")
    return response_text, total_time, usage
def main():
    parser = argparse.ArgumentParser(description='Run video inference N times and count missing </think> tags')
    parser.add_argument('n', type=int, help='Number of inferences to run')
    args = parser.parse_args()
    N = args.n

    print(f"=== Running {N} video inferences ===")

    # Find free port and start server
    free_port = find_free_port()
    server_thread = threading.Thread(target=start_file_server, args=(free_port,), daemon=True)
    server_thread.start()

    # Wait a moment for server to start
    time.sleep(1)

    # Create video URL
    video_url = f"http://localhost:{free_port}/{VIDEO_FILE}"
    print(f"Video will be served at: {video_url}")

    # Track results
    responses = []
    total_times = []
    missing_think_count = 0

    # Run N inferences
    overall_start_time = time.time()
    for i in range(1, N + 1):
        try:
            response_text, inference_time, usage = run_single_inference(video_url, i)
            responses.append(response_text)
            total_times.append(inference_time)
            # Check if response contains "</think>"
            if "</think>" not in response_text:
                missing_think_count += 1
                print(f" -> No </think> found in response {i}")
            else:
                print(f" -> </think> found in response {i}")
        except Exception as e:
            print(f"Error in inference {i}: {e}")
            missing_think_count += 1  # Count errors as missing thinking
    overall_end_time = time.time()
    overall_time = overall_end_time - overall_start_time

    # Print summary
    print(f"\n=== BATCH INFERENCE SUMMARY ===")
    print(f"Total inferences: {N}")
    print(f"Successful inferences: {len(responses)}")
    print(f"Missing </think> count: {missing_think_count}")
    print(f"Missing </think> percentage: {missing_think_count/N*100:.1f}%")
    print(f"Average inference time: {sum(total_times)/len(total_times):.2f}s" if total_times else "N/A")
    print(f"Total batch time: {overall_time:.2f}s")

    # Show responses missing </think> tags
    print(f"\n=== RESPONSES WITHOUT THINKING ===")
    responses_without_thinking = [(i, response) for i, response in enumerate(responses) if "</think>" not in response]
    if responses_without_thinking:
        for i, response in responses_without_thinking:
            print(f"\n--- Response {i+1} (Missing </think>) ---")
            print(response)
    else:
        print("All responses contain </think> tags!")

if __name__ == "__main__":
    main()
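For reference, the script takes the number of inferences as its single positional argument, e.g. python check_thinking.py 30 (the filename here is arbitrary).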
As we can see, the problematic no-thinking (and even short-thinking) responses all begin with the token sequence "Here is the ...".
edit: This is the reference video
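As a side note, here is a small hypothetical helper in the spirit of the check my script performs; it separates the reasoning from the final answer, returning None for the reasoning when the model skipped thinking:

def split_thinking(response_text: str):
    """Return (reasoning, answer); reasoning is None when the response
    contains no closing </think> tag (i.e. the model skipped thinking)."""
    if "</think>" in response_text:
        reasoning, answer = response_text.split("</think>", 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return None, response_text.strip()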
I managed to get better results by prefilling the assistant turn, so at least the model thinks and produces a good answer:
=== Running 30 video inferences ===
Serving files at http://localhost:8081
Video will be served at: http://localhost:8081/external_vids/car_video.mp4
Running inference 1... 127.0.0.1 - - [03/Sep/2025 03:46:48] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.81s)
-> </think> found in response 1
Running inference 2... 127.0.0.1 - - [03/Sep/2025 03:46:51] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.45s)
-> </think> found in response 2
Running inference 3... 127.0.0.1 - - [03/Sep/2025 03:46:55] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.62s)
-> </think> found in response 3
Running inference 4... 127.0.0.1 - - [03/Sep/2025 03:46:58] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.72s)
-> </think> found in response 4
Running inference 5... 127.0.0.1 - - [03/Sep/2025 03:47:00] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.17s)
-> </think> found in response 5
Running inference 6... 127.0.0.1 - - [03/Sep/2025 03:47:03] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.95s)
-> </think> found in response 6
Running inference 7... 127.0.0.1 - - [03/Sep/2025 03:47:06] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.60s)
-> </think> found in response 7
Running inference 8... 127.0.0.1 - - [03/Sep/2025 03:47:09] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.93s)
-> </think> found in response 8
Running inference 9... 127.0.0.1 - - [03/Sep/2025 03:47:11] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.88s)
-> </think> found in response 9
Running inference 10... 127.0.0.1 - - [03/Sep/2025 03:47:13] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.27s)
-> </think> found in response 10
Running inference 11... 127.0.0.1 - - [03/Sep/2025 03:47:17] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.75s)
-> No </think> found in response 11
Running inference 12... 127.0.0.1 - - [03/Sep/2025 03:47:18] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (4.22s)
-> </think> found in response 12
Running inference 13... 127.0.0.1 - - [03/Sep/2025 03:47:23] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.23s)
-> No </think> found in response 13
Running inference 14... 127.0.0.1 - - [03/Sep/2025 03:47:26] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.44s)
-> </think> found in response 14
Running inference 15... 127.0.0.1 - - [03/Sep/2025 03:47:28] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (4.57s)
-> No </think> found in response 15
Running inference 16... 127.0.0.1 - - [03/Sep/2025 03:47:33] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.93s)
-> </think> found in response 16
Running inference 17... 127.0.0.1 - - [03/Sep/2025 03:47:36] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.40s)
-> </think> found in response 17
Running inference 18... 127.0.0.1 - - [03/Sep/2025 03:47:38] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.85s)
-> </think> found in response 18
Running inference 19... 127.0.0.1 - - [03/Sep/2025 03:47:41] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.97s)
-> </think> found in response 19
Running inference 20... 127.0.0.1 - - [03/Sep/2025 03:47:44] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.65s)
-> </think> found in response 20
Running inference 21... 127.0.0.1 - - [03/Sep/2025 03:47:46] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.30s)
-> No </think> found in response 21
Running inference 22... 127.0.0.1 - - [03/Sep/2025 03:47:49] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (1.89s)
-> </think> found in response 22
Running inference 23... 127.0.0.1 - - [03/Sep/2025 03:47:51] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.10s)
-> </think> found in response 23
Running inference 24... 127.0.0.1 - - [03/Sep/2025 03:47:54] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.61s)
-> </think> found in response 24
Running inference 25... 127.0.0.1 - - [03/Sep/2025 03:47:57] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.93s)
-> </think> found in response 25
Running inference 26... 127.0.0.1 - - [03/Sep/2025 03:48:00] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.94s)
-> </think> found in response 26
Running inference 27... 127.0.0.1 - - [03/Sep/2025 03:48:03] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.71s)
-> </think> found in response 27
Running inference 28... 127.0.0.1 - - [03/Sep/2025 03:48:05] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (2.55s)
-> </think> found in response 28
Running inference 29... 127.0.0.1 - - [03/Sep/2025 03:48:08] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.92s)
-> No </think> found in response 29
Running inference 30... 127.0.0.1 - - [03/Sep/2025 03:48:12] "GET /external_vids/car_video.mp4 HTTP/1.1" 200 -
Done (3.35s)
-> </think> found in response 30
=== BATCH INFERENCE SUMMARY ===
Total inferences: 30
Successful inferences: 30
Missing </think> count: 5
Missing </think> percentage: 16.7%
Average inference time: 2.89s
Total batch time: 86.73s
=== RESPONSES WITHOUT THINKING ===
--- Response 11 (Missing </think>) ---
analyze the video and identify the most important predefined event type.
The video shows a gray car parked in a lot, with its headlights on. A person appears, walking towards the gray car, and then opens the driver's side door. The video does not show any of the high-importance events like medical emergencies, fire/smoke, or fighting.
The events in the video are:
- A person walking towards a car
- A person opening a car door
None of these events match the high-importance event types: Medical emergencies, Fire/smoke, Fighting.
Since there is no high-importance event, the event is false.
{
"event": false,
"importance": null,
"event-type": null,
"description": "A person walks towards a parked gray car and opens the driver's side door."
}
--- Response 13 (Missing </think>) ---
analyze the provided video to identify any predefined event types and report the single most important event if multiple events occur.
First, I will carefully examine the video frame by frame, paying attention to the actions and behaviors of the subjects, as well as any potential hazards or unusual occurrences.
The video shows a surveillance camera capturing a scene in a parking lot. In the initial frames, there are two cars parked on the right side of the frame, one grey and one white. The ground is concrete with visible cracks and some leaves scattered around. On the left side of the frame, there is a partial view of a structure.
As the video progresses, a woman wearing a striped shirt, grey pants, and white shoes appears from the bottom of the frame. She is walking towards the grey car, carrying a bag in her right hand. She approaches the grey car, opens the driver's side door, and begins to get inside.
I will now apply the analysis guidelines and detection rules to identify any predefined events.
The video does not show any clear instances of vandalism, medical emergencies, fighting, fire, or smoke. The woman's actions are consistent with normal behavior, such as entering a car after parking.
Since no events of high or medium importance are detected, I will report the absence of any predefined event.
{
"event": false,
"importance": null,
"event-type": null,
"description": "A woman walks from the bottom of the frame towards a grey car, opens the driver's side door, and begins to enter the vehicle. The scene is a parking lot with two cars and no unusual events."
}
--- Response 15 (Missing </think>) ---
analyze the video carefully to identify any predefined events. Let me start by looking at the timestamps and the actions of the people in the scene.
The video starts with a view of two cars parked on a concrete surface. The timestamp is 2025-09-02 19:1:3:02. The camera angle is fixed, showing the cars and the ground between them. There is no unusual activity visible in these initial frames.
As the video progresses, the timestamp reaches 2025-09-02 19:1:3:12. At this point, a person wearing a striped shirt and grey pants appears. They are walking towards the grey car, carrying a bag. This action is clear and consistent with typical behavior in a parking area.
Continuing to the timestamp 2025-09-02 19:1:3:23, the person reaches the driver's side door of the grey car and opens it. This action is deliberate and matches the context of approaching a vehicle.
In the final part of the video, up to the timestamp 2025-09-02 19:1:3:32, the person is seen standing next to the open door of the grey car, preparing to enter. The car's lights are on, and the scene remains consistent with a normal parking area.
Now, let's apply the predefined event detection rules:
1. **Conservative Detection**: The events are clearly visible and verifiable in the video frames.
2. **Priority Selection**: Only one event type is detected.
3. **Strict Categorization**: The event type matches one of the predefined categories.
4. **No Speculation**: No assumptions are made beyond what is visible in the frames.
The most important event in the video is the person approaching and entering the car. This action does not fit any of the predefined "LOW", "MEDIUM", or "HIGH" importance events such as vandalism, medical emergencies, or fighting. Therefore, no predefined event type is reported.
{
"event": false,
"importance": null,
"event-type": null,
"description": "A person approaches and enters a grey car in a parking area."
}
--- Response 21 (Missing </think>) ---
analyze the video to determine if any of the predefined event types occur. I'll examine each frame and compare it to the event definitions provided.
The video begins with a view of a parking area, showing two cars parked side by side. The ground is concrete, and there's a drainage grate on the left side. The timestamp in the top right corner indicates the date and time. The scene is calm, with no unusual activity visible.
Then, a person wearing a striped shirt and grey pants appears in the frame. They are walking towards the grey car, holding a bag in one hand. As they approach the car, they reach for the door handle and open the door. The car's headlights are on, and the person appears to be preparing to enter the vehicle.
I will now check the event definitions to see if any match the observed actions. The events listed are Vandalism, Medical emergencies, Fire/smoke, and Fighting. None of these match the actions of a person approaching and opening a car door.
The event "Vandalism" is not present, as there is no damage to property or infrastructure. "Medical emergencies" is not present, as no one collapses or requires urgent attention. "Fire/smoke" is not present, as there are no flames or smoke visible. "Fighting" is not present, as there is no violent physical altercation.
Since none of the predefined event types are detected, the event flag should be false.
{
"event": false,
"importance": null,
"event-type": null,
"description": "A person approaches a grey car, opens the door, and prepares to enter."
}
--- Response 29 (Missing </think>) ---
analyze the video carefully to detect any predefined events. Let me start by examining each frame and noting the actions and objects present.
The video shows a parking lot with two cars: a grey SUV and a white sedan. The ground is concrete, and there is a pillar on the left side of the frame. The timestamp at the top indicates the date and time.
At the beginning of the video, the grey SUV’s headlights are on, but there is no significant action occurring yet.
Then, a woman wearing a striped shirt, grey pants, and white shoes enters the frame from the right side. She is carrying a light-colored bag over her shoulder. She walks towards the grey SUV and approaches the driver’s side door.
Next, she opens the driver’s side door of the grey SUV. There is no sign of any unusual behavior or damage to the property.
The scene remains calm, with the main action being the woman approaching and opening the car door. I need to check if any of the predefined events are present.
I will now evaluate the events according to the predefined types.
- **Vandalism**: No one is damaging property or public infrastructure.
- **Medical emergencies**: No one is collapsing or showing signs of needing urgent attention.
- **Fire/smoke**: There is no visible fire or smoke in the area.
- **Fighting**: No violent physical altercation is occurring.
The only notable action is the woman approaching and opening the car door. Since none of the high or medium importance events are present, the most important event is the action of approaching and opening the car door. However, this is not one of the predefined event types.
According to the guidelines, I must report only predefined events. Since no predefined event type is detected, the result should be "No event."
{
"event": false,
"importance": null,
"event-type": null,
"description": "A woman approaches the grey SUV and opens the driver's side door."
}
Change in code:
messages = [
    {"role": "system",
     "content": [
         {"type": "text", "text": "You are a video analysis assistant."},
     ]},
    {
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": PROMPT}
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "<think>\nI need to"}
        ]
    }
]
chat_response = client.chat.completions.create(
    model=f"models/{MODEL_NAME}",
    messages=messages,
    max_completion_tokens=2000,
    stream=False,
    temperature=0.6,
    extra_body={"chat_template_kwargs": {"enable_thinking": ENABLE_THINKING},
                "add_generation_prompt": False,
                "continue_final_message": True}
)
but somehow this still does not consistently generate thinking tokens before the answer is output.
I changed the chat template to:
{%- set enable_thinking = enable_thinking | default(false) %}
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set content = message.content %}
{%- set reasoning_content = '' %}
{#- CHANGED: If this is the last message, just output it as-is #}
{%- if loop.last %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- else %}
{#- For non-last messages, keep the original logic #}
{%- if message.reasoning_content is defined and message.reasoning_content is not none %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in message.content %}
{%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
{%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if reasoning_content %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- if enable_thinking is defined and enable_thinking is true %}
{{- '<think>\n' }}
{%- endif %}
{%- endif %}
This allows me to prefill whatever I want in the assistant turn. I made the change because I noticed that, with the current MiniCPM-V 4.5 template in vLLM, attempting a prefill renders the prompt as:
<|im_start|>assistant
<think>
</think>
<think>
I need to
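With the modified template, the last assistant message is emitted as-is, so the rendered prompt should instead end with just the prefill, and generation continues from inside the thinking block:

<|im_start|>assistant
<think>
I need to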
After making this change and prefilling the thinking with "I need to", all occurrences of missing </think> are eliminated:
=== BATCH INFERENCE SUMMARY ===
Total inferences: 30
Successful inferences: 30
Missing </think> count: 0
Missing </think> percentage: 0.0%
Average inference time: 3.31s
Total batch time: 99.44s
I'm glad to see that you seem to have resolved the issue. Please review my understanding to confirm it's correct. If you have any further questions, please feel free to ask us through an issue.