
[Feature] Prevent OOM Crashes in sglang with Large Batches or Image Inputs

Open · yhyang201 opened this issue 7 months ago · 1 comment

Checklist

  • [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • [x] 2. Please use English, otherwise it will be closed.

Motivation

  1. I tried the OpenAI-compatible batches API and noticed that when the number of requests grows very large, it is easy to hit out-of-memory (OOM) errors that crash sglang.
  2. I have seen similar OOM crashes when running an MLLM and sending requests with large images.

Do you think it's necessary to proactively prevent these cases? If so, what would be a good approach to handle them? The script below reproduces the batch-API crash.

import json
import time

from openai import OpenAI

from sglang.utils import (
    launch_server_cmd,
    print_highlight,
    terminate_process,
    wait_for_server,
)


# launch_server_cmd reserves a free port and appends its own --port flag
# to the command, so no explicit --port is passed here.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct"
    " --host 0.0.0.0 --mem-fraction-static 0.8"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Build 10,000 identical chat requests; batches this large trigger the OOM.
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    } for i in range(10000)
]

# Write the requests to a JSONL file and upload it as a batch input file.
input_file_path = "batch_requests2.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Batch job created with ID: {batch_response.id}")

Related resources

No response

yhyang201 · May 12 '25 15:05

Could you try --disable-fast-image-processor and --grammar-backend none? I think that should offload image preprocessing entirely to the CPU and reduce the VRAM footprint.
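
For reference, those flags would be added to the repro's launch command like this (a sketch; whether both flags are supported depends on the installed sglang version):

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct"
    " --host 0.0.0.0 --mem-fraction-static 0.8"
    # Flags suggested above: move image preprocessing to CPU and
    # disable the grammar backend to cut VRAM usage.
    " --disable-fast-image-processor --grammar-backend none"
)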

m0g1cian · May 13 '25 03:05