[Feature] Prevent OOM Crashes in sglang with Large Batches or Image Inputs
Checklist
- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.
Motivation
- I tried using the OpenAI batches API, but noticed that when the number of requests becomes very large, it is easy to run into OOM (out-of-memory) errors that crash `sglang`.
- I've also seen similar OOM crashes in `sglang` when using an MLLM and sending requests with large images.
Do you think it's necessary to proactively prevent these cases? If so, what would be a good approach to handle them?
```python
from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
import json
from openai import OpenAI

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --mem-fraction-static 0.8 --port 8000"
)
wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Build 10,000 identical chat-completion requests
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    }
    for i in range(10000)
]

# Write the requests to a JSONL file and upload it as a single batch job
input_file_path = "batch_requests2.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print_highlight(f"Batch job created with ID: {batch_response.id}")
```
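As a client-side workaround (not a fix for the server-side issue), the batch could be split into smaller chunks so the server never sees all 10,000 requests at once. This is a sketch under my own assumptions: the `chunked`/`submit_in_chunks` helpers and the chunk size of 1000 are hypothetical, not anything provided by `sglang` or the OpenAI client.

```python
# Hypothetical client-side mitigation: submit the requests as several
# smaller batch jobs instead of one huge one. The chunk size of 1000 is
# an arbitrary assumption, not a tuned value.
import json


def chunked(items, size):
    """Yield successive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def submit_in_chunks(client, requests, chunk_size=1000):
    """Upload and start one batch job per chunk, sequentially."""
    batch_ids = []
    for n, chunk in enumerate(chunked(requests, chunk_size)):
        path = f"batch_chunk_{n}.jsonl"
        with open(path, "w") as f:
            for req in chunk:
                f.write(json.dumps(req) + "\n")
        with open(path, "rb") as f:
            file_response = client.files.create(file=f, purpose="batch")
        batch = client.batches.create(
            input_file_id=file_response.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )
        batch_ids.append(batch.id)
    return batch_ids
```

This only bounds how much work is queued per job; ideally the server itself would apply backpressure or reject oversized batches instead of crashing.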
Related resources
No response
Could you try `--disable-fast-image-processor` and `--grammar-backend none`? That should offload image preprocessing entirely to the CPU and reduce the VRAM footprint, I think.
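For reference, appended to the launch command from the repro above it would look something like this (I haven't verified these flag names against the current `sglang` CLI, so treat them as stated in the comment):

```shell
# Launch command from the repro, with the suggested flags appended.
python3 -m sglang.launch_server \
    --model-path qwen/qwen2.5-0.5b-instruct \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --port 8000 \
    --disable-fast-image-processor \
    --grammar-backend none
```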