feat(cli): use gunicorn to manage server workers on unix systems
What does this PR do?
This PR adds production-grade server capabilities to Llama Stack by integrating Gunicorn with Uvicorn workers on Unix-based systems (Linux, macOS). The implementation provides multi-process concurrency, worker recycling to prevent memory leaks, and high-throughput performance while maintaining backward compatibility with Windows through automatic fallback to single-process Uvicorn.
Key Features:
- Multi-process server: Automatically uses Gunicorn with Uvicorn workers on Unix systems
- High performance: Tested at 698+ requests/second with sub-millisecond response times using locust
- Configurable via environment variables: All Gunicorn parameters (workers, connections, timeouts, etc.) can be configured
- Worker recycling: Prevents memory leaks through automatic worker restart after configurable request counts
- Platform detection: Gracefully falls back to Uvicorn on Windows
- Production-ready defaults: Sensible defaults based on CPU cores, with override options
Implementation Details
Code Changes:
- Modified `src/llama_stack/cli/stack/run.py` to add a `_run_with_gunicorn()` method with platform detection (see the sketch below)
- Added a `gunicorn>=23.0.0` dependency to `pyproject.toml`
- Removed the disallowed `import logging` usage; replaced it with numeric constants for log-level mapping so the log level propagates to Gunicorn
- Implemented proper IPv6 address formatting for bind addresses
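For orientation, here is a rough sketch of the approach described above. This is not the exact PR code; the helper names (`_format_bind`, `run_server`) are illustrative, and the real implementation in `run.py` may differ:

```python
# Illustrative sketch only; the actual code in src/llama_stack/cli/stack/run.py may differ.
import multiprocessing
import sys

import uvicorn


def _format_bind(host: str, port: int) -> str:
    # Gunicorn expects IPv6 literals to be bracketed, e.g. "[::1]:8321"
    if ":" in host and not host.startswith("["):
        return f"[{host}]:{port}"
    return f"{host}:{port}"


def _run_with_gunicorn(app, host: str, port: int) -> None:
    from gunicorn.app.base import BaseApplication

    class StandaloneApplication(BaseApplication):
        """Standard Gunicorn 'custom application' pattern for embedding the server."""

        def __init__(self, application, options):
            self.application = application
            self.options = options
            super().__init__()

        def load_config(self):
            for key, value in self.options.items():
                self.cfg.set(key, value)

        def load(self):
            return self.application

    options = {
        "bind": _format_bind(host, port),
        "workers": (2 * multiprocessing.cpu_count()) + 1,
        "worker_class": "uvicorn.workers.UvicornWorker",
    }
    StandaloneApplication(app, options).run()


def run_server(app, host: str, port: int) -> None:
    if sys.platform == "win32":
        # Windows: fall back to single-process Uvicorn
        uvicorn.run(app, host=host, port=port)
    else:
        _run_with_gunicorn(app, host, port)
```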
Environment Variables Added:
- `GUNICORN_WORKERS` / `WEB_CONCURRENCY`: Number of worker processes (default: `(2 * CPU cores) + 1`; see the mapping sketch below)
- `GUNICORN_WORKER_CONNECTIONS`: Max concurrent connections per worker (default: `1000`)
- `GUNICORN_TIMEOUT`: Worker timeout in seconds (default: `120`)
- `GUNICORN_KEEPALIVE`: Connection keep-alive in seconds (default: `5`)
- `GUNICORN_MAX_REQUESTS`: Restart workers after N requests (default: `10000`)
- `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
- `GUNICORN_PRELOAD`: Preload the app before forking workers (default: `true`)
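A sketch of how these environment variables could map onto Gunicorn settings. The defaults mirror the list above; the helper names are illustrative and not necessarily the PR's exact code:

```python
import multiprocessing
import os


def _int_env(name: str, default: int) -> int:
    # Read an integer setting from the environment, falling back to the default
    return int(os.getenv(name, default))


def gunicorn_options_from_env() -> dict:
    default_workers = (2 * multiprocessing.cpu_count()) + 1
    return {
        "workers": _int_env("GUNICORN_WORKERS", _int_env("WEB_CONCURRENCY", default_workers)),
        "worker_connections": _int_env("GUNICORN_WORKER_CONNECTIONS", 1000),
        "timeout": _int_env("GUNICORN_TIMEOUT", 120),
        "keepalive": _int_env("GUNICORN_KEEPALIVE", 5),
        "max_requests": _int_env("GUNICORN_MAX_REQUESTS", 10000),
        "max_requests_jitter": _int_env("GUNICORN_MAX_REQUESTS_JITTER", 1000),
        "preload_app": os.getenv("GUNICORN_PRELOAD", "true").lower() == "true",
        "worker_class": "uvicorn.workers.UvicornWorker",
    }
```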
Documentation Updates:
- Added a production server configuration section to `docs/docs/distributions/starting_llama_stack_server.mdx`
- Updated the server configuration docs in `docs/docs/distributions/configuration.mdx`
- Added a production features overview to `docs/docs/deploying/index.mdx`
- Updated distribution-specific docs (`starter.md`): documented the database race condition warning and its mitigation (`GUNICORN_PRELOAD=true`)
Closes #3883
Test Plan
1. Basic Functionality Test
Verify the server starts correctly with Gunicorn on Unix systems:
```bash
# Install dependencies
uv sync --group unit --group test

# Start the server with Gunicorn (Unix/Linux/macOS)
GUNICORN_WORKERS=4 GUNICORN_PRELOAD=true uv run llama stack run src/llama_stack/distributions/starter/run.yaml
```
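Once the server is up, a quick smoke check can confirm it is responding. This assumes the default port 8321 and a `/v1/health` endpoint; adjust both for your setup:

```python
# Minimal smoke test; the port and endpoint path are assumptions for illustration.
import httpx

resp = httpx.get("http://localhost:8321/v1/health", timeout=5.0)
resp.raise_for_status()
print(resp.json())
```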
When running in test mode with Gunicorn:
- Multiple worker processes are spawned
- Each worker has separate telemetry instrumentation
- The mock OTLP collector can't capture spans from all workers
- The tests expect single-process telemetry collection
The mock OTLP collector is a basic abstraction, and we are really trying to keep it as simple as possible so it works as nothing more than a testing fixture. For the sake of not burning too much time on it, can we run the integration tests with just a single worker, split out the telemetry tests, or split out the multi-worker tests into their own workflow? Any of these should solve the problem.
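One way to implement the single-worker option would be a test fixture that pins the worker count via the environment. This is a hypothetical sketch, not something included in the PR:

```python
# Hypothetical pytest fixture: force single-worker mode so the mock OTLP
# collector sees spans from the one and only server process.
import pytest


@pytest.fixture(autouse=True)
def single_worker_server(monkeypatch):
    monkeypatch.setenv("GUNICORN_WORKERS", "1")
    monkeypatch.setenv("WEB_CONCURRENCY", "1")
    yield
```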
This looks good; my only comment would be to fail the server start if any of the metadata stores is SQLite AND Gunicorn is used. If we have multiple workers, each with their own connection to the DB trying to write, we might be exposed to lock errors, etc. We must make sure SQLite isn't used; any other store is OK.
Thanks!
I've added further documentation; there should not be a race condition leading to locking. I'm not sure SQLite will be used in a true production scenario, and otherwise I'm OK with it being used. In any case, we have a 5-second release timer for the locks. Let me know if this is enough.
> High performance: Tested at 698+ requests/second with sub-millisecond response times using locust
can you also report on the number with uvicorn and same # of workers?
- I'm also wondering if there's a reason to keep both gunicorn and uvicorn.
- We recently added a `workers` param in the run config under `server.workers`, which we should respect or remove depending on the final implementation.
> This looks good; my only comment would be to fail the server start if any of the metadata stores is SQLite AND Gunicorn is used. If we have multiple workers, each with their own connection to the DB trying to write, we might be exposed to lock errors, etc. We must make sure SQLite isn't used; any other store is OK. Thanks!

> I've added further documentation; there should not be a race condition leading to locking. I'm not sure SQLite will be used in a true production scenario, and otherwise I'm OK with it being used. In any case, we have a 5-second release timer for the locks. Let me know if this is enough.
Actually, after looking into https://github.com/llamastack/llama-stack/pull/4048, I'd like to take back what I said. I also agree that SQLite is not a production target. What I'm asking for is really some additional logging if users happen to have both SQLite AND Gunicorn turned on.
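Something along these lines would cover the logging request; the function name and config fields are hypothetical, not part of the PR:

```python
# Hypothetical startup check: warn (don't fail) when multiple Gunicorn workers
# share a SQLite metadata store.
import logging

logger = logging.getLogger(__name__)


def warn_if_sqlite_with_multiple_workers(metadata_store_url: str, workers: int) -> None:
    if workers > 1 and metadata_store_url.startswith("sqlite"):
        logger.warning(
            "Running %d Gunicorn workers against a SQLite metadata store; "
            "concurrent writes may hit lock contention. Consider a server-based "
            "store (e.g. Postgres) for production.",
            workers,
        )
```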
> I'm also wondering if there's a reason to keep both gunicorn and uvicorn.
AFAIK Gunicorn is just a process manager, but we still need Uvicorn for the ASGI server. Please correct me if I'm wrong. Also, when Gunicorn runs it is invoked with `-k uvicorn.workers.UvicornWorker`.
> We recently added a `workers` param in the run config under `server.workers`, which we should respect or remove depending on the final implementation.
@r-bit-rry which llm did you use for this?
I used a locally running LM Studio backend:
- Platform: macOS Darwin 25.1.0
- Model ID: qwen/qwen3-30b-a3b-2507
- Quantization: 4-bit
- Compatibility: MLX
- Max Context: 262,144 tokens (loaded: 4,096)
- Provider Type: remote::openai
- Provider ID: lmstudio
- Base URL: http://127.0.0.1:1234/v1
- API Compatibility: OpenAI v1

(I did not have to introduce any changes; this does not validate LM Studio as a backend.)
Uvicorn single-process mode:
- Command: `LLAMA_STACK_ENABLE_GUNICORN=false llama stack run`
- Workers: 4 (configured, 1 active)
- Request: single message, 50-token limit
Gunicorn multi-process mode configuration and responses:
- Server: gunicorn 23.0.0
- Worker Class: uvicorn.workers.UvicornWorker
- Workers: 25 (calculated: 2 * CPU cores + 1)
- Worker Connections: 1,000 per worker
- Request: chat completion, 100-token limit
> High performance: Tested at 698+ requests/second with sub-millisecond response times using locust

> can you also report on the number with uvicorn and same # of workers?
@ehhuang Uvicorn vs Gunicorn performance: nearly identical for light workloads.
- Throughput: 1,052 req/s (both configurations within 0.002% of each other)
- Latency: identical 1ms average, but Gunicorn shows 33-40% better tail latencies (P95/P99)
- Reliability: both achieved a 0% failure rate
Note: I'm running this on a new, stronger machine, which is why the numbers differ from the first test.
| Metric | Uvicorn | Gunicorn | Difference |
|---|---|---|---|
| Throughput | 1,052.00 req/s | 1,051.98 req/s | -0.02 req/s (-0.002%) |
| Avg Latency | 1ms | 1ms | 0ms |
| P95 Latency | 3ms | 2ms | -1ms (+33% better) |
| P99 Latency | 5ms | 3ms | -2ms (+40% better) |
| Total Requests | 62,954 | 62,692 | -262 (-0.4%) |
@r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?
> @r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?
Hardware-wise: a high-core machine (16+ cores), probably CPU-intensive workloads (such as CPU serving of embeddings or inference, not necessarily a production target), and long-running servers with batch jobs.
On a side note, I'm actively looking to provision proper lab hardware in my team to demonstrate these kinds of efforts and others.
> @r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?

> Hardware-wise: a high-core machine (16+ cores), probably CPU-intensive workloads (such as CPU serving of embeddings or inference, not necessarily a production target), and long-running servers with batch jobs. On a side note, I'm actively looking to provision proper lab hardware in my team to demonstrate these kinds of efforts and others.
Earlier this year, a microbenchmark of creating OpenAI clients suggested a limit of 150 rps.
Being able to do 1k rps is unbelievably good and means a single stack server can drive a business's worth of compute in 2025 (sans an inference-as-a-service business; they'll want more).
The results suggest deployers shouldn't bother with Gunicorn simply for perf.
@ashwinb we are in the process of setting up a proper lab for high-load PoC tests. I will report back once we have it benchmarked properly.
> @r-bit-rry which llm did you use for this?

@mattf @cdoern @leseb I would love to get your input on the results. I've just finished running extensive load and performance tests on our Kubernetes cluster (8x A100). I pushed a single pod to its limit, and it seems there is no clear benefit (performance-wise) to Gunicorn in this scenario (somewhat expected).
Gunicorn integration provides operational process-management features (`--max-requests`, graceful reload, timeout handling) that may be valuable for specific production scenarios where these capabilities are needed. For pure performance, Uvicorn with `--workers 16` was found equivalent.
The best use case for Gunicorn would be a bare-metal machine or a single VM; in a Kubernetes cluster, I would recommend sticking with the cluster's provided scaling mechanisms on top of Uvicorn.
Comprehensive load testing across 760,000+ requests at baseline (20-200 users), high (300-1500 users), and extreme (2000-4000 users) load levels reveals:
Test Environment
- Platform: OpenShift cluster
- Model: `vllm-inference/RedHatAI/gpt-oss-20b` via a shared vLLM backend
- Framework: Locust with `FastHttpUser` for high-performance HTTP testing (see the sketch below)
- Total Requests Tested: 760,883 requests across all test phases
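For reference, a minimal locustfile in the spirit of this setup. The endpoint path, payload, and model ID are assumptions for illustration; the real harness may differ:

```python
# Minimal Locust sketch using FastHttpUser; run with:
#   locust -f locustfile.py --host http://localhost:8321
from locust import between, task
from locust.contrib.fasthttp import FastHttpUser


class ChatCompletionUser(FastHttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def chat_completion(self):
        # Endpoint path and model ID are placeholders, not verified against the stack
        self.client.post(
            "/v1/openai/v1/chat/completions",
            json={
                "model": "vllm-inference/RedHatAI/gpt-oss-20b",
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 100,
            },
        )
```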
Performance Results Summary
Baseline Load (20-200 Users) - 600K+ Requests
| Configuration | Workers | Success Rate | Avg Response | P50 | P95 | P99 | Throughput |
|---|---|---|---|---|---|---|---|
| Uvicorn | 4 | 100% | 553ms | 580ms | 770ms | 800ms | 104.7 req/s |
| Gunicorn | 4 | 100% | 549ms | 580ms | 760ms | 790ms | 105.8 req/s |
| Gunicorn | 16 | 100% | 547ms | 570ms | 760ms | 790ms | 106.1 req/s |
High Load (300-1500 Users) - 580K Requests
| Configuration | Workers | Total Requests | Failure Rate | Avg Response | P95 | P99 | Throughput |
|---|---|---|---|---|---|---|---|
| Uvicorn | 4 | 290,947 | 0.54% | 2,077ms | 4,500ms | 5,300ms | 329 req/s |
| Gunicorn | 16 | 289,547 | 0.00% | 2,093ms | 4,100ms | 4,600ms | 302 req/s |
Conclusion: Worker count becomes critical under stress. Gunicorn 16w achieved 0% failure rate (289,547/289,547 succeeded) while Uvicorn 4w experienced 1,558 failures at 1200+ users. Trade-off: -8% throughput for 100% reliability.
Extreme Load (2000-4000 Users) - 365K Requests
| Configuration | Workers | Total Requests | Failure Rate | Avg Response | P95 | P99 | Throughput |
|---|---|---|---|---|---|---|---|
| Uvicorn | 16 | 182,588 | 0.12% | 8,357ms | 16,000ms | 17,000ms | 303 req/s |
| Gunicorn | 16 | 182,748 | 0.15% | 8,357ms | 17,000ms | 18,000ms | 305 req/s |
Conclusion: At equal worker counts (16w), server architecture becomes negligible. Both hit vLLM backend saturation at ~2000 users with identical failure rates (0.12% vs 0.15%) and response times (8,357ms). This validates that the high-load advantage was due to worker count (4w vs 16w), not server type (Uvicorn vs Gunicorn).
Note on defaults:
Gunicorn's default `--timeout=30s` will kill workers handling long-running LLM requests (document summarization, long-form generation).
Empirically Validated ✅
- Performance parity at equal worker counts (760K+ requests tested)
- 16 workers required for high-load stability (0% failure vs 0.54% at 1500 users)
- vLLM backend is the bottleneck (minimal improvement from 4w to 16w at baseline)
- Worker recovery works for both (killed workers respawned by both Uvicorn and Gunicorn, which is a good surprise for uvicorn)
Suggested by the Gunicorn documentation, but not tested ⚠️
- Memory leak prevention (`--max-requests`)
- Graceful reload (SIGHUP; see the sketch below)
- Worker timeout handling
- TTFT and streaming behavior
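For the graceful-reload item above, a sketch of how it could be exercised, assuming the Gunicorn master PID is known (e.g. from a `--pid` file). As noted, this was not tested here:

```python
# Send SIGHUP to the Gunicorn master: it reloads configuration and gracefully
# replaces its workers without dropping in-flight connections.
import os
import signal


def graceful_reload(master_pid: int) -> None:
    os.kill(master_pid, signal.SIGHUP)
```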