
feat(cli): use gunicorn to manage server workers on unix systems

Open · r-bit-rry opened this pull request 2 months ago • 14 comments

What does this PR do?

This PR adds production-grade server capabilities to Llama Stack by integrating Gunicorn with Uvicorn workers on Unix-based systems (Linux, macOS). The implementation provides multi-process concurrency, worker recycling to prevent memory leaks, and high-throughput performance while maintaining backward compatibility with Windows through automatic fallback to single-process Uvicorn.

Key Features:

  • Multi-process server: Automatically uses Gunicorn with Uvicorn workers on Unix systems
  • High performance: Tested at 698+ requests/second with sub-millisecond response times using Locust
  • Configurable via environment variables: All Gunicorn parameters (workers, connections, timeouts, etc.) can be configured
  • Worker recycling: Prevents memory leaks through automatic worker restart after configurable request counts
  • Platform detection: Gracefully falls back to Uvicorn on Windows
  • Production-ready defaults: Sensible defaults based on CPU cores, with override options

Implementation Details

Code Changes:

  • Modified src/llama_stack/cli/stack/run.py to add a _run_with_gunicorn() method with platform detection
  • Added the gunicorn>=23.0.0 dependency to pyproject.toml
  • Removed the disallowed import logging usage, replacing it with numeric constants that map log levels for propagation to Gunicorn
  • Implemented proper IPv6 address formatting for bind addresses (see the sketch after this list)
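
For illustration, here is a minimal sketch of what the platform detection, Windows fallback, and IPv6 bind formatting might look like. The helper names and the exact way Gunicorn is launched are assumptions, not the PR's actual code:

import ipaddress
import subprocess
import sys


def _build_bind_address(host: str, port: int) -> str:
    # Wrap IPv6 literals in brackets, e.g. [::1]:8321
    try:
        if ipaddress.ip_address(host).version == 6:
            return f"[{host}]:{port}"
    except ValueError:
        pass  # not a literal IP address (e.g. a hostname)
    return f"{host}:{port}"


def _run_with_gunicorn(app_path: str, host: str, port: int, workers: int) -> None:
    if sys.platform == "win32":
        # Gunicorn depends on fork(); fall back to single-process Uvicorn on Windows
        subprocess.run(["uvicorn", app_path, "--host", host, "--port", str(port)], check=True)
        return
    subprocess.run(
        [
            "gunicorn", app_path,
            "--worker-class", "uvicorn.workers.UvicornWorker",
            "--workers", str(workers),
            "--bind", _build_bind_address(host, port),
        ],
        check=True,
    )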

Environment Variables Added (a parsing sketch follows this list):

  • GUNICORN_WORKERS / WEB_CONCURRENCY: Number of worker processes (default: (2 * CPU cores) + 1)
  • GUNICORN_WORKER_CONNECTIONS: Max concurrent connections per worker (default: 1000)
  • GUNICORN_TIMEOUT: Worker timeout in seconds (default: 120)
  • GUNICORN_KEEPALIVE: Connection keepalive in seconds (default: 5)
  • GUNICORN_MAX_REQUESTS: Restart workers after N requests (default: 10000)
  • GUNICORN_MAX_REQUESTS_JITTER: Randomize worker restart timing (default: 1000)
  • GUNICORN_PRELOAD: Preload app before forking workers (default: true)
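
A rough sketch of how these variables and defaults could be resolved; the helper name and structure are assumptions, not the PR's actual code:

import multiprocessing
import os


def _gunicorn_settings() -> dict:
    # Defaults mirror the values documented above; all are overridable via env vars
    default_workers = 2 * multiprocessing.cpu_count() + 1
    return {
        "workers": int(os.getenv("GUNICORN_WORKERS", os.getenv("WEB_CONCURRENCY", default_workers))),
        "worker_connections": int(os.getenv("GUNICORN_WORKER_CONNECTIONS", 1000)),
        "timeout": int(os.getenv("GUNICORN_TIMEOUT", 120)),
        "keepalive": int(os.getenv("GUNICORN_KEEPALIVE", 5)),
        "max_requests": int(os.getenv("GUNICORN_MAX_REQUESTS", 10000)),
        "max_requests_jitter": int(os.getenv("GUNICORN_MAX_REQUESTS_JITTER", 1000)),
        "preload_app": os.getenv("GUNICORN_PRELOAD", "true").lower() == "true",
    }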

Documentation Updates:

  • Added production server configuration section to docs/docs/distributions/starting_llama_stack_server.mdx
  • Updated server configuration docs in docs/docs/distributions/configuration.mdx
  • Added production features overview to docs/docs/deploying/index.mdx
  • Updated distribution-specific docs: starter.md
  • Documented database race condition warning and mitigation (GUNICORN_PRELOAD=true)

Closes #3883

Test Plan

1. Basic Functionality Test

Verify the server starts correctly with Gunicorn on Unix systems:

# Install dependencies
uv sync --group unit --group test

# Start the server with Gunicorn (Unix/Linux/macOS)
GUNICORN_WORKERS=4 GUNICORN_PRELOAD=true uv run llama stack run src/llama_stack/distributions/starter/run.yaml
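
For context, the Locust-based load numbers cited above could be driven by a minimal locustfile like the sketch below. The endpoint path, payload, and model name are assumptions for illustration, not the author's actual harness:

from locust import FastHttpUser, between, task


class StackUser(FastHttpUser):
    # Simulated client issuing chat completions against a running stack server
    wait_time = between(0.1, 0.5)

    @task
    def chat_completion(self):
        self.client.post(
            "/v1/openai/v1/chat/completions",  # hypothetical endpoint path
            json={
                "model": "qwen/qwen3-30b-a3b-2507",
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 50,
            },
        )

Run with something like: locust -f locustfile.py --host http://localhost:8321 -u 200 -r 20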

r-bit-rry avatar Oct 29 '25 15:10 r-bit-rry

Hi @r-bit-rry!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

meta-cla[bot] avatar Oct 29 '25 15:10 meta-cla[bot]

When running in test mode with Gunicorn:

  • Multiple worker processes are spawned
  • Each worker has separate telemetry instrumentation
  • The mock OTLP collector can't capture spans from all workers
  • Tests expect single-process telemetry collection

r-bit-rry avatar Oct 30 '25 15:10 r-bit-rry

The mock OTLP collector is a basic abstraction and we are really trying to keep it as simple as possible so that it works as nothing more than a testing fixture. For the sake of not burning too much time on it, can we run the integration tests with just a single worker, split out the telemetry tests, or split out the multi-worker tests into their own workflow? Any of those should solve the problem.

iamemilio avatar Nov 04 '25 21:11 iamemilio

This looks good; my only comment would be to fail the server start if any of the metadata stores is SQLite AND Gunicorn is used. If we have multiple workers, each with their own connection to the DB trying to write, we might be exposed to lock errors, etc. We must make sure SQLite isn't used; any other store is OK.

Thanks!

I've added further documentation; there should not be a race condition leading to locking. I'm not sure SQLite will be used in a true production scenario, and in other cases I'm OK with it being used. In any case, we have a 5-second release timer for the locks. Let me know if this is enough.

r-bit-rry avatar Nov 05 '25 08:11 r-bit-rry

High performance: Tested at 698+ requests/second with sub-millisecond response times using Locust

can you also report on the number with uvicorn and same # of workers?

ehhuang avatar Nov 05 '25 22:11 ehhuang

  • I'm also wondering if there's a reason to keep both gunicorn and uvicorn.
  • We recently added a workers param in run config under server.workers, which we should respect or remove depending on the final implementation.

ehhuang avatar Nov 05 '25 22:11 ehhuang

This looks good; my only comment would be to fail the server start if any of the metadata stores is SQLite AND Gunicorn is used. If we have multiple workers, each with their own connection to the DB trying to write, we might be exposed to lock errors, etc. We must make sure SQLite isn't used; any other store is OK. Thanks!

I've added further documentation; there should not be a race condition leading to locking. I'm not sure SQLite will be used in a true production scenario, and in other cases I'm OK with it being used. In any case, we have a 5-second release timer for the locks. Let me know if this is enough.

Actually, after looking into https://github.com/llamastack/llama-stack/pull/4048 I'd like to take back what I said. I also agree that SQLite is not a production target. What I'm asking for is really some additional logging if users happen to have both SQLite AND Gunicorn turned on.

leseb avatar Nov 06 '25 08:11 leseb

  • I'm also wondering if there's a reason to keep both gunicorn and uvicorn.

AFAIK Gunicorn is just a process manager, but we still need Uvicorn for the ASGI server. Please correct me if I'm wrong. Also, when Gunicorn runs, it is invoked with -k uvicorn.workers.UvicornWorker.
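
For illustration, here is a minimal sketch (not the PR's code) of that split: Gunicorn manages the worker processes, while each worker runs Uvicorn's ASGI implementation. The app and bind address are placeholders.

from fastapi import FastAPI
from gunicorn.app.base import BaseApplication

app = FastAPI()  # placeholder ASGI app


class StandaloneGunicorn(BaseApplication):
    # Standard Gunicorn "custom application" pattern for embedding the master process
    def __init__(self, asgi_app, options):
        self.asgi_app = asgi_app
        self.options = options
        super().__init__()

    def load_config(self):
        for key, value in self.options.items():
            self.cfg.set(key, value)

    def load(self):
        return self.asgi_app


if __name__ == "__main__":
    StandaloneGunicorn(
        app,
        {"bind": "0.0.0.0:8321", "workers": 4, "worker_class": "uvicorn.workers.UvicornWorker"},
    ).run()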

  • We recently added a workers param in run config under server.workers, which we should respect or remove depending on the final implementation.

leseb avatar Nov 06 '25 09:11 leseb

@r-bit-rry which LLM did you use for this?

I used a locally running lmstudio backend:

  • Platform: macOS Darwin 25.1.0
  • Model ID: qwen/qwen3-30b-a3b-2507
  • Quantization: 4bit
  • Compatibility: MLX
  • Max Context: 262,144 tokens (loaded: 4,096)
  • Provider Type: remote::openai
  • Provider ID: lmstudio
  • Base URL: http://127.0.0.1:1234/v1
  • API Compatibility: OpenAI v1

(I did not have to introduce any changes; this does not validate lmstudio as a backend.)

Uvicorn Single-Process Mode

  • Command: LLAMA_STACK_ENABLE_GUNICORN=false llama stack run
  • Workers: 4 (configured, 1 active)
  • Request: Single message, 50 token limit

Gunicorn Multi-Process Mode configuration and responses

  • Server: gunicorn 23.0.0
  • Worker Class: uvicorn.workers.UvicornWorker
  • Workers: 25 (calculated: 2 * CPU cores + 1)
  • Worker Connections: 1,000 per worker
  • Request: Chat completion, 100 token limit

r-bit-rry avatar Nov 09 '25 17:11 r-bit-rry

High performance: Tested at 698+ requests/second with sub-millisecond response times using Locust

can you also report on the number with uvicorn and same # of workers?

@ehhuang Uvicorn vs Gunicorn performance is nearly identical for light workloads:

  • Throughput: 1,052 req/s (both configurations within 0.002% of each other)
  • Latency: identical averages (1ms), but Gunicorn shows 33-40% better tail latencies (P95/P99)
  • Reliability: both achieved a 0% failure rate

Note: I'm running this on a new, stronger machine; that's why the numbers differ from the first test.

| Metric | Uvicorn | Gunicorn | Difference |
| --- | --- | --- | --- |
| Throughput | 1,052.00 req/s | 1,051.98 req/s | -0.02 req/s (-0.002%) |
| Avg Latency | 1ms | 1ms | 0ms |
| P95 Latency | 3ms | 2ms | -1ms (+33% better) |
| P99 Latency | 5ms | 3ms | -2ms (+40% better) |
| Total Requests | 62,954 | 62,692 | -262 (-0.4%) |

r-bit-rry avatar Nov 17 '25 10:11 r-bit-rry

@r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?

mattf avatar Nov 17 '25 13:11 mattf

@r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?

Hardware-wise: a high-core machine (16+ cores), probably CPU-intensive serving (such as CPU serving of embeddings or inference, not necessarily a production target), long-running servers, and batch jobs.

On a side note, I'm actively looking to get proper lab hardware provisioned for my team to demonstrate these kinds of efforts and others.

r-bit-rry avatar Nov 18 '25 08:11 r-bit-rry

@r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?

Hardware-wise: a high-core machine (16+ cores), probably CPU-intensive serving (such as CPU serving of embeddings or inference, not necessarily a production target), long-running servers, and batch jobs.

On a side note, I'm actively looking to get proper lab hardware provisioned for my team to demonstrate these kinds of efforts and others.

earlier this year a microbenchmark of creating OpenAIClients suggested a limit of 150rps.

being able to do 1k rps is unbelievably good and means a single stack server can drive a business' worth of compute in 2025 (sans an inference-as-a-service business; they'll want more).

the results suggest deployers shouldn't bother with gunicorn simply for perf.

mattf avatar Nov 18 '25 13:11 mattf

This pull request has merge conflicts that must be resolved before it can be merged. @r-bit-rry please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 18 '25 19:11 mergify[bot]

@ashwinb we are in the process of setting up a proper lab for high-load PoC tests. I will report back once we have it benchmarked properly.

r-bit-rry avatar Dec 02 '25 12:12 r-bit-rry

@r-bit-rry which LLM did you use for this?

@mattf @cdoern @leseb I would love to get your input on the results. I've just finished running extensive load and performance tests on our Kubernetes cluster (8x A100). I pushed a single pod to the limit, and it seems there is no clear benefit (performance-wise) to Gunicorn in this scenario (somewhat expected).

Gunicorn integration provides operational process-management features (--max-requests, graceful reload, timeout handling) that may be valuable for specific production scenarios where these capabilities are needed. For pure performance, Uvicorn with --workers 16 was found to be equivalent.

The best use case for Gunicorn would be a bare-metal machine or a single VM; in the case of a Kubernetes cluster, I would recommend sticking with the platform's own scaling mechanisms on top of Uvicorn.

Comprehensive load testing across 760,000+ requests at baseline (20-200 users), high (300-1500 users), and extreme (2000-4000 users) load levels reveals:

Test Environment

  • Platform: OpenShift cluster
  • Model: vllm-inference/RedHatAI/gpt-oss-20b via shared vLLM backend
  • Framework: Locust with FastHttpUser for high-performance HTTP testing
  • Total Requests Tested: 760,883 requests across all test phases

Performance Results Summary

Baseline Load (20-200 Users) - 600K+ Requests

| Configuration | Workers | Success Rate | Avg Response | P50 | P95 | P99 | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Uvicorn | 4 | 100% | 553ms | 580ms | 770ms | 800ms | 104.7 req/s |
| Gunicorn | 4 | 100% | 549ms | 580ms | 760ms | 790ms | 105.8 req/s |
| Gunicorn | 16 | 100% | 547ms | 570ms | 760ms | 790ms | 106.1 req/s |

High Load (300-1500 Users) - 580K Requests

| Configuration | Workers | Total Requests | Failure Rate | Avg Response | P95 | P99 | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Uvicorn | 4 | 290,947 | 0.54% | 2,077ms | 4,500ms | 5,300ms | 329 req/s |
| Gunicorn | 16 | 289,547 | 0.00% | 2,093ms | 4,100ms | 4,600ms | 302 req/s |

Conclusion: Worker count becomes critical under stress. Gunicorn 16w achieved 0% failure rate (289,547/289,547 succeeded) while Uvicorn 4w experienced 1,558 failures at 1200+ users. Trade-off: -8% throughput for 100% reliability.

Extreme Load (2000-4000 Users) - 365K Requests

| Configuration | Workers | Total Requests | Failure Rate | Avg Response | P95 | P99 | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Uvicorn | 16 | 182,588 | 0.12% | 8,357ms | 16,000ms | 17,000ms | 303 req/s |
| Gunicorn | 16 | 182,748 | 0.15% | 8,357ms | 17,000ms | 18,000ms | 305 req/s |

Conclusion: At equal worker counts (16w), server architecture becomes negligible. Both hit vLLM backend saturation at ~2000 users with identical failure rates (0.12% vs 0.15%) and response times (8,357ms). This validates that the high-load advantage was due to worker count (4w vs 16w), not server type (Uvicorn vs Gunicorn).

Note on defaults: Gunicorn's default --timeout=30s will kill workers handling long-running LLM requests (document summarization, long-form generation).

Empirically Validated ✅

  1. Performance parity at equal worker counts (760K+ requests tested)
  2. 16 workers required for high-load stability (0% failure vs 0.54% at 1500 users)
  3. vLLM backend is the bottleneck (minimal improvement from 4w to 16w at baseline)
  4. Worker recovery works for both (killed workers were respawned by both Uvicorn and Gunicorn, which was a pleasant surprise for Uvicorn)

Suggested by Gunicorn documentation, but not tested ⚠️

  1. Memory leak prevention (--max-requests)
  2. Graceful reload (SIGHUP; see the sketch after this list)
  3. Worker timeout handling
  4. TTFT and streaming behavior
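
As a sketch of item 2: a graceful reload only needs a SIGHUP sent to the Gunicorn master process (assuming its PID is written to a pidfile, e.g. via --pid); this was not tested here and the pidfile path is hypothetical.

import os
import signal


def graceful_reload(pidfile: str = "/tmp/llama-stack-gunicorn.pid") -> None:
    # SIGHUP makes the Gunicorn master reload its configuration and
    # restart workers one by one without dropping in-flight requests
    with open(pidfile) as f:
        master_pid = int(f.read().strip())
    os.kill(master_pid, signal.SIGHUP)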

r-bit-rry avatar Dec 17 '25 19:12 r-bit-rry