
feat(cli): use gunicorn to manage server workers on unix systems

Open · r-bit-rry opened this pull request 2 months ago • 14 comments

What does this PR do?

This PR adds production-grade server capabilities to Llama Stack by integrating Gunicorn with Uvicorn workers on Unix-based systems (Linux, macOS). The implementation provides multi-process concurrency, worker recycling to prevent memory leaks, and high-throughput performance while maintaining backward compatibility with Windows through automatic fallback to single-process Uvicorn.

Key Features:

  • Multi-process server: Automatically uses Gunicorn with Uvicorn workers on Unix systems
  • High performance: Tested at 698+ requests/second with sub-millisecond response times using Locust
  • Configurable via environment variables: All Gunicorn parameters (workers, connections, timeouts, etc.) can be configured
  • Worker recycling: Prevents memory leaks through automatic worker restart after configurable request counts
  • Platform detection: Gracefully falls back to Uvicorn on Windows
  • Production-ready defaults: Sensible defaults based on CPU cores, with override options

Implementation Details

Code Changes:

  • Modified src/llama_stack/cli/stack/run.py to add a _run_with_gunicorn() method with platform detection
  • Added the gunicorn>=23.0.0 dependency to pyproject.toml
  • Removed the disallowed import logging usage, replacing it with numeric constants that map log levels for propagation to Gunicorn
  • Implemented proper IPv6 address formatting for bind addresses (see the sketch after this list)
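
For illustration, here is a minimal sketch of what the platform detection, Windows fallback, and IPv6 bind formatting might look like. The helper names and the exact way Gunicorn is launched are assumptions, not the PR's actual code:

import ipaddress
import subprocess
import sys


def _build_bind_address(host: str, port: int) -> str:
    # Wrap IPv6 literals in brackets, e.g. [::1]:8321
    try:
        if ipaddress.ip_address(host).version == 6:
            return f"[{host}]:{port}"
    except ValueError:
        pass  # not a literal IP address (e.g. a hostname)
    return f"{host}:{port}"


def _run_with_gunicorn(app_path: str, host: str, port: int, workers: int) -> None:
    if sys.platform == "win32":
        # Gunicorn depends on fork(); fall back to single-process Uvicorn on Windows
        subprocess.run(["uvicorn", app_path, "--host", host, "--port", str(port)], check=True)
        return
    subprocess.run(
        [
            "gunicorn", app_path,
            "--worker-class", "uvicorn.workers.UvicornWorker",
            "--workers", str(workers),
            "--bind", _build_bind_address(host, port),
        ],
        check=True,
    )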

Environment Variables Added (a parsing sketch follows this list):

  • GUNICORN_WORKERS / WEB_CONCURRENCY: Number of worker processes (default: (2 * CPU cores) + 1)
  • GUNICORN_WORKER_CONNECTIONS: Max concurrent connections per worker (default: 1000)
  • GUNICORN_TIMEOUT: Worker timeout in seconds (default: 120)
  • GUNICORN_KEEPALIVE: Connection keepalive in seconds (default: 5)
  • GUNICORN_MAX_REQUESTS: Restart workers after N requests (default: 10000)
  • GUNICORN_MAX_REQUESTS_JITTER: Randomize worker restart timing (default: 1000)
  • GUNICORN_PRELOAD: Preload app before forking workers (default: true)
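
A rough sketch of how these variables and defaults could be resolved; the helper name and structure are assumptions, not the PR's actual code:

import multiprocessing
import os


def _gunicorn_settings() -> dict:
    # Defaults mirror the values documented above; all are overridable via env vars
    default_workers = 2 * multiprocessing.cpu_count() + 1
    return {
        "workers": int(os.getenv("GUNICORN_WORKERS", os.getenv("WEB_CONCURRENCY", default_workers))),
        "worker_connections": int(os.getenv("GUNICORN_WORKER_CONNECTIONS", 1000)),
        "timeout": int(os.getenv("GUNICORN_TIMEOUT", 120)),
        "keepalive": int(os.getenv("GUNICORN_KEEPALIVE", 5)),
        "max_requests": int(os.getenv("GUNICORN_MAX_REQUESTS", 10000)),
        "max_requests_jitter": int(os.getenv("GUNICORN_MAX_REQUESTS_JITTER", 1000)),
        "preload_app": os.getenv("GUNICORN_PRELOAD", "true").lower() == "true",
    }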

Documentation Updates:

  • Added production server configuration section to docs/docs/distributions/starting_llama_stack_server.mdx
  • Updated server configuration docs in docs/docs/distributions/configuration.mdx
  • Added production features overview to docs/docs/deploying/index.mdx
  • Updated distribution-specific docs: starter.md
  • Documented database race condition warning and mitigation (GUNICORN_PRELOAD=true)

Closes #3883

Test Plan

1. Basic Functionality Test

Verify the server starts correctly with Gunicorn on Unix systems:

# Install dependencies
uv sync --group unit --group test

# Start the server with Gunicorn (Unix/Linux/macOS)
GUNICORN_WORKERS=4 GUNICORN_PRELOAD=true uv run llama stack run src/llama_stack/distributions/starter/run.yaml
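
For context, the Locust-based load numbers cited above could be driven by a minimal locustfile like the sketch below. The endpoint path, payload, and model name are assumptions for illustration, not the author's actual harness:

from locust import FastHttpUser, between, task


class StackUser(FastHttpUser):
    # Simulated client issuing chat completions against a running stack server
    wait_time = between(0.1, 0.5)

    @task
    def chat_completion(self):
        self.client.post(
            "/v1/openai/v1/chat/completions",  # hypothetical endpoint path
            json={
                "model": "qwen/qwen3-30b-a3b-2507",
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 50,
            },
        )

Run with something like: locust -f locustfile.py --host http://localhost:8321 -u 200 -r 20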

r-bit-rry avatar Oct 29 '25 15:10 r-bit-rry

Hi @r-bit-rry!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

meta-cla[bot] avatar Oct 29 '25 15:10 meta-cla[bot]

When running in test mode with Gunicorn:

  • Multiple worker processes are spawned
  • Each worker has separate telemetry instrumentation
  • The mock OTLP collector can't capture spans from all workers
  • Tests expect single-process telemetry collection

r-bit-rry avatar Oct 30 '25 15:10 r-bit-rry

The mock OTLP collector is a basic abstraction and we are really trying to keep it as simple as possible so that it works as nothing more than a testing fixture. For the sake of not burning too much time on it, can we run the integration tests with just a single worker, split out the telemetry tests, or split out the multi-worker tests into their own workflow? Any of those should solve the problem.

iamemilio avatar Nov 04 '25 21:11 iamemilio

This looks good; my only comment would be to fail the server start if any of the metadata stores is SQLite AND Gunicorn is used. If we have multiple workers, each with their own connection to the DB trying to write, we might be exposed to lock errors, etc. We must make sure SQLite isn't used; any other store is OK.

Thanks!

I've added further documentation; there should not be a race condition leading to locking. I'm not sure SQLite will be used in a true production scenario, and in other cases I'm OK with it being used. In any case, we have a 5-second release timer for the locks. Let me know if this is enough.

r-bit-rry avatar Nov 05 '25 08:11 r-bit-rry

High performance: Tested at 698+ requests/second with sub-millisecond response times using Locust

can you also report on the number with uvicorn and same # of workers?

ehhuang avatar Nov 05 '25 22:11 ehhuang

  • I'm also wondering if there's a reason to keep both gunicorn and uvicorn.
  • We recently added a workers param in run config under server.workers, which we should respect or remove depending on the final implementation.

ehhuang avatar Nov 05 '25 22:11 ehhuang

This looks good; my only comment would be to fail the server start if any of the metadata stores is SQLite AND Gunicorn is used. If we have multiple workers, each with their own connection to the DB trying to write, we might be exposed to lock errors, etc. We must make sure SQLite isn't used; any other store is OK. Thanks!

I've added further documentation; there should not be a race condition leading to locking. I'm not sure SQLite will be used in a true production scenario, and in other cases I'm OK with it being used. In any case, we have a 5-second release timer for the locks. Let me know if this is enough.

Actually, after looking into https://github.com/llamastack/llama-stack/pull/4048 I'd like to take back what I said. I also agree that SQLite is not a production target. What I'm asking for is really some additional logging if users happen to have both SQLite AND Gunicorn turned on.

leseb avatar Nov 06 '25 08:11 leseb

  • I'm also wondering if there's a reason to keep both gunicorn and uvicorn.

AFAIK Gunicorn is just a process manager, but we still need Uvicorn for the ASGI server. Please correct me if I'm wrong. Also, when Gunicorn runs, it is invoked with -k uvicorn.workers.UvicornWorker.
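
For illustration, here is a minimal sketch (not the PR's code) of that split: Gunicorn manages the worker processes, while each worker runs Uvicorn's ASGI implementation. The app and bind address are placeholders.

from fastapi import FastAPI
from gunicorn.app.base import BaseApplication

app = FastAPI()  # placeholder ASGI app


class StandaloneGunicorn(BaseApplication):
    # Standard Gunicorn "custom application" pattern for embedding the master process
    def __init__(self, asgi_app, options):
        self.asgi_app = asgi_app
        self.options = options
        super().__init__()

    def load_config(self):
        for key, value in self.options.items():
            self.cfg.set(key, value)

    def load(self):
        return self.asgi_app


if __name__ == "__main__":
    StandaloneGunicorn(
        app,
        {"bind": "0.0.0.0:8321", "workers": 4, "worker_class": "uvicorn.workers.UvicornWorker"},
    ).run()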

  • We recently added a workers param in run config under server.workers, which we should respect or remove depending on the final implementation.

leseb avatar Nov 06 '25 09:11 leseb

@r-bit-rry which LLM did you use for this?

I used a locally running lmstudio backend:

  • Platform: macOS Darwin 25.1.0
  • Model ID: qwen/qwen3-30b-a3b-2507
  • Quantization: 4bit
  • Compatibility: MLX
  • Max Context: 262,144 tokens (loaded: 4,096)
  • Provider Type: remote::openai
  • Provider ID: lmstudio
  • Base URL: http://127.0.0.1:1234/v1
  • API Compatibility: OpenAI v1

(I did not have to introduce any changes; this does not validate lmstudio as a backend.)

Uvicorn Single-Process Mode

  • Command: LLAMA_STACK_ENABLE_GUNICORN=false llama stack run
  • Workers: 4 (configured, 1 active)
  • Request: Single message, 50 token limit

Gunicorn Multi-Process Mode configuration and responses

  • Server: gunicorn 23.0.0
  • Worker Class: uvicorn.workers.UvicornWorker
  • Workers: 25 (calculated: 2 * CPU cores + 1)
  • Worker Connections: 1,000 per worker
  • Request: Chat completion, 100 token limit

r-bit-rry avatar Nov 09 '25 17:11 r-bit-rry

High performance: Tested at 698+ requests/second with sub-millisecond response times using Locust

can you also report on the number with uvicorn and same # of workers?

@ehhuang Uvicorn vs Gunicorn performance is nearly identical for light workloads:

  • Throughput: 1,052 req/s (both configurations within 0.002% of each other)
  • Latency: identical averages (1ms), but Gunicorn shows 33-40% better tail latencies (P95/P99)
  • Reliability: both achieved a 0% failure rate

Note: I'm running this on a new, stronger machine; that's why the numbers differ from the first test.

| Metric | Uvicorn | Gunicorn | Difference |
| --- | --- | --- | --- |
| Throughput | 1,052.00 req/s | 1,051.98 req/s | -0.02 req/s (-0.002%) |
| Avg Latency | 1ms | 1ms | 0ms |
| P95 Latency | 3ms | 2ms | -1ms (+33% better) |
| P99 Latency | 5ms | 3ms | -2ms (+40% better) |
| Total Requests | 62,954 | 62,692 | -262 (-0.4%) |

r-bit-rry avatar Nov 17 '25 10:11 r-bit-rry

@r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?

mattf avatar Nov 17 '25 13:11 mattf

@r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?

Hardware-wise: a high-core machine (16+ cores), probably CPU-intensive serving (such as CPU serving of embeddings or inference, not necessarily a production target), long-running servers, and batch jobs.

On a side note, I'm actively looking to get proper lab hardware provisioned for my team to demonstrate these kinds of efforts and others.

r-bit-rry avatar Nov 18 '25 08:11 r-bit-rry

@r-bit-rry great work on this. what's the perf test harness? what's a config where gunicorn will clearly shine?

Hardware-wise: a high-core machine (16+ cores), probably CPU-intensive serving (such as CPU serving of embeddings or inference, not necessarily a production target), long-running servers, and batch jobs.

On a side note, I'm actively looking to get proper lab hardware provisioned for my team to demonstrate these kinds of efforts and others.

earlier this year a microbenchmark of creating OpenAIClients suggested a limit of 150rps.

being able to do 1k rps is unbelievably good and means a single stack server can drive a business' worth of compute in 2025 (sans an inference-as-a-service business; they'll want more).

the results suggest deployers shouldn't bother with gunicorn simply for perf.

mattf avatar Nov 18 '25 13:11 mattf

This pull request has merge conflicts that must be resolved before it can be merged. @r-bit-rry please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 18 '25 19:11 mergify[bot]

@ashwinb we are in the process of setting up a proper lab for high-load PoC tests. I will report back once we have it benchmarked properly.

r-bit-rry avatar Dec 02 '25 12:12 r-bit-rry

@r-bit-rry which LLM did you use for this?

@mattf @cdoern @leseb I would love to get your input on the results. I've just finished running extensive load and performance tests on our Kubernetes cluster (8x A100). I pushed a single pod to the limit, and it seems there is no clear benefit (performance-wise) to Gunicorn in this scenario (somewhat expected).

Gunicorn integration provides operational process-management features (--max-requests, graceful reload, timeout handling) that may be valuable for specific production scenarios where these capabilities are needed. For pure performance, Uvicorn with --workers 16 was found to be equivalent.

The best use case for Gunicorn would be a bare-metal machine or a single VM; in the case of a Kubernetes cluster, I would recommend sticking with the platform's own scaling mechanisms on top of Uvicorn.

Comprehensive load testing across 760,000+ requests at baseline (20-200 users), high (300-1500 users), and extreme (2000-4000 users) load levels reveals:

Test Environment

  • Platform: OpenShift cluster
  • Model: vllm-inference/RedHatAI/gpt-oss-20b via shared vLLM backend
  • Framework: Locust with FastHttpUser for high-performance HTTP testing
  • Total Requests Tested: 760,883 requests across all test phases

Performance Results Summary

Baseline Load (20-200 Users) - 600K+ Requests

| Configuration | Workers | Success Rate | Avg Response | P50 | P95 | P99 | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Uvicorn | 4 | 100% | 553ms | 580ms | 770ms | 800ms | 104.7 req/s |
| Gunicorn | 4 | 100% | 549ms | 580ms | 760ms | 790ms | 105.8 req/s |
| Gunicorn | 16 | 100% | 547ms | 570ms | 760ms | 790ms | 106.1 req/s |

High Load (300-1500 Users) - 580K Requests

| Configuration | Workers | Total Requests | Failure Rate | Avg Response | P95 | P99 | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Uvicorn | 4 | 290,947 | 0.54% | 2,077ms | 4,500ms | 5,300ms | 329 req/s |
| Gunicorn | 16 | 289,547 | 0.00% | 2,093ms | 4,100ms | 4,600ms | 302 req/s |

Conclusion: Worker count becomes critical under stress. Gunicorn 16w achieved 0% failure rate (289,547/289,547 succeeded) while Uvicorn 4w experienced 1,558 failures at 1200+ users. Trade-off: -8% throughput for 100% reliability.

Extreme Load (2000-4000 Users) - 365K Requests

| Configuration | Workers | Total Requests | Failure Rate | Avg Response | P95 | P99 | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Uvicorn | 16 | 182,588 | 0.12% | 8,357ms | 16,000ms | 17,000ms | 303 req/s |
| Gunicorn | 16 | 182,748 | 0.15% | 8,357ms | 17,000ms | 18,000ms | 305 req/s |

Conclusion: At equal worker counts (16w), server architecture becomes negligible. Both hit vLLM backend saturation at ~2000 users with identical failure rates (0.12% vs 0.15%) and response times (8,357ms). This validates that the high-load advantage was due to worker count (4w vs 16w), not server type (Uvicorn vs Gunicorn).

Note on defaults: Gunicorn's default --timeout=30s will kill workers handling long-running LLM requests (document summarization, long-form generation).

Empirically Validated ✅

  1. Performance parity at equal worker counts (760K+ requests tested)
  2. 16 workers required for high-load stability (0% failure vs 0.54% at 1500 users)
  3. vLLM backend is the bottleneck (minimal improvement from 4w to 16w at baseline)
  4. Worker recovery works for both (killed workers were respawned by both Uvicorn and Gunicorn, which was a pleasant surprise for Uvicorn)

Suggested by Gunicorn documentation, but not tested ⚠️

  1. Memory leak prevention (--max-requests)
  2. Graceful reload (SIGHUP; see the sketch after this list)
  3. Worker timeout handling
  4. TTFT and streaming behavior
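
As a sketch of item 2: a graceful reload only needs a SIGHUP sent to the Gunicorn master process (assuming its PID is written to a pidfile, e.g. via --pid); this was not tested here and the pidfile path is hypothetical.

import os
import signal


def graceful_reload(pidfile: str = "/tmp/llama-stack-gunicorn.pid") -> None:
    # SIGHUP makes the Gunicorn master reload its configuration and
    # restart workers one by one without dropping in-flight requests
    with open(pidfile) as f:
        master_pid = int(f.read().strip())
    os.kill(master_pid, signal.SIGHUP)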

r-bit-rry avatar Dec 17 '25 19:12 r-bit-rry