
Torch with Gunicorn + Flask API performance issue on Docker

Open yothinsaengs opened this issue 9 months ago • 3 comments

I use Gunicorn as the web server for a Flask API and see a performance issue compared with using Waitress as the web server for the same Flask app. When I run a matrix calculation with NumPy, there is no big difference in response time between Gunicorn and Waitress.

Numpy API

@app.route('/numpy')
def _numpy():
    matrix_a = np.random.rand(640, 640, 3)
    count = 0
    while count < 240:
        matrix_a = (matrix_a**2) % 7
        count += 1
    return jsonify({"message": "Hello, World!"})

But when I run the same calculation with torch (both with and without torch.no_grad()):

Torch API

@app.route('/torch')
def _torch():
    matrix_a = torch.rand(640, 640, 3)  # Create a random tensor
    count = 0
    while count < 240:
        matrix_a = (matrix_a ** 2) % 7  # Element-wise squaring and modulo
        count += 1
    return jsonify({"message": "Hello, World!"})

Torch_no_grad API

@app.route('/torch_no_grad')
def _torch_ng():
    with torch.no_grad():
        matrix_a = torch.rand(640, 640, 3)  # Create a random tensor
        count = 0
        while count < 240:
            matrix_a = (matrix_a ** 2) % 7  # Element-wise squaring and modulo
            count += 1
    return jsonify({"message": "Hello, World!"})

there is a huge difference in response time, which shows up at the 2-CPU limit:

limits:
  memory: 1g
  cpus: '8.0'

numpy
----------
waitress: Mean=1.1698s, Std=0.0300s
gunicorn: Mean=1.1715s, Std=0.0311s

torch
----------
waitress: Mean=0.9230s, Std=0.1078s
gunicorn: Mean=0.8869s, Std=0.1190s

torch_no_grad
----------
waitress: Mean=0.9172s, Std=0.1058s
gunicorn: Mean=0.8886s, Std=0.1126s

limits:
  memory: 1g
  cpus: '4.0'

numpy
----------
waitress: Mean=1.1876s, Std=0.0407s
gunicorn: Mean=1.1897s, Std=0.0390s

torch
----------
waitress: Mean=0.9502s, Std=0.1281s
gunicorn: Mean=0.9180s, Std=0.1288s

torch_no_grad
----------
waitress: Mean=0.9119s, Std=0.1063s
gunicorn: Mean=0.8678s, Std=0.1105s

limits:
  memory: 1g
  cpus: '2.0'

numpy
----------
waitress: Mean=1.1881s, Std=0.0494s
gunicorn: Mean=1.1835s, Std=0.0424s

torch
----------
waitress: Mean=0.7837s, Std=0.1328s
gunicorn: Mean=1.3097s, Std=0.0544s

torch_no_grad
----------
waitress: Mean=0.7932s, Std=0.0988s
gunicorn: Mean=1.3300s, Std=0.1083s

I evaluated this on a MacBook Air M2 with 16 GB RAM.
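One hypothesis worth checking (my assumption, not confirmed by these numbers): under `cpus: '2.0'`, the Gunicorn worker process and PyTorch's intra-op thread pool may be oversubscribing the two available CPUs, while NumPy's element-wise ops stay single-threaded. Capping the math-library thread pools before torch/NumPy are imported is a quick way to test that, since these environment variables are read when the libraries initialize:

```python
import os

# Hypothetical mitigation: cap the BLAS/OpenMP thread pools so each worker
# uses a single compute thread. This must run before `import torch` or
# `import numpy`, because the variables are read at library initialization.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# Alternatively, after importing torch, torch.set_num_threads(1) limits
# torch's intra-op parallelism in the same way.
```

If the 2-CPU gap shrinks with this in place, the slowdown is thread contention rather than anything Gunicorn-specific.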

This is the client script that sends requests to Gunicorn and Waitress:

import asyncio
import httpx
import time  
from collections import defaultdict
import numpy as np 
N = 1
url_paths = ["numpy", "torch", "torch_no_grad"]
API_URLS = [
    "http://localhost:8001/",
    "http://localhost:8002/",
]
API_URLS_DICT = {
    "http://localhost:8001/": "waitress",
    "http://localhost:8002/": "gunicorn",
}


async def fetch(client, url, url_path):
    start_time = time.perf_counter()  # Start timing
    response = await client.get(url + url_path, timeout=20.0)

    end_time = time.perf_counter()  # End timing

    response_time = end_time - start_time  # Calculate response time
    return {
        "url": url,
        "status": response.status_code,
        "response_time": response_time,
        "data": response.json()
    }


async def main(url_path):
    async with httpx.AsyncClient() as client:
        tasks = [fetch(client, url, url_path) for url in API_URLS for _ in range(N)]
        results = await asyncio.gather(*tasks)

    return results


if __name__ == "__main__":
    repeat_time = 5
    for url_path in url_paths:
        count = defaultdict(list)
        print(url_path)
        print('----------')
        for _ in range(repeat_time):
            y = asyncio.run(main(url_path))
            for x in y:
                count[API_URLS_DICT[x['url']]].append(x['response_time'])

        for k, v in count.items():
            v = np.array(v)
            print(f"{k}: Mean={v.mean():.4f}s, Std={v.std():.4f}s")

        print()
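One caveat about the script above: it fires requests at both servers concurrently, so the two containers compete for the same host CPUs and can skew each other's timings. A sequential variant (a stdlib-only sketch; the URL is whatever endpoint is being measured) avoids that cross-interference:

```python
import statistics
import time
import urllib.request


def bench(url, n=5, timeout=20.0):
    """Time n sequential GET requests against a single server."""
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        urllib.request.urlopen(url, timeout=timeout).read()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.pstdev(times)
```

Running it once per server/endpoint pair keeps only one container under load at a time.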

yothinsaengs avatar Feb 18 '25 04:02 yothinsaengs

Thanks for the detailed report.

How did you launch the test targets? Specifically, I am inquiring about the command lines containing the localhost:8001 (resp localhost:8002) listen address. I am assuming you are testing against Gunicorn 23.0 on Python 3.11, correct?

pajod avatar Feb 19 '25 15:02 pajod

Thanks for the detailed report.

How did you launch the test targets? Specifically, I am inquiring about the command lines containing the localhost:8001 (resp localhost:8002) listen address. I am assuming you are testing against Gunicorn 23.0 on Python 3.11, correct?

The Python version is 3.10. Here is the Dockerfile:

# Use official Python image
FROM python:3.10

# Set the working directory
WORKDIR /app

# Copy the application files
COPY app.py requirements.txt ./

# Install dependencies
RUN pip install -r requirements.txt
# Install curl for health check
RUN apt-get update && apt-get install -y curl  

# Expose port 8002
EXPOSE 8002

# Run the app with Gunicorn (no -w flag, so the default of 1 sync worker)
CMD ["gunicorn", "-b", "0.0.0.0:8002", "app:app"]

Note: there is no performance difference with or without the health check.

yothinsaengs avatar Feb 20 '25 02:02 yothinsaengs

Did you try the threaded worker?

benoitc avatar Feb 21 '25 16:02 benoitc
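For reference, the threaded worker can be selected with `-k gthread --threads N` on the command line, or via a config file. A sketch (the worker and thread counts here are illustrative, not tuned values):

```python
# gunicorn.conf.py -- illustrative values, not a tuned configuration
bind = "0.0.0.0:8002"
worker_class = "gthread"  # threaded worker instead of the default sync worker
workers = 1               # keep a single process under tight CPU limits
threads = 4               # serve concurrent requests from a thread pool
```

This would be launched with `gunicorn -c gunicorn.conf.py app:app`.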