
bug: Uneven request distribution across workers in BentoML service

Open chlee1016 opened this issue 5 months ago • 10 comments

Describe the bug

While running the BentoML service with 4 workers (each with 1 thread), it appears that the incoming HTTP requests are not evenly balanced across the worker processes. I'd like to know if there's any configuration I'm missing.

2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43376 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.820ms (trace=2cf1120afa437d5ebe8f9792eb3519b0,span=2507eaf9014aba6e,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43608 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 202.028ms (trace=5565f4c5a624f3cda4c4835b2853f727,span=0e51701b5e614c96,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43816 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.912ms (trace=d475a2d347df94ca4edd3da858d16905,span=7b44259dafc4f72b,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44090 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.699ms (trace=a539c617542d29cb80aa70dc8b2ee42d,span=3d80958ead45d70c,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44270 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.748ms (trace=fdb5dc2c2773f0d87a2ff20acdd45eb3,span=d83e432d948c33dc,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44538 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.562ms (trace=dcdaccdab55fec02b2aff73d8227d733,span=9f3ab4fe948da5df,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44716 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.492ms (trace=84fc319d34bfa9337a2eb08314015323,span=d0b1e9ead2bb78eb,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44962 (scheme=http,method=POST,path=/classify,type=application/json,length=78) (status=200,type=application/json,length=10) 201.729ms (trace=ae1647b92484b84a80ba7cee49d9667c,span=537587c62b086b50,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45190 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.974ms (trace=ac90d9510dc60c5162309c900911d846,span=a1842bae213272ce,sampled=0,service.name=Predictor)
2025-08-06T06:00:55+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45382 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/
...

Here are some simple statistics.

[screenshot with the request statistics attached]

Full logs are attached here. bentoml_test_server.log
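
For reference, the per-worker counts in the screenshot can be reproduced from the attached log with a small helper along these lines (illustrative script, not part of BentoML); it simply counts the worker index in the [entry_service:Predictor:N] prefix shown in the log lines above:

# tally_workers.py -- illustrative helper, not part of BentoML.
# Counts how many access-log lines each worker handled, based on the
# "[entry_service:Predictor:N]" prefix visible in the log lines above.
import re
import sys
from collections import Counter

WORKER_RE = re.compile(r"\[entry_service:Predictor:(\d+)\]")

def tally(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path) as log:
        for line in log:
            match = WORKER_RE.search(line)
            if match:
                counts[int(match.group(1))] += 1
    return counts

if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "bentoml_test_server.log"
    for worker, count in sorted(tally(log_path).items()):
        print(f"worker {worker}: {count} requests")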

Expected behavior

All requests should be evenly distributed among the workers.

To reproduce

  1. Prepare a simple BentoML service class.
# service3.py
import bentoml
import logging
import time

bentoml_logger = logging.getLogger("bentoml")


@bentoml.service(workers=4, threads=1)
class Predictor:
    def __init__(self):
        pass

    @bentoml.api
    def classify(self, input_ids: list[list[int]]) -> list[float]:
        """
        input_ids example:
        [[82, 13, 59, 45, 97, 36, 74, 6, 91, 12, 33, 19, 77, 68, 40, 50]]
        """
        time.sleep(0.2)

        return [0.1, 0.2]
  2. Run the BentoML service.
$ bentoml serve service3:Predictor
  3. Check that all the workers are running via htop. [screenshot attached]

  4. Prepare client code to generate HTTP requests.

import numpy as np
import requests
import time

def classify_input_ids():
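    # build one random input batch and send a single blocking POST;
    # the caller loops, so requests go out strictly one at a time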
    input_ids = np.random.randint(0, 100, (1, 16)).tolist()
    response = requests.post(
        "http://bentoml-test-server:3000/classify",
        json={"input_ids": input_ids},
        headers={
            "accept": "text/plain",
            "Content-Type": "application/json",
            "Connection": "close"
        }
    )
    print("Status Code:", response.status_code)
    print("Response:", response.text)

def run_for_duration(seconds: int):
    end_time = time.time() + seconds
    count = 0
    while time.time() < end_time:
        classify_input_ids()
        count += 1
    print(f"Sent {count} requests in total.")

if __name__ == "__main__":
    duration = int(input("Enter the duration to send requests (in seconds): "))
    run_for_duration(duration)
  5. Run the client code.
$ python3 bento_request_en.py 
Enter the duration to send requests (in seconds): 180
...
  6. Check the logs of the BentoML service.
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44538 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.562ms (trace=dcdaccdab55fec02b2aff73d8227d733,span=9f3ab4fe948da5df,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44716 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.492ms 

Environment

bentoml: 1.4.19

chlee1016 commented on Aug 06 '25

We do not have control over which worker the data will go through: https://circus.readthedocs.io/en/latest/tutorial/step-by-step/#installation:~:text=Note%20The%20load%20balancing%20is%20operated%20by%20the%20operating%20system%20so%20you%E2%80%99re%20getting%20the%20same%20speed%20as%20any%20other%20pre%2Dfork%20web%20server%20like%20Apache%20or%20NGinx.%20Circus%20does%20not%20interfer%20with%20the%20data%20that%20goes%20through.
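
For context, the pre-fork model described there boils down to something like the sketch below (plain sockets, illustration only, not circus or BentoML code): every worker blocks in accept() on the same listening socket, and the kernel alone decides which worker receives each new connection.

# prefork_sketch.py -- illustration only, not circus/BentoML code (POSIX only).
# The parent binds one listening socket and forks N workers; every worker
# blocks in accept() on that same socket, so the kernel (not the application)
# decides which worker receives each incoming connection.
import os
import socket

NUM_WORKERS = 4

def serve(listener: socket.socket) -> None:
    while True:
        conn, _ = listener.accept()  # the kernel picks one of the blocked workers
        conn.sendall(f"handled by pid {os.getpid()}\n".encode())
        conn.close()

if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 9000))
    listener.listen(128)

    for _ in range(NUM_WORKERS):
        if os.fork() == 0:   # child process inherits the listening socket
            serve(listener)  # loops forever

    for _ in range(NUM_WORKERS):
        os.wait()            # parent just waits for the children

Connecting a few times (for example with nc 127.0.0.1 9000) typically shows that the handling PID is not rotated round-robin.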

How does it matter in your case?

frostming commented on Aug 06 '25

@chlee1016 May I ask what your operating system is? If it's Linux, what is the kernel version?

bojiang commented on Aug 06 '25

@bojiang Sure, here is the additional information.
OS: Debian GNU/Linux 12
Kernel: 5.10.109-1.20220408.el7.x86_64

chlee1016 commented on Aug 06 '25

@chlee1016 Thanks. Another thing: currently the requests are sent one by one. Would you like to test again with genuinely concurrent requests?

bojiang commented on Aug 06 '25

We are still following up, @chlee1016.

bojiang commented on Aug 08 '25

@bojiang, I have run the test in a concurrent setup, but got a similar result.

  1. Prepare 4 pods running on different nodes.
  2. Run python3 bento_request_en.py in each pod.
  3. Check the result.

Similar to the previous test, it seems that almost all of the requests were routed to worker #3. The log file is attached here.

bentoml_test_concurrent_requests.log

chlee1016 commented on Aug 08 '25

2. Run python3 bento_request_en.py in each pod.

The balancing result is not correct since you are sending requests in sequence

frostming commented on Aug 08 '25

@frostming

The balancing result is not correct since you are sending requests in sequence

I ran the experiment because I expected the requests to be distributed across multiple workers, even though the distribution is not controlled by the application. Do you mean that I should make the service API asynchronous and have the client send requests asynchronously as well?

chlee1016 commented on Aug 12 '25

Do you mean that I should make the service API asynchronous and have the client send requests asynchronously as well?

Just make the client send requests concurrently

frostming commented on Aug 12 '25

@frostming

Client code (threaded, to send concurrent requests):
import numpy as np
import requests
import time
import threading

URL = "http://localhost:8080/classify"

def classify_input_ids():
    input_ids = np.random.randint(0, 100, (1, 16)).tolist()
    resp = requests.post(
        URL,
        json={"input_ids": input_ids},
        headers={
            "accept": "text/plain",
            "Content-Type": "application/json",
            "Connection": "close",
        },
        timeout=10,
    )
    return resp.status_code, resp.text

def worker_thread(name: str, end_time: float, counter: list, lock: threading.Lock):
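    # each thread issues blocking requests back-to-back until end_time,
    # then adds its local count to the shared counter under the lock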
    local_count = 0
    while time.time() < end_time:
        try:
            status, _ = classify_input_ids()
            local_count += 1
        except Exception:
            pass
    with lock:
        counter[0] += local_count

def run_concurrent(duration: int, num_workers: int):
    end_time = time.time() + duration
    counter = [0]
    lock = threading.Lock()
    threads = []

    for i in range(num_workers):
        t = threading.Thread(target=worker_thread, args=(f"worker-{i+1}", end_time, counter, lock), daemon=True)
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

    print(f"Total {counter[0]} requests sent. (Workers: {num_workers}, Duration: {duration}s)")

if __name__ == "__main__":
    try:
        duration = int(input("Enter duration in seconds: "))
        workers = int(input("Enter number of workers: "))
    except ValueError:
        print("Please enter an integer.")
        raise

    run_concurrent(duration, workers)
[screenshot attached]

Results (10 seconds, 4 client threads):
Worker 1: 8 requests
Worker 2: 52 requests
Worker 3: 2 requests
Worker 4: 171 requests

Summary: the result is the same as in the previous test.

While testing, I found a similar discussion in the uvicorn repository.

From what I understand, in BentoML v1.4, the workers accept connections from a shared listening socket, and the operating system is responsible for load balancing incoming connection requests.
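
As a sanity check on that understanding, a small variant of the repro service above (illustration only; it assumes a plain string response is fine for a debug run) can return the handling worker's PID so the skew is visible directly from the client side:

# service3_pid.py -- debugging variant of service3.py above (illustration only)
import os
import time

import bentoml


@bentoml.service(workers=4, threads=1)
class Predictor:
    @bentoml.api
    def classify(self, input_ids: list[list[int]]) -> str:
        time.sleep(0.2)
        # os.getpid() identifies which of the 4 worker processes handled this call
        return f"worker_pid={os.getpid()} scores=[0.1, 0.2]"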

Q. My goal is to serve on Kubernetes by running multiple workers within a single container using BentoML. Would this approach be considered an anti-pattern?

I’d also like to know if there are any alternative solutions to address this. 🙏

chlee1016 commented on Aug 21 '25