BentoML bug: Batching async run issue

Describe the bug

I am using a custom runner model.

I am testing batching by enabling it on the runner and running the BentoML server with the --production flag. I have configured a timeout of 60 seconds and a batching configuration of 100 ms timeout and 1000 as the max batch size.

Whenever I send requests with batching enabled, I get the error below:

2022-11-08T13:26:15+0000 [INFO] [api_server:1] 127.0.0.1:35046 (scheme=http,method=POST,path=/v1/get_intents,type=application/json,length=91) (status=200,type=application/json,length=20) 5802.654ms (trace=456b2c38d391f41db16db8d862fcc898,span=ed66b68b017a42fe,sampled=0)
Traceback (most recent call last):
  File "/workspace/personality_framework/personality_service/bento_service.py", line 238, in get_intent
    result=await runner1.is_positive.async_run([{"sentence":query}])
  File "/tmp/e2/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 53, in async_run
    return await self.runner._runner_handle.async_run_method(  # type: ignore
  File "/tmp/e2/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 207, in async_run_method
    raise ServiceUnavailable(body.decode()) from None
bentoml.exceptions.ServiceUnavailable: Service Busy

However, the same code works without any issues when async is not used.

Below is the test file:

import os
import json
import bentoml

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from bentoml.io import Text, JSON
model = KNeighborsClassifier()
iris = load_iris()
X = iris.data[:, :4]
Y = iris.target
model.fit(X, Y)


class QueryAnalysisRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu","cpu")
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self, context):


        self.model = model


    @bentoml.Runnable.method(batchable=True)
    def is_positive(self, input_text):
        print("batch size is ",len(input_text))
        scores = self.model.predict(input_text)
        return scores

ctx = {}
ctx["system_properties"] = {}
ctx["system_properties"]["model_dir"] = "model/bentoml/"


runner1 = bentoml.Runner(
    QueryAnalysisRunnable,
    name="testhandler",
    runnable_init_params={
         "context": ctx,
    },
    max_batch_size=1000,
    max_latency_ms=250,
)


svc = bentoml.Service("test_service", runners=[runner1])

@svc.api(input=JSON(),
         output=JSON(),
         route="v1/test"
         )
def test(input: JSON) -> JSON:
    output_json = {}
    data = input
    try:
        #data = request.get_json()
        print("\nInput JSON: {}".format(data))
        if all(key in data for key in ['query', 'timezone', 'skill_id']):
            query = data['query']
            timezone = data['timezone']
            skill_id = data['skill_id']

            a=[[0.1,2,3,4]]
            result= runner1.is_positive.run(a)
            #print(result)

            #output_json={}
            output_json["result"]=result
            output_json['status'] = 'success'
        else:
            output_json['status'] = 'failure'
    except Exception as e:
        import traceback
        traceback.print_exc()
        output_json['status'] = 'failure'
        print("An exception occurred: {}".format(e))  # Log and TAG are not defined in this file

    return output_json



@svc.api(input=JSON(),
         output=JSON(),
         route="v1/test1"
         )
async def test1(input: JSON) -> JSON:
    output_json = {}
    data = input
    try:
        #data = request.get_json()
        print("\nInput JSON: {}".format(data))
        if all(key in data for key in ['query', 'timezone', 'skill_id']):
            query = data['query']
            timezone = data['timezone']
            skill_id = data['skill_id']

            a=[[0.1,2,3,4]]
            result= await runner1.is_positive.async_run(a)
            #print(result)

            #output_json={}
            output_json["result"]=result
            output_json['status'] = 'success'
        else:
            output_json['status'] = 'failure'
    except Exception as e:
        import traceback
        traceback.print_exc()
        output_json['status'] = 'failure'
        print("An exception occurred: {}".format(e))  # Log and TAG are not defined in this file

    return output_json

To reproduce

To reproduce the issue, save the above code in a file named test.py and install the dependencies:

pip install bentoml scikit-learn==1.0.2 numpy scipy

Run the code:

bentoml serve --production test.py --api-workers=1

Then run the test commands below.

Synchronous mode:

url="http://0.0.0.0:3000/v1/test"
for x in {1..500}; do curl --request POST $url --header 'Content-Type: application/json' --data-raw '{"query":"are you a real robot","timezone":"Asia/Kolkata", "lang_code":"en","skill_id":"0"}' & done

Async mode, using the async_run method to call the runner:

url="http://0.0.0.0:3000/v1/test1"
for x in {1..500}; do curl --request POST $url --header 'Content-Type: application/json' --data-raw '{"query":"are you a real robot","timezone":"Asia/Kolkata", "lang_code":"en","skill_id":"0"}' & done

The async_run method produces many failures with the error shown above.
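For anyone who wants to count failures rather than read them off the console, the curl loops can be approximated with a small Python client. This is only a sketch using the standard library: the helper names post_once and load_test, the concurrency default, and the outcome labels are hypothetical and not part of BentoML. It tallies the "status" field of each JSON response, and reports non-200 responses as "http <code>".

```python
import json
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Same payload as the curl commands in the report.
PAYLOAD = {
    "query": "are you a real robot",
    "timezone": "Asia/Kolkata",
    "lang_code": "en",
    "skill_id": "0",
}


def post_once(url, timeout=60.0):
    """POST the payload once and return a short outcome label."""
    request = urllib.request.Request(
        url,
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Non-2xx responses, e.g. a 503 from the server.
        return f"http {err.code}"
    except (urllib.error.URLError, OSError, ValueError) as err:
        # Connection failures or a body that is not valid JSON.
        return f"error {type(err).__name__}"
    if isinstance(body, dict):
        return str(body.get("status", "unknown"))
    return "unknown"


def load_test(url, total=500, concurrency=100):
    """Send `total` requests with up to `concurrency` in flight and tally outcomes.

    The concurrency default is an arbitrary choice for this sketch; the curl
    loop in the report starts all 500 requests in the background at once.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(post_once, [url] * total))
```

With the server from the report running, load_test("http://0.0.0.0:3000/v1/test") and load_test("http://0.0.0.0:3000/v1/test1") would return one Counter per endpoint, which makes the sync and async failure counts directly comparable.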

Expected behavior

In async mode, the service should handle all requests successfully and support higher throughput than synchronous mode.

Traceback (most recent call last):
  File "/workspace/personality_framework/personality_service/test.py", line 108, in test1
    result= await runner1.is_positive.async_run(a)
  File "/workspace/personality_framework/personality_service/r2/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 53, in async_run
    return await self.runner._runner_handle.async_run_method(  # type: ignore
  File "/workspace/personality_framework/personality_service/r2/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 207, in async_run_method
    raise ServiceUnavailable(body.decode()) from None
bentoml.exceptions.ServiceUnavailable: Service Busy

Without async_run, the script executes successfully with no errors.

Environment

configuration.yml

runners:
  timeout: 60
  logging:
      access:
        enabled: false
        request_content_length: true
        request_content_type: true
        response_content_length: true
        response_content_type: true
  metrics:
      enabled: false
      namespace: bentoml_runner
  batching:
    enabled: true
    max_batch_size: 100
    max_latency_ms: 1000
api_server:
  timeout: 60

Nov 08 '22 13:11 pi2cto

same issue here, any solution?

Nov 22 '22 11:11 DeepDarkOdyssey

Same issue. It was quite smooth for me on 0.13-lts, but I started having this issue with 1.0.

Nov 29 '22 20:11 zhangyilun

I got the same issue. Any advice?

Jan 04 '23 06:01 hoangphucITJP

Facing the same issue. Is there a way to fix it?

May 23 '23 06:05 przdev

@przdev I tried the steps in the original post and couldn't manage to reproduce it. Since it is several months old, can you provide the steps to reproduce it?

May 23 '23 07:05 frostming

I am having the same issue, has anyone found a solution?

Oct 30 '23 10:10 KaluginD

This looks like max_latency_ms is just set too low. Have you tried increasing that value?

Oct 30 '23 14:10 sauyon
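One way to read the suggestion above is with a toy model of a burst hitting a batching queue. This is not BentoML's dispatcher. It assumes, without confirmation from this thread, that a request still queued after waiting longer than max_latency_ms is rejected. The function name simulate_burst and the batch_time_ms parameter (how long the runner takes per batch) are hypothetical.

```python
def simulate_burst(n_requests, max_batch_size, max_latency_ms, batch_time_ms):
    """Toy model: all requests arrive at t=0 and batches run one after another.

    Returns (served, rejected). Requests still waiting once the elapsed time
    exceeds max_latency_ms are counted as rejected. This is an assumption made
    for illustration, not a description of BentoML internals.
    """
    served = 0
    pending = n_requests
    waited_ms = 0.0
    while pending > 0 and waited_ms <= max_latency_ms:
        batch = min(max_batch_size, pending)  # take at most one full batch
        served += batch
        pending -= batch
        waited_ms += batch_time_ms  # everyone still queued waits for this batch
    return served, pending


# 500 requests, batches of 100, and a hypothetical 100 ms per batch.
print(simulate_burst(500, 100, 250, 100))   # (300, 200)
print(simulate_burst(500, 100, 1000, 100))  # (500, 0)
```

Under these assumptions, a 250 ms budget leaves 200 of the 500 requests unserved, while a 1000 ms budget clears the whole burst. Real behavior depends on the actual batch processing time and on how the server schedules requests.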
