
--production flag slows down my requests

jerpint opened this issue 2 years ago · 21 comments

Describe the bug

When I add the --production flag to the bentoml serve command, model serving becomes extremely slow compared to running without the flag. The --production flag seems to run many predictions in parallel, which slows down processing. The exact same data (and outputs) are obtained in both cases.

To Reproduce

Expected behavior

I would expect the time to process an input to be identical (or faster) with --production.

Screenshots/Logs

Serve the model with: bentoml serve my_model:latest
Ping server: curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
Output: (screenshot)

Serve the model with: bentoml serve my_model:latest --production
Ping server: curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
Output: (screenshot)

Environment:

  • OS: Ubuntu 18.05
  • Python Version: 3.9.7
  • BentoML Version: bentoml @ git+https://github.com/bentoml/BentoML.git@3b4bc6a7e061285908a8c4b248c57ad52224c2d2

jerpint · Feb 13 '22

Hi @jerpint, thanks for reporting this. Would you mind sharing your bento service definition snippet? For example:

# bento.py
import bentoml
import bentoml.sklearn
import numpy as np

from bentoml.io import NumpyNdarray

# Load the runner for the latest ScikitLearn model we just saved
iris_clf_runner = bentoml.sklearn.load_runner("iris_clf:latest")

# Create the iris_classifier service with the ScikitLearn runner
# Multiple runners may be specified if needed in the runners array
# When packaged as a bento, the runners here will be included
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

# Create an API function with pre- and post-processing logic, using your new "svc" annotation
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_ndarray: np.ndarray) -> np.ndarray:
    # Define pre-processing logic
    result = iris_clf_runner.run(input_ndarray)
    # Define post-processing logic
    return result

You can redact sensitive information as needed.

bojiang · Feb 13 '22

It looks something like this (obfuscated a little):

import bentoml
import torch
import bentoml.sklearn
import numpy as np
import json
from yaml import load
import yaml

from bentoml.io import Text, JSON
from my_project import make_predict_dataloader

runner = bentoml.pytorch.load_runner("my_model:latest")

svc = bentoml.Service("my_app", runners=[runner])

CONFIG_FILEPATH = "config.yaml"
with open(CONFIG_FILEPATH, "r") as stream:
    hyper_params = load(stream, Loader=yaml.FullLoader)


@svc.api(input=Text(), output=JSON())
def predict(filename: str) -> dict:
    # Generate the dataloader with the soundfile
    predict_dataloader = make_predict_dataloader(filename, hyper_params)

    # Iterate through each batch and gather results
    all_probs = []
    for data in predict_dataloader:
        pred = runner.run_batch(data)
        probs = torch.nn.functional.softmax(pred, dim=1)
        all_probs.extend(probs.cpu().numpy())

    # Collect and post-process results
    all_probs = np.array(all_probs)
    avg_prob = np.sum(all_probs, 0) / len(all_probs)
    result = {
        "prob1": round((avg_prob[1] * 100), 2),
        "prob2": round((avg_prob[0] * 100), 2),
    }

    return json.dumps(result)

jerpint · Feb 13 '22

Short of finding a solution, is there a way to run the container without the --production flag? It seems to be the default behaviour when running docker run MY_IMAGE, and I observe the same slow behaviour and logs.

jerpint · Feb 14 '22

So this make_predict_dataloader function returns a list of data? Have you tried just passing the dataloader in and letting run_batch handle inference that way?

aarnphm · Feb 14 '22

make_predict_dataloader returns predict_dataloader, which is a typical pytorch torch.utils.data.dataloader.DataLoader class.

The idea is that in my case, a "single prediction" actually needs to iterate over many samples, which is why I create a dataset + dataloader for each sample. Think of one prediction as, e.g., a batched prediction over a collection of images: I iterate through each batch, collect the result for each batch (and so for each image), and the final result is an aggregation of all predictions.

If I pass the dataloader directly to run_batch, it fails, since the model doesn't expect a dataloader as input.

jerpint · Feb 14 '22

Would it be preferable to wrap everything in a dataset instead (without a batch size) and pass that directly to run_batch?

jerpint · Feb 14 '22

Short of finding a solution, is there a way to run the container without the --production flag? It seems to be the default behaviour when running docker run MY_IMAGE, and I observe the same slow behaviour and logs.

It is possible to start the container with your own command, including running bentoml serve without the flag: docker run -p 5000:5000 YOUR_IMAGE_TAG bentoml serve $BENTO_PATH. However, this is not recommended for production usage.

The reason it is slower under --production is most likely lots of round trips between the API server process and the runner process. In this case, I'd recommend rewriting your code in one of the following two styles, depending on how large the data returned from the data loader is:

  1. Call run_batch once with a batch of input data, as @aarnphm suggested, if the returned data is relatively small and your model can run inference on the entire data batch at once:
@svc.api(input=Text(), output=JSON())
def predict(filename: str) -> dict:
    ...  # pre processing
    preds = runner.run_batch([data for data in predict_dataloader])
    ...  # post processing
  2. Use many async runs, one per data point. This can take advantage of multiple Runner processes, and the actual batch size fed into runner#run_batch will be determined at runtime (i.e. it can handle a very large batch returned from predict_dataloader, as long as the latency allows):
import asyncio

@svc.api(input=Text(), output=JSON())
async def predict(filename: str) -> dict:
    ...  # pre processing
    preds = await asyncio.gather(
        *(runner.async_run(data) for data in predict_dataloader)
    )
    ...  # post processing
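
For your service specifically, adapting the snippet you shared above to the second style might look roughly like this. This is an untested sketch: svc, runner, hyper_params, make_predict_dataloader and the Text/JSON imports are the objects from your snippet, and the post-processing just mirrors your softmax/averaging logic.

import asyncio

import numpy as np
import torch

from bentoml.io import Text, JSON


@svc.api(input=Text(), output=JSON())
async def predict(filename: str) -> dict:
    # Pre-processing: build the dataloader for this request
    predict_dataloader = make_predict_dataloader(filename, hyper_params)

    # Dispatch every batch to the runner concurrently; under --production the
    # runner process decides the effective batch size at runtime.
    preds = await asyncio.gather(
        *(runner.async_run(data) for data in predict_dataloader)
    )

    # Post-processing: aggregate probabilities across all batches
    all_probs = np.concatenate(
        [torch.nn.functional.softmax(p, dim=1).cpu().numpy() for p in preds]
    )
    avg_prob = all_probs.mean(axis=0)
    return {
        "prob1": round(float(avg_prob[1]) * 100, 2),
        "prob2": round(float(avg_prob[0]) * 100, 2),
    }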

Hope that helps. We are working on related docs to help users understand how Runners work, why they help make your model serving workload more efficient, and how to avoid some common pitfalls.

parano · Feb 14 '22

Thanks, I've tried implementing option 1 using:

def predict(filename: str) -> dict:
    ...  # pre processing
    preds = runner.run_batch(torch.cat([data for data in predict_dataloader]))

This works, and I obtain the same output results. However, I still notice a significant speed decrease when passing the --production flag instead of the --reload flag. This time, though, the observed logs are the same.

jerpint · Feb 14 '22

Hi @jerpint, may I ask how you measured the speed decrease? Is it that the first request has a significantly longer wait? Were you able to measure an average over a period of time? That would be very helpful for us in understanding the issue here.

This works, and I obtain the same output results. However, I still notice a significant speed decrease when passing the --production flag instead of the --reload flag. This time, though, the observed logs are the same.

Note that the observed logs are not guaranteed to be the same when --production is applied: the runner will be scheduled in its own process and adaptive batching will be enabled.
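
Regarding measuring an average: even something as simple as the following would be more informative than timing a single curl (rough sketch using the requests library, with the endpoint and placeholder payload from the curl commands earlier in the thread):

import time

import requests

N = 100
url = "http://127.0.0.1:5000/predict"
payload = "path_to_file"  # placeholder body, same as in the curl examples

latencies = []
for _ in range(N):
    start = time.perf_counter()
    requests.post(url, data=payload, headers={"content-type": "application/json"})
    latencies.append(time.perf_counter() - start)

print(f"average latency over {N} requests: {sum(latencies) / N * 1000:.1f} ms")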

parano · Feb 14 '22

@parano nothing more than sending a curl to the server and seeing how long it takes between hitting enter and getting the response. In both cases the server is local (on localhost), the code is exactly the same, and all that changes is the flag at the end of the command. In one case, using --reload, the response is almost instant. In the other, using --production, the response is very slow (on the order of seconds and very noticeable). Other than changing the flag, everything else is exactly the same (same data, same response, same results, same outputs). The command is:

bentoml serve ./bento.py:svc --production vs bentoml serve ./bento.py:svc --reload

jerpint · Feb 14 '22

Hi guys! I experienced the same behavior as @jerpint. My tests were made with the https://k6.io/ tool.
When running the API call burst without the --production flag, with a duration of 10 s and one VU (virtual user), I get an average of 147 req/sec. Running the same test with the --production flag, also for 10 s, I got an average of 70 req/sec.

And running the same two tests again, but now adding a --vus 100 flag (100 virtual users) to k6, I got 261 req/sec and 79 req/sec respectively.

My model is a picklable_model wrapping a lightfm model. I can help with more tests if you need.

almirb · Feb 18 '22

Can you guys try with the latest changes on the main branch?

aarnphm · Feb 18 '22

Just did a fresh install with the latest main branch; still observing the same behaviour.

jerpint · Feb 18 '22

Hi, people! Experiencing similar behavior here.

  • OnnxRuntime using GPU on a resnet18-like model
  • Using 64 concurrent requests

Running without --production gives around 200/sec. Running with --production gives around 149/sec. Running the same code without BentoML in a single thread using batches of 32 gives around 350/sec.

Observations:

  • Non-production uses a single process
  • Production uses multiprocessing. I tried different settings, from the default to specifying --api-workers of 0, 1, 2, up to 8, without improvement
  • Non-production: CPU load ~70% on all cores, GPU load ~50%
  • Production: CPU load ~70% on one core and ~40% on the other cores, GPU load ~30%
  • I tried different settings for max_batch_size in the to_runner with little improvement (see the sketch after this list)
  • Increasing the number of concurrent requests to 128, 256 or 1024 does not improve the throughput rate
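
For reference, the to_runner tuning mentioned above looked roughly like this (sketch with a placeholder model tag and an illustrative batch size):

import bentoml

# Placeholder tag; max_batch_size is the knob I varied.
model = bentoml.models.get("my_resnet18:latest")
runner = model.to_runner(max_batch_size=64)
svc = bentoml.Service("resnet18_service", runners=[runner])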

Theory:

  • Somehow the multi-processing approach used in production makes the total throughput lower?
  • I also used an older version of BentoML (0.12.x) a while ago, and I remember that in that version I was able to get the GPU load close to 100% after some tuning of the number of batch workers. Because there is some pre- and post-processing, I want to use the multi-process approach of BentoML. Ideally the pre-/post-processing is done in parallel on the CPU while the model invocations happen as often as needed to keep the GPU close to 100% load. Currently neither CPU nor GPU is at 100%, even if I increase the concurrent requests to 128, 256 or even 1024.

I will keep searching and digging. But also following this issue.

udevnl · Aug 01 '22

Thanks for the detailed report @udevnl!

Just wanted to pop in here and let you all know that this is now being worked on, and performance is going to be much more of a priority moving forward.

We've managed to reproduce it and think we've found a major cause of this slowdown, and we're working on fixing it right now.

sauyon · Aug 04 '22

We've just merged #2267, which should speed things up quite a bit, especially for models with larger input. While I think there's still a little bit of room to improve everywhere in terms of speed, at the very least production throughput should be faster now than development.

Would you be able to try out the changes on the main branch? Or alternatively I'd be happy to ping you when we release this fix.

sauyon · Aug 05 '22

Thank you for looking into this so quickly! I just did some benchmarks: I installed the merge using pip install git+https://github.com/bentoml/BentoML.git and double-checked the updated source in site-packages. Given the same experimental setup:

  • non-production throughput ~210/sec
  • production throughput ~210/sec

It looks like the throughput is now similar in production, nice!

I did some additional tests which gave me some more observations:

  1. If I skip the call to the model inference runner, the throughput is ~540/sec (average image size 120kb)
  2. If I skip the call to the model inference runner, the throughput is ~740/sec (average image size 3.6kb)
  3. It seems the logging to the console also slows things down on my machine. If I redirect output to /dev/null this increases the above from ~740/sec to ~870/sec.

So this gives an idea of the latency caused by everything except the model runner.
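
"Skipping the call to the model inference runner" above just means short-circuiting the API function before the runner call, roughly like this (illustrative only; my real descriptors and preprocessing differ, and preprocess is a stand-in name):

from bentoml.io import Image, JSON

# svc and runner are defined as in a normal BentoML service.
@svc.api(input=Image(), output=JSON())
async def classify(img) -> dict:
    batch = preprocess(img)  # stand-in for the real pre-processing
    # result = await runner.async_run(batch)  # disabled to measure framework overhead only
    return {"skipped": True}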

When trying to push the maximum throughput even higher I found the following:

  1. The highest I got is ~270/sec in production, using the reduced image size of 3.6kb with default settings
  2. The GPU load gets close to 100% in non-production and to around 40% in production
  3. The CPU load stays below 40% on all cores in all cases (production and non-production)

Overall conclusions:

  1. The fix helped solve the slower production performance!
  2. I am still going to look for ways to increase the performance even more, since the GPU load still stays at ~40% and the CPU load is also low. I think there is some headroom somewhere in the form of latency, because I am only happy when either CPU or GPU gets maxed out ;-) If I find something I will post it.

Thank you for the fix

Some performance graphs...

Best run non-production: (screenshot)

Best run production: (screenshot)

udevnl · Aug 05 '22

Thank you so much for the detailed profiling @udevnl. We are actively working to address this problem. We'd still love to hear more about the experiments 🎉

aarnphm · Aug 05 '22

@udevnl Thanks for your help with benchmarking BentoML performance. If neither CPU nor GPU is saturated, the bottleneck could be in the API or Runner code itself. Here are some common pitfalls to watch for:

  • Not using an async API function definition. This could cause the API function to become the bottleneck.
  • Having CPU-intensive compute in the runner. Since the runner is a singleton, a CPU-intensive runner would saturate a single CPU core while leaving the other cores relatively underutilized. CPU-intensive preprocessing logic should be done in the API servers to leverage multi-core CPUs as much as possible.

If possible, we can get on a call to investigate more together. You can find us on Slack. If you could share a reproducer, that would help too.
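
As a rough illustration of both points (placeholder names throughout; the exact runner construction and call depend on your BentoML version and the model's saved signatures):

import numpy as np

import bentoml
from bentoml.io import Image, JSON

# Placeholder model tag; keep the runner limited to model inference.
runner = bentoml.models.get("my_model:latest").to_runner()
svc = bentoml.Service("my_service", runners=[runner])


@svc.api(input=Image(), output=JSON())
async def predict(img) -> dict:
    # CPU-intensive pre-processing runs here, in the API server workers,
    # where it can spread across multiple cores.
    arr = np.asarray(img, dtype=np.float32)[None, ...] / 255.0
    # async keeps the API worker free while waiting on the runner.
    pred = await runner.async_run(arr)  # or runner.<method>.async_run(...), depending on signatures
    return {"prediction": np.asarray(pred).tolist()}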

ssheng · Aug 09 '22

I'm also affected, but the development version has not fixed the issue. I'm measuring the time spent on computation in custom runners (directly in the runners) and the actual calls of the runners in the API function. In the dev setting (serve --reload) the measurements are the same: the overhead between the API and the runners is negligible. In the serve --production setting there is a 10 ms hit between the time measured in the runners and in the API function. I'm sending a ~600x1000 px image to the runners.

smidm · Aug 23 '22

@udevnl Thanks for your help with benchmarking BentoML performance. If neither CPU nor GPU is saturated, the bottleneck could be in the API or Runner code itself. Here are some common pitfalls to watch for:

  • Not using an async API function definition. This could cause the API function to become the bottleneck.
  • Having CPU-intensive compute in the runner. Since the runner is a singleton, a CPU-intensive runner would saturate a single CPU core while leaving the other cores relatively underutilized. CPU-intensive preprocessing logic should be done in the API servers to leverage multi-core CPUs as much as possible.

If possible, we can get on a call to investigate more together. You can find us on Slack. If you could share a reproducer, that would help too.

Hey! I did not get the second point about "Having CPU-intensive compute in the runner". The runner executes the inference, if I understood correctly, and that is CPU-bound. So a CPU-bound operation should definitely be performed in a runner, if I got it right... 🤔

francobocciDH · Sep 19 '22

@francobocciDH I read this and got confused too. Shouldn't it be in the runners?

benjamintanweihao · Jan 14 '23

Sorry about the super slow response; we haven't kept on top of our issues very well. I think that was referencing the case where the inference runs on GPU.

As this issue is now quite old, and I believe we've mostly solved the production interface issues, I'm going to close it now. If you're running into this, please feel free to open a new issue!

sauyon · Aug 30 '23