
Performance problem when serving multiple MLflow models from one MLServer instance

Open MarcinSkrobczynski opened this issue 3 years ago • 5 comments

Hi,

I ran into a performance problem when serving 20 MLflow models.

To build the MLflow models I used the example described here: https://mlserver.readthedocs.io/en/latest/examples/mlflow/README.html

I changed only one thing compared to the example:

    alpha = float(sys.argv[1]) if len(sys.argv) > 1 else random.random()
    l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else random.random()

I used this randomization to end up with 20 slightly different models.
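
Roughly, generating the 20 runs was just re-running the example's train.py with no CLI arguments, something like the following sketch (assuming train.py from the linked example sits in the working directory):

# Sketch: rebuild the 20 runs by calling the example's train.py repeatedly
# with no arguments, so alpha and l1_ratio fall back to random.random().
import subprocess
import sys

for _ in range(20):
    subprocess.run([sys.executable, "train.py"], check=True)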

For every built model, I added a model-settings.json file whose name field is the ID of the MLflow run. Here is tree mlruns/0/299341124940457299c7d5624e04aaf1:

mlruns/0/299341124940457299c7d5624e04aaf1
├── artifacts
│   └── model
│       ├── conda.yaml
│       ├── MLmodel
│       ├── model.pkl
│       └── requirements.txt
├── meta.yaml
├── metrics
│   ├── mae
│   ├── r2
│   └── rmse
├── model-settings.json
├── params
│   ├── alpha
│   └── l1_ratio
└── tags
    ├── mlflow.log-model.history
    ├── mlflow.project.backend
    ├── mlflow.project.entryPoint
    ├── mlflow.project.env
    ├── mlflow.source.name
    ├── mlflow.source.type
    └── mlflow.user

and cat mlruns/0/299341124940457299c7d5624e04aaf1/model-settings.json:

{
    "name": "299341124940457299c7d5624e04aaf1",
    "implementation": "mlserver_mlflow.MLflowRuntime",
    "parameters": {
        "uri": "./artifacts/model/"
    }
}
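
For completeness, a small helper along these lines can generate that file for every run directory (just a sketch, assuming the mlruns/0 layout shown above):

# Sketch: write a model-settings.json into every run directory under
# mlruns/0, using the run ID as the model name.
import json
from pathlib import Path

for run_dir in Path("mlruns/0").iterdir():
    if not (run_dir / "artifacts" / "model").is_dir():
        continue  # skip meta.yaml and runs without a logged model
    settings = {
        "name": run_dir.name,
        "implementation": "mlserver_mlflow.MLflowRuntime",
        "parameters": {"uri": "./artifacts/model/"},
    }
    (run_dir / "model-settings.json").write_text(json.dumps(settings, indent=4))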

It was done similarly for the other 19 models. Then I started MLServer with mlserver start . and saw 20 log messages about successfully loaded models.

Finally, I used the payload (inference_request) from https://mlserver.readthedocs.io/en/latest/examples/mlflow/README.html#send-test-inference-request and ran it in a loop:

import requests

# `models` holds the 20 run IDs registered above; `inference_request` is the
# payload copied from the linked README.
for _ in range(20):
    for model in models:
        endpoint = f"http://localhost:3000/v2/models/{model}/infer"
        # nudge the first value so every request differs slightly
        inference_request["inputs"][0]["data"][0] += 0.1
        resp = requests.post(endpoint, json=inference_request)
        response = resp.json()
        print(response)

and then my machine froze to the point where I had to restart it.

I ran this setup again with a lower number of models (5) and observed that MLServer spawns several processes per model, allocating memory for 4 subprocesses each. CPU and memory usage is high during this initial phase, and the memory is never deallocated. With quick back-to-back requests (I would like to run a similar scenario as a substep of my ML processing), it looks like my machine will freeze every time.

Have you ever observed something like that? Have you tried to run about 20 models?

MarcinSkrobczynski avatar Feb 26 '22 17:02 MarcinSkrobczynski

Hi @MarcinSkrobczynski,

As you have spotted, MLServer spawns a number of worker processes (4 by default) for each model. This number is controlled by the parallel_workers setting and can be adjusted up or down as you need; for local testing, setting it to 1 might be helpful.
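
For example, on the 1.0.x line this can be set per model in model-settings.json (on more recent versions it can instead be set server-wide in settings.json), roughly like:

{
    "name": "299341124940457299c7d5624e04aaf1",
    "implementation": "mlserver_mlflow.MLflowRuntime",
    "parallel_workers": 1,
    "parameters": {
        "uri": "./artifacts/model/"
    }
}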

These workers are long-lived and until recently were created on the first inference request, hence the delay in responding to the first request for each model. However, that issue was fixed around two weeks ago - which version of MLServer are you using?

For your machine freezing, that sounds like the operating system swapping out to disk. If you monitor memory consumption and swap usage, do you see high memory pressure and swapping as you increase the number of models?
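
If it helps, a quick way to watch that while the benchmark runs is a small loop using the third-party psutil package, something like:

# Sketch: print RAM and swap usage once per second while the benchmark runs.
import time

import psutil  # third-party: pip install psutil

while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM {mem.percent:5.1f}%  swap {swap.percent:5.1f}% ({swap.used / 2**30:.2f} GiB used)")
    time.sleep(1)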

agrski avatar Feb 27 '22 18:02 agrski

To add on top of what @agrski mentioned, we're currently exploring other architectures that would allow inference to run in parallel across multiple models (https://github.com/SeldonIO/MLServer/issues/434). We spawn each worker as a separate process to avoid some of the problems that ML frameworks usually have with multiprocessing, and that leads to extra overhead per worker (due to loaded libraries, environment, etc.).

We're still in the design phase, but once we have something more solid we'll share it with the community on that other issue.

adriangonz avatar Feb 28 '22 10:02 adriangonz

@agrski I am using the versions from PyPI:

mlserver==1.0.0
mlserver-mlflow==1.0.0

Everything runs in a Docker container and, as I mentioned, memory usage is high. That is the only thing I could observe.

MarcinSkrobczynski avatar Feb 28 '22 11:02 MarcinSkrobczynski

Hi @MarcinSkrobczynski,

If you're using a sufficiently recent version of Docker, it supports memory limits and disabling swap space. You can set these to sensible values for the machine you're running on to reduce the risk of the whole machine becoming unresponsive; note that there's also a section on CPU limits.
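
As one illustration (not the only way to do it), the equivalent of docker run with --memory, --memory-swap and --cpus limits could be expressed with the Docker SDK for Python; the image name below is just a placeholder:

# Sketch: start the MLServer container with a hard RAM cap, swap effectively
# disabled (memswap_limit == mem_limit), and a CPU quota.
import docker  # third-party: pip install docker

client = docker.from_env()
container = client.containers.run(
    "my-mlserver-image:latest",  # placeholder image name
    mem_limit="4g",              # --memory 4g
    memswap_limit="4g",          # --memory-swap 4g => no swap on top of the RAM cap
    nano_cpus=2_000_000_000,     # --cpus 2
    ports={"8080/tcp": 3000},    # assuming MLServer listens on its default 8080 inside the container
    detach=True,
)
print(container.short_id)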

Regarding versions: the release notes don't seem to list resolved issues, so do you know off-hand, @adriangonz, whether that fix made it in? The dates are the same.

agrski avatar Feb 28 '22 11:02 agrski

Hi @agrski,

I know that Docker supports that. Thank you for the whole explanation.

MarcinSkrobczynski avatar Feb 28 '22 11:02 MarcinSkrobczynski

Hey @MarcinSkrobczynski,

We've recently released MLServer 1.1.0, which includes a number of improvements around parallel inference, particularly around memory usage. It would be great if you could give that a try.

In the meantime, I'll be closing this issue, since it should be solved by the new changes. However, feel free to reopen if you still experience performance issues with 1.1.0.

adriangonz avatar Sep 07 '22 16:09 adriangonz