
Models with the same inference_pool_gid still create a new InferencePool and spawn N parallel workers

Open dinispeixoto opened this issue 5 months ago • 0 comments

Description

PR #2040 introduced a great feature that lets users define custom inference pools per model, instead of sharing a single pool across models with different loads.

However, there's a small bug in mlserver/parallel/registry.py on the code path taken when no environment tarball is provided:

if not env_tarball:
    return (
        self._pools.setdefault(
            inference_pool_gid,
            InferencePool(self._settings, on_worker_stop=self._on_worker_stop),
        )
        if inference_pool_gid
        else self._default_pool
    )

If inference_pool_gid already exists in self._pools, a new InferencePool instance is still constructed (and thus spawns N new worker processes), because Python evaluates setdefault's default argument before the key lookup ever happens.

From the InferencePool constructor:

def __init__(
    self,
    settings: Settings,
    env: Optional[Environment] = None,
    on_worker_stop: List[InferencePoolHook] = [],
):
    configure_inference_pool(settings)

    ...
    for _ in range(self._settings.parallel_workers): # spawning Python processes
        worker = _spawn_worker(self._settings, self._responses, self._env)
        self._workers[worker.pid] = worker  # type: ignore

This leads to redundant process creation even when the pool already exists.

Steps to reproduce

  1. Create two ML models with the same inference_pool_gid, e.g. in model-settings.json:
{
    "name": "foo",
    "implementation": "...",
    "parameters": {
        "inference_pool_gid": "bar"
    }
}

  2. Set parallel_workers to 2 in settings.json:
{
    "debug": true,
    "use_structured_logging": true,
    "parallel_workers": 2
}
  3. Start MLServer and check the number of worker processes:
ps -ef | grep spawn_main | grep python | wc -l

Expected: 4 processes (2 for the default pool + 2 for the custom bar pool).
Observed: 6 processes, because the second model's load constructs a second, throwaway InferencePool whose 2 workers are never used.

This can also be demonstrated in a Python shell:

>>> class Foo:
...     def __init__(self):
...             print("hello")
... 
>>> bar = {}
>>> bar.setdefault("1", Foo())
hello
<__main__.Foo object at 0x104916da0>
>>> bar.setdefault("1", Foo())
hello
<__main__.Foo object at 0x104916da0>
>>> 

setdefault() constructs a new Foo() on every call because its default argument is evaluated before the key lookup takes place.
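A common way to avoid the eager construction is to guard the assignment with an explicit membership test. This is a generic sketch of the pattern (not the MLServer patch); the instance counter is added only to make the difference observable:

```python
class Foo:
    instances = 0  # counts constructions, for demonstration only

    def __init__(self):
        Foo.instances += 1
        print("hello")


bar = {}


def get_or_make(d, key):
    # Unlike d.setdefault(key, Foo()), Foo() is only evaluated on a miss
    if key not in d:
        d[key] = Foo()
    return d[key]


first = get_or_make(bar, "1")   # prints "hello"
second = get_or_make(bar, "1")  # no print; the cached instance is returned
```

Because the constructor call sits inside the `if` body, it is never evaluated when the key is already present.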

Impact

  • Redundant worker processes are spawned on every duplicate pool lookup.
  • These processes are never used for inference but remain alive.
  • Can lead to high memory usage and degraded performance in production environments with multiple models or high worker counts.
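A possible fix is to replace the setdefault call in registry.py with an explicit lookup, so the InferencePool constructor (and its worker spawning) only runs on a cache miss. The sketch below uses hypothetical names (get_or_create_pool, make_pool) and a counting stand-in for InferencePool; it is not the actual MLServer patch:

```python
from typing import Callable, Dict, Optional


def get_or_create_pool(
    pools: Dict[str, object],
    gid: Optional[str],
    default_pool: object,
    make_pool: Callable[[], object],
) -> object:
    # Mirrors the `if not env_tarball:` branch, but only constructs a pool
    # (and therefore only spawns workers) when the gid is new.
    if not gid:
        return default_pool

    pool = pools.get(gid)
    if pool is None:
        pool = make_pool()  # evaluated lazily, on a cache miss only
        pools[gid] = pool
    return pool


# Demonstration: the factory records how many pools were constructed.
created = []
pools: Dict[str, object] = {}
default = object()

p1 = get_or_create_pool(pools, "bar", default, lambda: created.append(1) or object())
p2 = get_or_create_pool(pools, "bar", default, lambda: created.append(1) or object())
```

With this shape, the second lookup for "bar" returns the cached pool and the factory never runs a second time, so no extra workers are spawned.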

dinispeixoto, Oct 09 '25 08:10