Issue with pytest-xdist Handling Out of Memory Errors (IndexError)

Open · loveleenamar9 opened this issue 1 year ago · 1 comment

Hi, I am using pytest-xdist to run a test suite that includes subgraph tests. Sporadically, loading a large model exhausts memory and the worker process is terminated due to Out of Memory (OOM), after which the run fails with an IndexError. pytest-xdist handles other worker crashes gracefully, but it appears to struggle with crashes caused by OOM. The worker crash itself is expected; the problem is that the crashed worker is not replaced properly in this case, which leads to the IndexError.

Below is an example of the error log:

2024-10-27T21:28:18Z  tensorflow	[gw13] [ 70%] FAILED layerwise/Mistral7b/test_model_layers_0.py::test_model_layers_0 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	replacing crashed worker gw13
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> def worker_internal_error(
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         self, node: WorkerController, formatted_error: str
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     ) -> None:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         pytest_internalerror() was called on the worker.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         pytest_internalerror() arguments are an excinfo and an excrepr, which can't
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         be serialized, so we go with a poor man's solution of raising an exception
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         here ourselves using the formatted message.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         self._active_nodes.remove(node)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         try:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> >           assert False, formatted_error
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E           AssertionError: Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 271, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 session.exitstatus = doit(config, session) or 0
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 325, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 config.hook.pytest_runtestloop(session=session)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 182, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return outcome.get_result()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_result.py", line 100, in get_result
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 raise exc.with_traceback(exc.__traceback__)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 res = hook_impl.function(*args)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 174, in pytest_runtestloop
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 self.run_one_test()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 185, in run_one_test
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 item = items[self.item_index]
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E             IndexError: list index out of range
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E           assert False
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> /root/.local/lib/python3.10/site-packages/xdist/dsession.py:232: AssertionError
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 273, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 327, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 139, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 122, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/logging.py", line 796, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     res = hook_impl.function(*args)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     self.loop_once()
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 152, in loop_once
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     raise RuntimeError("Unexpectedly no active workers available")
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> RuntimeError: Unexpectedly no active workers available

The issue can be reproduced by creating a dummy test that allocates a large amount of memory:

```python
def test_oom():
    # Each inner list holds 1024**3 // 4 references, roughly 2 GiB on 64-bit CPython.
    large_memory_allocation = []
    for _ in range(175):
        large_memory_allocation.append([0] * (1024**3 // 4))
```

I suspect that the synchronization between the worker and the master process is not occurring correctly, leading to incomplete communication.

Note: This issue is observed only with a large test suite.

Could you please help identify what is causing this IndexError and how to resolve it, so that pytest-xdist can handle OOM errors gracefully?

Thanks! Loveleen.

loveleenamar9 · Nov 18 '24 12:11

This does indeed look like a missed case in worker restart.

It's possibly related to the OOM hard kill preventing messages from being sent.

Most normal worker restarts get some kind of message.
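
The difference between an ordinary crash and a hard kill can be sketched with a toy model. This is not xdist code and not its actual protocol (xdist talks to workers through execnet); the function name `run_child` and the message text are invented for illustration. A forked child stands in for a worker and a pipe stands in for the channel to the controller.

```python
import os
import signal


def run_child(hard_kill: bool) -> bytes:
    """Fork a child that reports over a pipe; return what the parent received."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: plays the role of a worker
        os.close(read_fd)
        try:
            if hard_kill:
                # SIGKILL cannot be caught, so except/finally never run.
                os.kill(os.getpid(), signal.SIGKILL)
            raise RuntimeError("ordinary crash")
        except RuntimeError as exc:
            # An ordinary crash still gets a chance to report itself.
            os.write(write_fd, f"crashed: {exc}".encode())
        finally:
            os._exit(1)
    # parent: plays the role of the controller
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        message = pipe.read()  # b"" at EOF if the child wrote nothing
    os.waitpid(pid, 0)  # reap the child
    return message
```

With `hard_kill=False` the parent receives the crash message; with `hard_kill=True` it only sees the pipe close and gets an empty byte string, so the controller has to infer what happened from the missing message alone.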

RonnyPfannschmidt · Nov 18 '24 14:11