torch-mlir
Running tests on a large number of CPUs/cores causes resource exhaustion
When running on systems with 64+ cores, you can run into issues with the tests trying to spawn `cpu_count * 1.1` worker processes:
https://github.com/llvm/torch-mlir/blob/f245613b71b82eb2ad7ead22ef3499ebcd925a92/python/torch_mlir_e2e_test/torchscript/framework.py#L334
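The proposed fix is to cap the worker count instead of scaling it unboundedly with the core count. A minimal sketch of that idea (the `MAX_WORKERS` constant and `capped_worker_count` helper are illustrative, not the actual framework code):

```python
import multiprocessing as mp

# Illustrative cap: never spawn more than 16 parallel workers, even on
# machines with 64+ cores, instead of the current cpu_count() * 1.1.
MAX_WORKERS = 16

def capped_worker_count():
    # The framework currently computes cpu_count() * 1.1; cap that value.
    return min(MAX_WORKERS, int(mp.cpu_count() * 1.1))

def _worker(x):
    # Stand-in for compile_and_run_test on a single test.
    return x * x

if __name__ == "__main__":
    with mp.Pool(processes=capped_worker_count()) as pool:
        results = pool.map(_worker, range(8))
    print(results)
```

Each worker process opens file descriptors and allocates compiler state, so on a 128-core box the uncapped formula launches ~140 workers and exhausts the default `ulimit -n` long before the tests benefit from the extra parallelism.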
```
Process ForkProcess-7:
Traceback (most recent call last):
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/compiler_utils.py", line 47, in run_pipeline_with_repro_report
    pm.run(module)
RuntimeError: Failure while executing pass pipeline.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/torchscript/framework.py", line 374, in worker
    compile_and_run_test(tests_dict[test_name], config))
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/torchscript/framework.py", line 301, in compile_and_run_test
    trace = config.run(compiled, golden_trace)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/torchscript/configs/eager_mode.py", line 59, in run
    outps = attr(*inps)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/test_suite/rng.py", line 222, in forward
    torch.flatten(torch.std(b)),
  File "/home/anush/github/torch-mlir/mlir_venv/lib/python3.10/site-packages/torch/_tensor.py", line 1265, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/eager_mode/torch_mlir_tensor.py", line 160, in __torch_dispatch__
    op_mlir_backend_callable = backend.compile(eager_module)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/eager_backends/refbackend.py", line 68, in compile
    run_pipeline_with_repro_report(
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/compiler_utils.py", line 47, in run_pipeline_with_repro_report
    pm.run(module)
KeyboardInterrupt
```
We should cap it to a maximum of 16. Will send a PR.
You can work around this by setting `ulimit -n unlimited`, but we probably still want to cap the maximum number of parallel workers.
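The same workaround can be applied from Python without touching the shell, by raising the soft open-file limit toward the hard limit via the standard `resource` module. A hedged sketch (`raise_fd_limit` is an illustrative helper, not part of torch-mlir; some platforms reject raising the limit, hence the guard):

```python
import resource

def raise_fd_limit():
    # Equivalent in spirit to `ulimit -n`: lift the soft RLIMIT_NOFILE
    # (max open file descriptors) up to the hard limit for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    except (ValueError, OSError):
        # Some platforms (e.g. macOS) disallow certain raises; keep the
        # original soft limit in that case.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    print(raise_fd_limit())
```

This only papers over the symptom, though; capping the worker count fixes the actual resource pressure.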
I run it with 100+ cores regularly. What issue are you seeing? Can you dig more into what is actually failing on your system?