
Running tests on a large number of CPUs/cores causes resource exhaustion

Open powderluv opened this issue 3 years ago • 2 comments

When running on systems with 64+ cores, you can run into issues with the tests trying to spawn cpu_count * 1.1 worker processes.

https://github.com/llvm/torch-mlir/blob/f245613b71b82eb2ad7ead22ef3499ebcd925a92/python/torch_mlir_e2e_test/torchscript/framework.py#L334


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/torchscript/framework.py", line 374, in worker
    compile_and_run_test(tests_dict[test_name], config))
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/torchscript/framework.py", line 301, in compile_and_run_test
    trace = config.run(compiled, golden_trace)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/torchscript/configs/eager_mode.py", line 59, in run
    outps = attr(*inps)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/test_suite/rng.py", line 222, in forward
    torch.flatten(torch.std(b)),
  File "/home/anush/github/torch-mlir/mlir_venv/lib/python3.10/site-packages/torch/_tensor.py", line 1265, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/eager_mode/torch_mlir_tensor.py", line 160, in __torch_dispatch__
    op_mlir_backend_callable = backend.compile(eager_module)
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/eager_backends/refbackend.py", line 68, in compile
    run_pipeline_with_repro_report(
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/compiler_utils.py", line 47, in run_pipeline_with_repro_report
    pm.run(module)
KeyboardInterrupt
Process ForkProcess-7:
Traceback (most recent call last):
  File "/home/anush/github/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/compiler_utils.py", line 47, in run_pipeline_with_repro_report
    pm.run(module)
RuntimeError: Failure while executing pass pipeline.

During handling of the above exception, another exception occurred:


We should cap it at a maximum of 16. I will send a PR.
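A minimal sketch of the proposed cap (the variable name is illustrative; the actual patch would go into the pool-sizing code in framework.py):

```python
import multiprocessing as mp

# Keep the slight oversubscription on small machines, but never fork
# more than 16 worker processes (the proposed limit), so a 64-core
# host no longer spawns 70 workers.
num_processes = min(16, int(mp.cpu_count() * 1.1))
print(num_processes)
```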

powderluv avatar Aug 28 '22 17:08 powderluv

You can work around this by setting ulimit -n unlimited, but we probably still want to cap the maximum number of parallel worker processes.
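The shell workaround raises the per-process open-file limit. The same limit can be inspected and raised from Python with the standard resource module (Unix-only; this is a sketch, not part of the test harness):

```python
import resource

# Current soft/hard limits on open file descriptors,
# i.e. what `ulimit -n` reports for this shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# An unprivileged process may raise its soft limit up to the hard
# limit; raising the hard limit itself (or to "unlimited") generally
# requires root. Pick a concrete target if the hard limit is infinite.
target = hard if hard != resource.RLIM_INFINITY else 1048576
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```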

powderluv avatar Aug 28 '22 17:08 powderluv

I run it with 100+ cores regularly. What issue are you seeing? Can you dig more into what is actually failing on your system?

silvasean avatar Aug 29 '22 20:08 silvasean