
Run torch_fx tests in a spawn process to avoid memory issue

Open · ydshieh opened this pull request 3 years ago • 1 comment

What does this PR do?

Run torch_fx tests in a spawn process to avoid memory issue.

ydshieh avatar Aug 09 '22 16:08 ydshieh

The documentation is not available anymore as the PR was closed or merged.

  • without a new process

    • 2~3 minutes for 100 runs
    • 15 MB leak per run
  • with fork

    • 5 minutes for 100 runs
    • 1 MB leak per run
    • hangs if MKL_NUM_THREADS > 1
  • with spawn

    • 30 minutes for 100 runs
    • 1 MB leak per run
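
For reference, per-run leak numbers like the ones above can be collected with a rough measurement along these lines. This is only a minimal sketch, not the harness actually used for the figures above: it assumes psutil is available, and run_fx_test_once is a hypothetical stand-in for a single torch_fx tracing run.

```python
import os

import psutil  # assumption: psutil is installed in the test environment


def measure_leak_per_run(run_once, n_runs=100):
    """Call `run_once` repeatedly and return the average RSS growth per run in MB."""
    proc = psutil.Process(os.getpid())
    start_rss = proc.memory_info().rss
    for _ in range(n_runs):
        run_once()
    end_rss = proc.memory_info().rss
    return (end_rss - start_rss) / n_runs / (1024 * 1024)


# Usage (run_fx_test_once is hypothetical):
# print(f"~{measure_leak_per_run(run_fx_test_once):.2f} MB leaked per run")
```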

ydshieh avatar Aug 10 '22 07:08 ydshieh

When using the new-process approach, setting ulimit -n 2048 is sometimes necessary (for example, when running the same test in a loop).

Otherwise, we might get the following error:

tests/models/bart/test_modeling_bart.py::BartModelTest::test_torch_fx Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 245, in _feed
  File "/usr/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
  File "/home/yih_dar_huggingface_co/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
RuntimeError: unable to open shared memory object </torch_46201_690006289_939> in read-write mode: Too many open files (24)

More details:

>   ???

tests/test_modeling_common.py:769: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_modeling_common.py:866: in _create_and_check_torch_fx_tracing
    ???
/usr/lib/python3.9/multiprocessing/process.py:121: in start
    ???
/usr/lib/python3.9/multiprocessing/context.py:277: in _Popen
    ???
/usr/lib/python3.9/multiprocessing/popen_fork.py:19: in __init__
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <multiprocessing.popen_fork.Popen object at 0x7fa12a499820>, process_obj = <ForkProcess name='ForkProcess-10' parent=46201 initial>

>   ???
E   OSError: [Errno 24] Too many open files

/usr/lib/python3.9/multiprocessing/popen_fork.py:64: OSError

This seems to relate to torch multiprocessing: https://discuss.pytorch.org/t/runtimeerror-unable-to-open-shared-memory-object-depending-on-the-model/116090

Another related issue (not torch): https://github.com/lava-nc/lava/issues/71
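
As an alternative to running ulimit -n 2048 in the shell, the soft limit on open file descriptors can also be raised from inside Python with the standard resource module (Linux/macOS only); a minimal sketch:

```python
import resource

# Current soft/hard limits on the number of open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit to 2048 if it is lower (assumes the hard limit is at
# least 2048; an unprivileged process can only raise the soft limit up to
# the hard limit, which stays untouched here).
if soft < 2048:
    resource.setrlimit(resource.RLIMIT_NOFILE, (2048, hard))
```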

ydshieh avatar Aug 10 '22 07:08 ydshieh

With GPU, we have to use spawn; otherwise we get:

Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/transformers/tests/test_modeling_common.py", line 143, in _run_torch_jit
    model, input_names, filtered_inputs = in_queue.get(timeout=30)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 112, in rebuild_cuda_tensor
    torch.cuda._lazy_init()
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 207, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
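
A minimal sketch of the spawn approach (the worker below is illustrative, not the PR's actual test code): with a spawned child, CUDA is initialized only inside the fresh process, so the re-initialization error above cannot occur.

```python
import multiprocessing

import torch


def worker(out_queue):
    # Illustrative child: CUDA is initialized here for the first time, which is
    # only legal because the process was spawned rather than forked.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.ones(2, 2, device=device)
    out_queue.put(x.sum().item())  # send back a plain Python float, not a tensor


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")  # fork would hit the CUDA re-init error
    out_queue = ctx.Queue()
    p = ctx.Process(target=worker, args=(out_queue,))
    p.start()
    print(out_queue.get(timeout=30))
    p.join()
```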

ydshieh avatar Aug 10 '22 09:08 ydshieh

I think it's safe to only run those tests on CPU. Also, when running locally it takes ~1 min (although I agree my machine might be more powerful).

michaelbenayoun avatar Aug 11 '22 12:08 michaelbenayoun

@michaelbenayoun

I moved (almost) the whole testing logic to the child process. One more advantage here is that the model is created in the child process, so we don't need to pass it between processes.

Now, running 100 times, memory usage increases by only 0.15 MB per run.
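
Roughly, the idea looks like the sketch below (illustrative only, not the PR's actual code; the nn.Linear model and fx.symbolic_trace call stand in for the real transformers model and tracing checks). Because the child builds the model itself, only a small, picklable result crosses the process boundary.

```python
import multiprocessing
import traceback

import torch
from torch import fx, nn


def _child_fx_test(out_queue, hidden_size):
    # The model is created inside the child, so nothing heavyweight has to be
    # pickled between processes; only (success, error) is sent back.
    try:
        model = nn.Linear(hidden_size, hidden_size)  # stand-in for a transformers model
        traced = fx.symbolic_trace(model)            # stand-in for the real fx tracing checks
        x = torch.randn(2, hidden_size)
        torch.testing.assert_close(traced(x), model(x))
        out_queue.put((True, None))
    except Exception:
        out_queue.put((False, traceback.format_exc()))


def run_fx_test_in_subprocess(hidden_size=16, timeout=600):
    ctx = multiprocessing.get_context("spawn")
    out_queue = ctx.Queue()
    process = ctx.Process(target=_child_fx_test, args=(out_queue, hidden_size))
    process.start()
    success, error = out_queue.get(timeout=timeout)
    process.join()
    assert success, error
```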

ydshieh avatar Aug 12 '22 11:08 ydshieh

@michaelbenayoun You are right, some models override _create_and_check_torch_fx_tracing. This won't block this PR, however: those models will just run the test_torch_fx* tests in the current manner (i.e. not in a child process). I will take a look to see whether those overrides are necessary. In any case, we can merge this PR as it is (if you are happy with it), and I will work on those models later.

ydshieh avatar Aug 17 '22 13:08 ydshieh

I think it's okay now with the changes you've made!

michaelbenayoun avatar Aug 22 '22 10:08 michaelbenayoun

> I think it's okay now with the changes you've made!

Would love to have an approval from you, @michaelbenayoun. But no need to rush - take your time, and click the button once you are happy with the change.

ydshieh avatar Aug 22 '22 11:08 ydshieh

ready for @sgugger and/or @LysandreJik to have a final check 🚀

ydshieh avatar Aug 22 '22 14:08 ydshieh

I will merge this afternoon, after adding a short comment in _create_and_check_torch_fx_tracing explaining why we need this change, with a link to #18525.

ydshieh avatar Aug 24 '22 10:08 ydshieh

Hi @michaelbenayoun, I just saw that I fixed a similar issue a few months ago

https://github.com/huggingface/transformers/blob/fbf382c84da4506484a23e85bd8540da5192ff4e/tests/test_modeling_common.py#L719

(for _create_and_check_torchscript). I am going to change this PR to simply apply that fix. Is that OK with you?

ydshieh avatar Aug 25 '22 14:08 ydshieh

Changed the PR to simply call clear_torch_jit_class_registry. The test failure is unrelated to this PR - merging now.
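
For reference, a cleanup helper along these lines (a sketch only; it relies on private torch internals that may differ between versions) can be called after each run so that classes registered with the JIT do not accumulate:

```python
import torch


def clear_torch_jit_class_registry():
    # Drop classes accumulated in torch's internal JIT class registry
    # (private API; may change between torch versions).
    torch._C._jit_clear_class_registry()
    # Also reset the cached concrete types used by torch.jit scripting,
    # guarded in case this private attribute moves in a future release.
    if hasattr(torch.jit._recursive, "concrete_type_store"):
        torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
```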

ydshieh avatar Aug 29 '22 09:08 ydshieh