Run torch_fx tests in a spawn process to avoid memory issue
The documentation is not available anymore as the PR was closed or merged.
- without new process
  - 2~3 minutes for 100 runs
  - 15 MB leak per run
- with `fork`
  - 5 minutes for 100 runs
  - 1 MB leak per run
  - hangs if `MKL_NUM_THREADS` > 1
- with `spawn`
  - 30 minutes for 100 runs
  - 1 MB leak per run
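For context, here is a minimal sketch of the approach (illustrative names such as `run_in_subprocess`, not the PR's exact code): the parent sends picklable inputs through a queue, a child process with a configurable start method runs the check, and any error is reported back as a formatted traceback. Any memory leaked by the check is reclaimed when the child exits.

```python
import multiprocessing
import traceback


def _child(in_queue, out_queue):
    # Runs in the child process: pull the inputs, run the check, report errors.
    error = None
    try:
        payload = in_queue.get(timeout=30)
        # ... run the torch.fx tracing check on `payload` here ...
    except Exception:
        error = traceback.format_exc()
    out_queue.put(error, timeout=30)


def run_in_subprocess(payload, start_method="spawn"):
    ctx = multiprocessing.get_context(start_method)
    in_queue, out_queue = ctx.Queue(), ctx.Queue()
    in_queue.put(payload, timeout=30)
    process = ctx.Process(target=_child, args=(in_queue, out_queue))
    process.start()
    error = out_queue.get(timeout=600)
    process.join(timeout=600)
    if error is not None:
        raise RuntimeError(error)
```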
When using the new process approach, setting `ulimit -n 2048` is necessary in some cases (for example, when running the same test in a loop). Otherwise, we might get the following error:
tests/models/bart/test_modeling_bart.py::BartModelTest::test_torch_fx Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/queues.py", line 245, in _feed
File "/usr/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
File "/home/yih_dar_huggingface_co/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
RuntimeError: unable to open shared memory object </torch_46201_690006289_939> in read-write mode: Too many open files (24)
More details:
> ???
tests/test_modeling_common.py:769:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_modeling_common.py:866: in _create_and_check_torch_fx_tracing
???
/usr/lib/python3.9/multiprocessing/process.py:121: in start
???
/usr/lib/python3.9/multiprocessing/context.py:277: in _Popen
???
/usr/lib/python3.9/multiprocessing/popen_fork.py:19: in __init__
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <multiprocessing.popen_fork.Popen object at 0x7fa12a499820>, process_obj = <ForkProcess name='ForkProcess-10' parent=46201 initial>
> ???
E OSError: [Errno 24] Too many open files
/usr/lib/python3.9/multiprocessing/popen_fork.py:64: OSError
This seems to relate to torch multiprocessing: https://discuss.pytorch.org/t/runtimeerror-unable-to-open-shared-memory-object-depending-on-the-model/116090
Another related issue (not torch): https://github.com/lava-nc/lava/issues/71
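As a side note, the same open-file limit can be inspected and raised from Python via the standard library's `resource` module (an illustration, not part of the PR):

```python
import resource

# Check the current open-file limits (the equivalent of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Raise the soft limit toward 2048 without exceeding the hard limit
# (exceeding the hard limit requires elevated privileges).
target = 2048 if hard == resource.RLIM_INFINITY else min(2048, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```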
With GPU, we have to use `spawn`, otherwise we get:
Process ForkProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/transformers/tests/test_modeling_common.py", line 143, in _run_torch_jit
model, input_names, filtered_inputs = in_queue.get(timeout=30)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 112, in rebuild_cuda_tensor
torch.cuda._lazy_init()
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 207, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I think it's safe to only run those tests on CPU. Also, when running locally it takes ~1 min (although I agree my machine might be more powerful).
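To illustrate the constraint (illustrative helpers, not the PR's code): a forked child cannot re-initialize CUDA, so one either uses the `spawn` start method or makes sure everything handed to the child lives on the CPU.

```python
import multiprocessing

import torch


def pick_start_method():
    # Once the parent process has initialized CUDA, only `spawn` is safe.
    return "spawn" if torch.cuda.is_initialized() else "fork"


def detach_to_cpu(inputs):
    # Moving tensors to CPU before queueing them lets a forked, CPU-only
    # child rebuild them without touching CUDA at all.
    return {k: v.detach().cpu() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}


ctx = multiprocessing.get_context(pick_start_method())
```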
@michaelbenayoun
I moved (almost) the whole testing logic to the child process. One more advantage here is that the model is created in the child process, so we don't need to pass it between processes.
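Concretely, the parent only queues lightweight, picklable objects (e.g. the model class and its config), and the heavy model object never crosses the process boundary. A hedged sketch with illustrative names:

```python
import traceback


def _build_and_check_in_child(in_queue, out_queue):
    # The model is instantiated inside the child process, so the heavy model
    # object is never pickled and sent across the process boundary.
    try:
        model_class, config = in_queue.get(timeout=30)
        model = model_class(config)
        model.eval()
        # ... run the torch.fx tracing checks on `model` here ...
        out_queue.put(None, timeout=30)
    except Exception:
        out_queue.put(traceback.format_exc(), timeout=30)
```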
Now, running 100 times, we see only a 0.15 MB increase in memory usage per run.
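One way to estimate such per-run growth (not necessarily how these numbers were collected) is to sample the resident set size around repeated runs, for example with `psutil`:

```python
import psutil


def leak_per_run_mb(run_once, n_runs=100):
    # Sample resident memory before and after repeated runs and report the
    # average growth per run, in MB.
    proc = psutil.Process()
    start = proc.memory_info().rss
    for _ in range(n_runs):
        run_once()
    end = proc.memory_info().rss
    return (end - start) / n_runs / (1024 * 1024)
```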
@michaelbenayoun You are right, some models override `_create_and_check_torch_fx_tracing`. This won't fail this PR, however: those models will just run the `test_torch_fx*` tests in the current manner (i.e. not in the child process). I will take a look at whether those overrides are necessary. In any case, we can merge this PR as it is (if you are happy with it), and I will work on those models later.
I think it's okay now with the changes you've made!
Would love to have an approval from you, @michaelbenayoun. But no need to rush - as long as you are finally happy with the change and click the button.
ready for @sgugger and/or @LysandreJik to have a final check 🚀
I will merge this afternoon, after adding a short comment in `_create_and_check_torch_fx_tracing` explaining why we need this change, with a link to #18525.
Hi @michaelbenayoun, I just saw that I fixed a similar issue a few months ago
https://github.com/huggingface/transformers/blob/fbf382c84da4506484a23e85bd8540da5192ff4e/tests/test_modeling_common.py#L719
(for `_create_and_check_torchscript`). I am going to change this PR to simply apply that fix. Is that OK with you?
Changed the PR to simply call `clear_torch_jit_class_registry`. The test failure is unrelated to this PR - merging now.
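For reference, a hedged sketch of the kind of cleanup such a helper performs, using PyTorch's private TorchScript state APIs (the actual implementation lives at the link above and may differ):

```python
import torch


def clear_torch_jit_class_registry():
    # Drop TorchScript's global class registry so that repeated jit tracing
    # and scripting in a long test session does not accumulate state.
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    # Not available on every torch version, hence the guard.
    if hasattr(torch.jit._state, "_clear_class_state"):
        torch.jit._state._clear_class_state()
```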