MacOS state_dict tests in CI are failing during shutdown
🐛 Describe the bug
The macOS tests in the StatefulDataLoader CI action fail intermittently during shutdown. On macOS, shutdown also takes far longer than on Windows or Ubuntu (10 minutes vs. 1 s). I'm not sure what causes GitHub Actions to mark the test as failed. I created an issue on actions/setup-python, but there has been no response yet: https://github.com/actions/setup-python/issues/857
Although we still get positive signals from the test, the run shows up as an X in GitHub.
Versions
Nightly / main branch in CI
I've been trying to isolate the problem on this branch: https://github.com/pytorch/data/pull/1255. I'm unable to repro on my Mac laptop, so I'm bisecting by kicking off CI runs; so far it's definitely due to test_state_dict.py.
The best sign I get is that sometimes the "complete jobs" or "cleanup python" logs will show a bunch of "Terminate orphan process: pid (xxxxx) (torch_shm_manager)" messages.
Digging into the docs and code, it looks like the default sharing strategy on macOS is file_system (instead of file_descriptor), which launches a torch_shm_manager process in the background. It gets launched here, but the PID is never held on to, and there is no obvious cleanup code that gets called: https://github.com/pytorch/pytorch/blob/main/torch/lib/libshm/core.cpp?fbclid=IwAR0DG3o68svdVDUkMCbb-0KM95IzxpsAeWS27m57fWAx84su9stZbsa3H_4#L25
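For anyone wanting to confirm which sharing strategy a given runner actually uses, here is a minimal sketch. It relies on torch.multiprocessing.get_sharing_strategy() and get_all_sharing_strategies(); the helper name describe_sharing_strategy is made up for illustration, and it returns None when torch is not installed.

```python
import sys


def describe_sharing_strategy():
    """Report the tensor sharing strategy PyTorch is using on this platform.

    Hypothetical helper for debugging only. Returns None if torch is missing.
    """
    try:
        import torch.multiprocessing as mp
    except ImportError:
        return None
    return {
        "platform": sys.platform,
        # The strategy currently in effect for sharing CPU tensors.
        "current": mp.get_sharing_strategy(),
        # Every strategy supported on this platform.
        "available": sorted(mp.get_all_sharing_strategies()),
    }


if __name__ == "__main__":
    print(describe_sharing_strategy())
```

Printing this at the start of the CI job should show whether the macOS runner is on file_system, which is the strategy that involves torch_shm_manager.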
It seems that on macOS, multiprocessing fork behaves more like spawn and requires importing all the modules again. Something about increasing the total number of worker subprocesses in the test causes massive slowdowns in cleanup. The simplest thing to do at this point is to shard the tests. I'll probably give this a shot tomorrow.
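One possible way to shard, sketched below under assumptions: the helper shard_tests and the test names are hypothetical, and this is not how the repo's CI is currently configured. Each CI job would run only its own slice, so fewer worker subprocesses are alive in any single job at shutdown.

```python
def shard_tests(test_ids, num_shards, shard_index):
    """Return the subset of test_ids assigned to one shard.

    Hypothetical helper: sorts the ids so every job sees the same order,
    then assigns them round-robin so shard sizes differ by at most one.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return sorted(test_ids)[shard_index::num_shards]


if __name__ == "__main__":
    # Made-up test names, for illustration only.
    all_tests = [f"test_state_dict_case_{i}" for i in range(7)]
    for index in range(3):
        print(index, shard_tests(all_tests, num_shards=3, shard_index=index))
```

The selected ids could then be passed to the test runner, for example as explicit test node ids, with the shard index taken from the CI matrix.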