
MacOS state_dict tests in CI are failing during shutdown

Open andrewkho opened this issue 9 months ago • 2 comments

🐛 Describe the bug

MacOS tests of the StatefulDataLoader CI action fail intermittently during shutdown. On Mac, shutdown also takes much longer than on Windows or Ubuntu (10 minutes vs 1s). I'm not sure what causes GitHub Actions to mark the test as failed. I created an issue on actions/setup-python but have had no response yet: https://github.com/actions/setup-python/issues/857

Although we still get positive signals from the test, it shows up as an X in GitHub.

Versions

Nightly / main branch in CI

andrewkho avatar May 09 '24 23:05 andrewkho

I've been trying to isolate the problem on this branch: https://github.com/pytorch/data/pull/1255. I'm unable to repro on my Mac laptop, so I'm just trying to bisect it by kicking off CI runs. So far it's definitely due to test_state_dict.py.

The best sign I get is that sometimes the complete job or cleanup python logs will show a bunch of lines like: Terminate orphan process: pid (xxxxx) (torch_shm_manager).

Digging into the docs and code, it looks like on MacOS the default sharing strategy is file_system (instead of file_descriptor), which launches a torch_shm_manager process in the background. It gets launched at the link below, but the PID is never held on to, and there is no obvious cleanup code that gets called. https://github.com/pytorch/pytorch/blob/main/torch/lib/libshm/core.cpp?fbclid=IwAR0DG3o68svdVDUkMCbb-0KM95IzxpsAeWS27m57fWAx84su9stZbsa3H_4#L25
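
To confirm which sharing strategy is actually active on a given runner, something like the sketch below could be run as a CI step. It only uses torch.multiprocessing.get_sharing_strategy and get_all_sharing_strategies; the helper name report_sharing_strategy is made up for illustration, and the torch import is guarded so the script still runs where torch is not installed.

```python
import sys


def report_sharing_strategy():
    """Return the active and available tensor sharing strategies.

    Hypothetical helper for illustration. Returns None when torch is
    not installed in the current environment.
    """
    try:
        import torch.multiprocessing as mp
    except ImportError:
        return None
    return {
        "platform": sys.platform,
        # Strategy currently used to share CPU tensors between processes.
        "current": mp.get_sharing_strategy(),
        # Strategies supported on this platform.
        "available": sorted(mp.get_all_sharing_strategies()),
    }


if __name__ == "__main__":
    print(report_sharing_strategy())
```

If the macOS runner reports file_system as the only available strategy, that would be consistent with the torch_shm_manager processes showing up in the cleanup logs.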


andrewkho avatar May 09 '24 23:05 andrewkho

It seems like on MacOS, multiprocessing fork behaves more like a spawn and requires importing all the modules again. Something about increasing the total number of worker subprocesses in the test causes massive slowdowns in cleanup. The simplest thing to do at this point is to shard the tests. I'll probably give this a shot tomorrow.
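
A rough sketch of what sharding could look like, assuming test ids are split across a fixed number of CI jobs by a stable hash. The helper names (shard_for, select_shard) and the test ids are hypothetical and not taken from the repo. The main block also prints the default multiprocessing start method, which is spawn on macOS since Python 3.8 and would explain the re-importing of modules in every worker.

```python
import multiprocessing
import zlib


def shard_for(test_id: str, num_shards: int) -> int:
    """Map a test id to a shard index using a hash that is stable across runs."""
    # zlib.crc32 is deterministic, unlike the built-in hash() for strings.
    return zlib.crc32(test_id.encode("utf-8")) % num_shards


def select_shard(test_ids, shard_index: int, num_shards: int):
    """Return the subset of test ids assigned to the given shard."""
    return [t for t in test_ids if shard_for(t, num_shards) == shard_index]


if __name__ == "__main__":
    # Diagnostic: shows whether workers are forked or spawned by default.
    print("default start method:", multiprocessing.get_start_method())

    # Hypothetical test ids, for illustration only.
    tests = [f"test_state_dict.py::test_case_{i}" for i in range(8)]
    for shard in range(2):
        print(shard, select_shard(tests, shard, 2))
```

Each CI job would then run only its own shard, so fewer worker subprocesses are alive in any one job at shutdown.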

andrewkho avatar May 10 '24 02:05 andrewkho