
MPRS blocks indefinitely

Open sehoffmann opened this issue 1 year ago • 3 comments

🐛 Describe the bug

  1. Create a pipeline and use MPRS with 'spawn' (the same behavior may also occur with 'fork').
  2. Have the worker process fail to start due to an error during module import (other scenarios may also trigger this).
  3. The main process now blocks indefinitely in [x for x in dataloader].
  4. The process can only be terminated by sending a KeyboardInterrupt.

Solution: Add a timeout parameter to the MPRS constructor and make

File "torchdata/dataloader2/communication/protocol.py", line 104, in get_new_request
    response = self.request_queue.get(block=block)

fail after that timeout period.
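
A minimal sketch of the proposed timeout behavior, using only the standard library. The helper name `get_with_timeout` and its parameters are hypothetical and not part of torchdata; the sketch only assumes that the request queue follows the standard `get(block, timeout)` interface, which raises `queue.Empty` when the timeout expires.

```python
import queue


def get_with_timeout(request_queue, timeout):
    """Return the next item, or raise TimeoutError if none arrives in time.

    Hypothetical helper: `request_queue` is any queue whose get() follows
    the standard-library interface (queue.Queue, multiprocessing.Queue).
    """
    try:
        # Block for at most `timeout` seconds instead of indefinitely.
        return request_queue.get(block=True, timeout=timeout)
    except queue.Empty:
        # Surface a clear error instead of hanging when no worker responds.
        raise TimeoutError(
            f"no message received within {timeout} seconds"
        ) from None
```

With a wrapper like this, a worker that never started would surface as a `TimeoutError` in the main process rather than an endless block.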

Versions

https://github.com/pytorch/data/commit/e78ab6c9ec94f05f0a350ced7fe571f6863c20ec

sehoffmann avatar Mar 24 '23 16:03 sehoffmann

Thanks for reporting this.

Do you have a simple script to re-create this behavior?

NivekT avatar Mar 24 '23 19:03 NivekT

Unfortunately, no. The import errors in the worker processes are due to my environment being screwed up, and import order actually matters in that regard.

Unfortunately, I currently don't have the capacity to create a minimal reproducible example for you. All I can do right now is report my observations.

sehoffmann avatar Mar 24 '23 19:03 sehoffmann

I suppose having a module that checks, during import, whether the current process is a child process and if so throws an error would be sufficient to provoke this.
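
A minimal sketch of such a check, assuming Python 3.8+ where `multiprocessing.parent_process()` returns `None` in the main process. The function names and the idea of placing the call at the top level of a hypothetical `faulty_module.py` are illustrative only, and this has not been confirmed to reproduce the hang.

```python
import multiprocessing


def is_child_process() -> bool:
    """Return True if this process was started by multiprocessing."""
    # parent_process() is None in the main process (Python 3.8+).
    return multiprocessing.parent_process() is not None


def raise_if_child() -> None:
    """Simulate a worker-side import failure."""
    if is_child_process():
        raise ImportError("simulated import failure in worker process")


# In the hypothetical faulty_module.py, calling raise_if_child() at module
# top level would make the import succeed in the main process but fail in
# every worker process that imports the module.
```

Importing that module from code that the worker processes need to load should then make worker startup fail while the main process keeps running.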

sehoffmann avatar Mar 24 '23 20:03 sehoffmann