
Known issue: setting forkserver mode in multiprocessing module is needed for parallel data load from HDFS

Open belltailjp opened this issue 4 years ago • 3 comments

Deep learning frameworks support multi-process data loading, for example the num_workers option of DataLoader in PyTorch or MultiprocessIterator in Chainer. They use the multiprocessing module to launch worker processes, which uses fork by default on Linux. With PFIO, if an HDFS connection is established before the fork, the connection state is also copied to the child processes. These copied connections are eventually destroyed when one of the workers completes its work (in the PyTorch DataLoader this happens at the end of each epoch). The remaining worker processes still need to talk to HDFS, but since the connection has been closed unexpectedly and outside their control, they break.

As far as I know, the actual error message or symptom that users see varies with the situation (for example a freeze, or a strange error such as RuntimeError: threads can only be started once), which makes troubleshooting even more difficult.

The workaround for this issue is to set the multiprocessing module to forkserver mode before accessing HDFS for the first time. For a similar reason (to prevent the MPI context from being broken after fork), the ChainerCV and Chainer examples apply the same workaround, and it works for the PFIO+HDFS case too. https://github.com/chainer/chainercv/blob/master/examples/classification/train_imagenet_multi.py#L96-L100 https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/examples/chainermn/imagenet/train_imagenet.py#L145-L148
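
For reference, a minimal sketch of the workaround described above, using only the standard multiprocessing API. The helper name use_forkserver is made up for illustration and is not part of PFIO; the point is only that the start method must be selected before anything opens an HDFS connection.

```python
import multiprocessing


def use_forkserver(force=False):
    """Select the 'forkserver' start method (hypothetical helper).

    Call this at the very top of the entry point, before any HDFS
    connection is opened, so that workers are not forked from a
    process that already holds connection state.
    """
    # force=True overrides a start method that was already chosen;
    # without it, a second call raises RuntimeError.
    multiprocessing.set_start_method("forkserver", force=force)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print(use_forkserver())
    # Only after this point: open HDFS through PFIO, build the
    # dataset, and create the DataLoader / MultiprocessIterator.
```

Note that with forkserver (as with spawn) the objects handed to worker processes have to be picklable, so a dataset that holds an open HDFS handle may need to open it lazily inside the worker instead.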

belltailjp avatar May 07 '20 09:05 belltailjp

related issue: #81

belltailjp avatar May 07 '20 09:05 belltailjp

The V2 API introduces proactive fork detection: before entering PyArrow functions it checks the process ID, and when a fork is detected it raises an exception by default. With the vanilla Hdfs() class, developers can now catch fork-after-HDFS-init as a bug, then fix their code by introducing forkserver. What do you think?
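
To illustrate the idea of a process-ID check, here is a rough sketch. It is not the actual PFIO implementation: the class PidGuardedHandle, the exception ForkedError, and the read() placeholder are all hypothetical names chosen for this example.

```python
import os


class ForkedError(RuntimeError):
    """Raised when a handle is used outside the process that created it."""


class PidGuardedHandle:
    """Hypothetical wrapper that remembers the PID it was created in."""

    def __init__(self):
        # Record the creating process; a forked child gets a copy of
        # this attribute but reports a different os.getpid().
        self._pid = os.getpid()

    def _check_fork(self):
        if os.getpid() != self._pid:
            raise ForkedError(
                "handle created in pid %d but used in pid %d; "
                "consider the forkserver start method"
                % (self._pid, os.getpid())
            )

    def read(self):
        # Check before touching the underlying (here: imaginary) connection.
        self._check_fork()
        return b"data"  # placeholder for a real HDFS read
```

Failing fast like this turns a hard-to-diagnose freeze in a worker into an explicit error at the first call made after the fork.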

kuenishi avatar Mar 05 '21 09:03 kuenishi

An example of the process ID check: https://github.com/pfnet/pfio/pull/151/files#diff-4e49c0f20764e59a31322473b893e889d1163bc77c47758e50c11107f878d498R149

kuenishi avatar Mar 05 '21 09:03 kuenishi