Mask_RCNN

BrokenPipeError: [Errno 32] Broken pipe arises half-way when training on 2 GPUs

Open YesDargo opened this issue 6 years ago • 14 comments

Similar to coco.py, I modified shape.ipynb to train on my own dataset. The first training stage on the heads completed, but the BrokenPipeError below arises halfway through the first epoch of training on the "4+" layers. The info is below; I can train on a single GPU without this error.

There are 2k images in total and the config is as below. Is there a problem with STEPS_PER_EPOCH or VALIDATION_STEPS?

GPU_COUNT = 2
IMAGES_PER_GPU = 5
STEPS_PER_EPOCH = 200
VALIDATION_STEPS = 200
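For context, a minimal sketch of how these values would typically appear in a Config subclass, assuming the repo layout shown in the traceback (config.py next to model.py; newer versions import from mrcnn.config). The class name, NAME, and NUM_CLASSES are placeholders:

```python
from config import Config  # or `from mrcnn.config import Config` in newer versions

class MyConfig(Config):
    NAME = "my_dataset"        # placeholder name
    GPU_COUNT = 2
    IMAGES_PER_GPU = 5         # effective batch size = GPU_COUNT * IMAGES_PER_GPU = 10
    STEPS_PER_EPOCH = 200
    VALIDATION_STEPS = 200
    NUM_CLASSES = 1 + 1        # background + 1 class (placeholder)

config = MyConfig()
config.display()
```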

Epoch 42/120
133/200 [==================>...........] - ETA: 3:13 - loss: 0.2841 - rpn_class_loss: 0.0012 - rpn_bbox_loss: 0.0529 - mrcnn_class_loss: 0.0424 - mrcnn_bbox_loss: 0.0377 - mrcnn_mask_loss: 0.150

Traceback (most recent call last):

134/200 [===================>..........] - ETA: 3:12 - loss: 0.2848 - rpn_class_loss: 0.0012 - rpn_bbox_loss: 0.0531 - mrcnn_class_loss: 0.0426 - mrcnn_bbox_loss: 0.0376 - mrcnn_mask_loss: 0.1502
---------------------------------------------------------------------------
BrokenPipeError                           Traceback (most recent call last)
<ipython-input-10-63fa505ed006> in <module>()
      4             learning_rate=config.LEARNING_RATE / 10,
      5             epochs=120,
----> 6             layers="4+")

~/Mask_RCNN/model.py in train(self, train_dataset, val_dataset, learning_rate, epochs, layers)
   2236             max_queue_size=100,
   2237             workers=workers,
-> 2238             use_multiprocessing=True,
   2239         )
   2240         self.epoch = max(self.epoch, epochs)

~/miniconda3/envs/tf3/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name +
     90                               '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~/miniconda3/envs/tf3/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2210                 batch_index = 0
   2211                 while steps_done < steps_per_epoch:
-> 2212                     generator_output = next(output_generator)
   2213 
   2214                     if not hasattr(generator_output, '__len__'):

~/miniconda3/envs/tf3/lib/python3.6/site-packages/keras/utils/data_utils.py in get(self)
    758         """
    759         while self.is_running():
--> 760             if not self.queue.empty():
    761                 success, value = self.queue.get()
    762                 # Rethrow any exceptions found in the queue

<string> in empty(self, *args, **kwds)

~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/managers.py in _callmethod(self, methodname, args, kwds)
    754             conn = self._tls.connection
    755 
--> 756         conn.send((self._id, methodname, args, kwds))
    757         kind, result = conn.recv()
    758 

~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/connection.py in send(self, obj)
    204         self._check_closed()
    205         self._check_writable()
--> 206         self._send_bytes(_ForkingPickler.dumps(obj))
    207 
    208     def recv_bytes(self, maxlength=None):

~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/connection.py in _send_bytes(self, buf)
    402             # Also note we want to avoid sending a 0-length buffer separately,
    403             # to avoid "broken pipe" errors if the other end closed the pipe.
--> 404             self._send(header + buf)
    405 
    406     def _recv_bytes(self, maxsize=None):

~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/connection.py in _send(self, buf, write)
    366         remaining = len(buf)
    367         while True:
--> 368             n = write(self._handle, buf)
    369             remaining -= n
    370             if remaining == 0:

BrokenPipeError: [Errno 32] Broken pipe

YesDargo avatar Feb 21 '18 03:02 YesDargo

Hey @Xyza1972 - can I ask how you prepared your own data? What tool did you use to create the labels and masks in VOC format?

billybee avatar Feb 22 '18 06:02 billybee

I am doing a binary task and each image only has one instance. I think the data preparation is fine, since I can train successfully using one GPU, and training on 2 GPUs also runs for some steps before the broken pipe error occurs.

The training is on a server that I connect to via SSH.

YesDargo avatar Feb 23 '18 01:02 YesDargo

I'm getting this too! Same error at the exact same point in training.

Used pycococreator for the data. I have some very high-resolution images (up to 6k) and 3+1 classes.

Training in nvidia-docker on GCE with 4 GPUs (P100s), 8 CPUs, and 37 GB RAM.

Switching from 4 GPUs to 1 seems to fix the issue.

austinmw avatar May 11 '18 00:05 austinmw

I think this issue is critical, can we put some effort to fix it?

orestis-z avatar May 18 '18 23:05 orestis-z

Do you have any solution for this? I got the same problem, tried some suggested fixes, and still have this error:

File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send self._send_bytes(_ForkingPickler.dumps(obj)) File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes self._send(header + buf) File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

I also tried setting the parameters as below, but still get this error:

IMAGES_PER_GPU = 1
USE_MINI_MASK = False

This problem seems to occur when I increase the size of the dataset.

derelearnro avatar May 29 '18 12:05 derelearnro

Is there any resolution to this? I'm getting the same error.

Gan-Tu avatar Jun 30 '18 07:06 Gan-Tu

Yes, I solved it. Reduce your IMAGES_PER_GPU as much as you can. It works fine for me now.

derelearnro avatar Jul 01 '18 10:07 derelearnro

I am getting the same issue using multiple GPUs. I have tried setting IMAGES_PER_GPU to 1 and USE_MINI_MASK to False; neither solves the BrokenPipe problem. I am using Keras 2.1.3 and Python 3.6 in Anaconda on Ubuntu 16.04, with a GeForce GTX graphics card and 8 CPUs.

JiangHao3 avatar Jul 16 '19 18:07 JiangHao3

I am getting the same issue using multiple GPUs. I have tried setting IMAGES_PER_GPU to 1 and USE_MINI_MASK to False; neither solves the BrokenPipe problem. I am using Keras 2.1.3 and Python 3.6 in Anaconda on Ubuntu 16.04, with a GeForce GTX graphics card and 8 CPUs.

You can try training on a smaller dataset, or use BACKBONE = "resnet50" instead of the default BACKBONE = "resnet101".
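If it helps, a minimal sketch of that change in a Config subclass (the class name, NAME, and NUM_CLASSES are placeholders; BACKBONE is a standard Mask_RCNN config attribute whose default is "resnet101"):

```python
from mrcnn.config import Config

class LighterConfig(Config):
    NAME = "my_dataset_r50"    # placeholder
    BACKBONE = "resnet50"      # smaller backbone than the default "resnet101"
    NUM_CLASSES = 1 + 1        # placeholder: background + 1 class
```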

truongtd6285 avatar Jan 15 '20 02:01 truongtd6285

I am facing the same issue, but I am training on the CPU.

shradhakarnawat avatar Feb 05 '20 13:02 shradhakarnawat

I have the same issue. I have Nvidia 1080 Tis and a 2080 Ti. The problem only occurs when I use the 2080 Ti for training; the same code works perfectly fine on a 1080 Ti. I figure this has to do with library, CUDA, and other compatibility issues. For now I've abandoned using the 2080 Ti and will try to reinstall all the libraries when I get a chance.

AloshkaD avatar Feb 26 '20 15:02 AloshkaD

I am facing the same issue, but I am training on the CPU.

Me too. I use one GPU, a generator to create the dataset, and model.fit to train the model. After some epochs, it raises this error.

murdockhou avatar Aug 03 '20 06:08 murdockhou

I tried setting use_multiprocessing to False and it works, but it is really slow. I was testing on Ubuntu 20 with a 1070 Ti and an AMD 3700X.
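For anyone looking for where to make that change: a sketch of the fit_generator call inside Mask_RCNN's model.py train() (the call visible around line 2238 of the traceback above). Only the arguments relevant to the workaround are shown; variable names are as in the repository and the remaining arguments stay untouched, so exact details may differ by version.

```python
# Inside MaskRCNN.train() in model.py -- sketch only.
self.keras_model.fit_generator(
    train_generator,
    # ... existing arguments unchanged ...
    max_queue_size=10,           # smaller queue than the hard-coded 100
    workers=1,
    use_multiprocessing=False,   # avoids the multiprocessing manager pipe,
                                 # at the cost of slower data loading
)
```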

SYLin117 avatar Aug 15 '21 09:08 SYLin117

I get the same issue with just 1 GPU when using multiprocessing. It seems to only come up when my data generator is using too many compute resources. Things that help (see the sketch after this list):

  1. Reduce the batch size.
  2. Reduce the number of workers.
  3. Reduce the max queue size.
  4. Keep any processing in the data generator simple.
  5. Don't use multiprocessing (very slow).
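For reference, a minimal, self-contained sketch (with a hypothetical dummy model and generator) showing where each of these knobs lives in the Keras 2.x fit_generator API:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def simple_generator(batch_size=2):          # 1. smaller batch size
    while True:
        x = np.random.rand(batch_size, 4)    # 4. keep generator work simple
        y = np.random.rand(batch_size, 1)
        yield x, y

model = Sequential([Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")

model.fit_generator(
    simple_generator(),
    steps_per_epoch=10,
    epochs=1,
    workers=2,                   # 2. fewer workers
    max_queue_size=4,            # 3. smaller queue
    use_multiprocessing=True,    # 5. or set to False to avoid the pipe entirely
)
```

Note that when use_multiprocessing=True, Keras recommends feeding a keras.utils.Sequence rather than a plain Python generator, which also avoids duplicated batches across workers.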

alexberian avatar Aug 31 '22 00:08 alexberian