Mask_RCNN
BrokenPipeError: [Errno 32] Broken pipe arises half-way when training on 2 GPUs
Similar to coco.py, I modified shape.ipynb to train on my own dataset. The first training stage on the "heads" layers completed, but the BrokenPipeError arises half-way through the first epoch when training the "4+" layers. The traceback is below; I can train on a single GPU without this error.
There are 2k images in total, and the config is as follows. Is there a problem with STEPS_PER_EPOCH or VALIDATION_STEPS?
GPU_COUNT = 2
IMAGES_PER_GPU = 5
STEPS_PER_EPOCH = 200
VALIDATION_STEPS = 200
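For reference, this is roughly how that configuration looks as a Mask_RCNN Config subclass. This is a minimal sketch assuming the older repo layout (Config importable from config.py at the repo root, matching the ~/Mask_RCNN/model.py path in the traceback below); the class name and NAME value are hypothetical.

# Minimal sketch of the configuration above; class name and NAME are hypothetical.
from config import Config

class MyDatasetConfig(Config):
    NAME = "my_dataset"
    GPU_COUNT = 2
    IMAGES_PER_GPU = 5      # effective batch size = GPU_COUNT * IMAGES_PER_GPU = 10
    STEPS_PER_EPOCH = 200
    VALIDATION_STEPS = 200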
Epoch 42/120
133/200 [==================>...........] - ETA: 3:13 - loss: 0.2841 - rpn_class_loss: 0.0012 - rpn_bbox_loss: 0.0529 - mrcnn_class_loss: 0.0424 - mrcnn_bbox_loss: 0.0377 - mrcnn_mask_loss: 0.150
Traceback (most recent call last):
134/200 [===================>..........] - ETA: 3:12 - loss: 0.2848 - rpn_class_loss: 0.0012 - rpn_bbox_loss: 0.0531 - mrcnn_class_loss: 0.0426 - mrcnn_bbox_loss: 0.0376 - mrcnn_mask_loss: 0.1502
---------------------------------------------------------------------------
BrokenPipeError Traceback (most recent call last)
<ipython-input-10-63fa505ed006> in <module>()
4 learning_rate=config.LEARNING_RATE / 10,
5 epochs=120,
----> 6 layers="4+")
~/Mask_RCNN/model.py in train(self, train_dataset, val_dataset, learning_rate, epochs, layers)
2236 max_queue_size=100,
2237 workers=workers,
-> 2238 use_multiprocessing=True,
2239 )
2240 self.epoch = max(self.epoch, epochs)
~/miniconda3/envs/tf3/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your `' + object_name +
90 '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper
~/miniconda3/envs/tf3/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
2210 batch_index = 0
2211 while steps_done < steps_per_epoch:
-> 2212 generator_output = next(output_generator)
2213
2214 if not hasattr(generator_output, '__len__'):
~/miniconda3/envs/tf3/lib/python3.6/site-packages/keras/utils/data_utils.py in get(self)
758 """
759 while self.is_running():
--> 760 if not self.queue.empty():
761 success, value = self.queue.get()
762 # Rethrow any exceptions found in the queue
<string> in empty(self, *args, **kwds)
~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/managers.py in _callmethod(self, methodname, args, kwds)
754 conn = self._tls.connection
755
--> 756 conn.send((self._id, methodname, args, kwds))
757 kind, result = conn.recv()
758
~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/connection.py in send(self, obj)
204 self._check_closed()
205 self._check_writable()
--> 206 self._send_bytes(_ForkingPickler.dumps(obj))
207
208 def recv_bytes(self, maxlength=None):
~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/connection.py in _send_bytes(self, buf)
402 # Also note we want to avoid sending a 0-length buffer separately,
403 # to avoid "broken pipe" errors if the other end closed the pipe.
--> 404 self._send(header + buf)
405
406 def _recv_bytes(self, maxsize=None):
~/miniconda3/envs/tf3/lib/python3.6/multiprocessing/connection.py in _send(self, buf, write)
366 remaining = len(buf)
367 while True:
--> 368 n = write(self._handle, buf)
369 remaining -= n
370 if remaining == 0:
BrokenPipeError: [Errno 32] Broken pipe
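For context, the two training stages look roughly like the sketch below. Only the second call is confirmed by the traceback (learning_rate=config.LEARNING_RATE / 10, epochs=120, layers="4+"); the variable names and the "heads" stage values are assumptions based on the usual shapes notebook.

# Stage 1: train only the randomly initialized head layers (this completed fine).
# The epochs value here is illustrative, not taken from the traceback.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers="heads")

# Stage 2: fine-tune ResNet stage 4 and up -- the broken pipe happens here.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=120,
            layers="4+")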
Hey @Xyza1972 - can I ask how you prepared your own data? What tool did you use to create the labels and masks in the VOC format?
I am doing a binary task and each image has only one instance. I think the data preparation is fine, since I can train successfully on one GPU; on 2 GPUs it also trains for some steps before the broken pipe error occurs.
The training runs on a server that I connect to via SSH.
I'm getting this too! Same error at the exact same point in training.
Used pycococreator for the data. Have some very high resolution images (up to 6k), 3+1 classes.
Training in a nvidia-docker on GCE with 4 GPUs (P100's), 8 CPU, 37GB ram.
Switching from 4 GPUs to 1 seems to fix the issue.
I think this issue is critical; can we put some effort into fixing it?
Do you have any solution? I ran into the same problem, tried some of the suggested fixes, and still get this error:
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send self._send_bytes(_ForkingPickler.dumps(obj)) File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes self._send(header + buf) File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe
I also tried setting the parameters below, but the error persists:
IMAGES_PER_GPU = 1
USE_MINI_MASK = False
The problem seems to occur when I increase the size of the dataset.
Is there any resolution to this? I'm getting the same error.
Yes, I solved it. Reduce your IMAGES_PER_GPU as much as you can. It works fine for me now.
I am getting the same issue using multiple GPUs. I have tried setting IMAGES_PER_GPU to 1 and USE_MINI_MASK = False; neither solves the BrokenPipeError. I am using Keras 2.1.3 and Python 3.6 in Anaconda on Ubuntu 16.04, with a GeForce GTX graphics card and 8 CPUs.
You can try training on a smaller dataset, or use BACKBONE = "resnet50" instead of the default BACKBONE = "resnet101".
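As a sketch, that backbone swap is a one-line config change; the subclass name here is hypothetical and inherits from the example config sketched earlier in the thread. ResNet-50 reduces memory and compute compared with the ResNet-101 default.

class SmallerBackboneConfig(MyDatasetConfig):   # hypothetical subclass
    BACKBONE = "resnet50"                       # default is "resnet101"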
I am facing the same issue, but I am using CPU to train.
I have the same issue. I have Nvidia 1080 Tis and a 2080 Ti. The problem only occurs when I use the 2080 Ti for training; the same code works perfectly fine on a 1080 Ti. I figure this has to do with libraries, CUDA, and other compatibility issues. For now I've abandoned the 2080 Ti and will try reinstalling all the libraries when I get a chance.
Me too. I use one GPU, with a generator to create the dataset and model.fit to train the model. After some epochs, it raises this error.
I tried setting use_multiprocessing to False and it works, but it's really slow. I was testing on Ubuntu 20 with a 1070 Ti and an AMD 3700X.
I get the same issue with just 1 GPU when using multiprocessing. It seems to only come up when my data generator is using too many compute resources. Things that help (see the sketch after this list):
- Reduce the batch size.
- Reduce the number of workers.
- Reduce the max queue size.
- Keep any processing in the data generator simple.
- Don't use multiprocessing (very slow).
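To make those knobs concrete, here is a small, self-contained Keras sketch showing where they go in a fit_generator call. Everything here (the toy model, generator, data, and chosen values) is hypothetical and only illustrates the parameters; it is not the Mask_RCNN training code itself.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import Sequence

class SimpleGenerator(Sequence):
    """Keep per-batch work light: plain slicing, no heavy augmentation."""
    def __init__(self, x, y, batch_size=4):      # smaller batch size
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]

# Toy data and model, just to make the example runnable.
x, y = np.random.rand(100, 8), np.random.rand(100, 1)
model = Sequential([Dense(16, activation="relu", input_shape=(8,)), Dense(1)])
model.compile(optimizer="adam", loss="mse")

gen = SimpleGenerator(x, y, batch_size=4)
model.fit_generator(
    gen,
    steps_per_epoch=len(gen),
    epochs=2,
    workers=2,                  # fewer worker processes
    max_queue_size=10,          # smaller prefetch queue
    use_multiprocessing=False,  # threads instead of processes: slower, but avoids the broken pipe
)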