pytorch-maml

train.py throws multiprocessing errors

agilebean opened this issue 3 years ago · 1 comment

Executing train.py in Google Colab throws two kinds of multiprocessing errors. It would be helpful to know whether these errors invalidate the results or are merely informative.

Command executed:

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 1 \
--step-size 0.1 \
--batch-size 4 \
--num-batches 16 \
--num-epochs 50 \
--num-workers 8 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

ERROR 1:

DEBUG:root:Creating folder `/content/sync/output/2021-01-02_113055`
INFO:root:Saving configuration file in `/content/sync/output/2021-01-02_113055/config.json`
Epoch 1 : 100% 16/16 [00:02<00:00,  5.71it/s, accuracy=0.2315, loss=5.4706]
Epoch 2 : 100% 16/16 [00:02<00:00,  5.67it/s, accuracy=0.2563, loss=3.1795]
Epoch 3 : 100% 16/16 [00:02<00:00,  5.60it/s, accuracy=0.2383, loss=2.8871]
Epoch 4 : 100% 16/16 [00:02<00:00,  5.72it/s, accuracy=0.2448, loss=2.7525]
Training: 100% 16/16 [00:03<00:00,  5.94it/s, accuracy=0.2433, loss=2.1583]Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ERROR 2:

Epoch 26: 100% 8/8 [00:01<00:00,  5.01it/s, accuracy=0.2888, loss=1.7093]
Epoch 27: 100% 8/8 [00:01<00:00,  5.07it/s, accuracy=0.2454, loss=1.7756]
Training: 100% 8/8 [00:01<00:00,  5.28it/s, accuracy=0.3267, loss=1.6709]Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)

agilebean · Jan 02 '21, 11:01

Torchmeta uses PyTorch's DataLoader under the hood for data loading, so this multiprocessing error must come from PyTorch's DataLoader. Unfortunately I don't know what could be causing it; it might be due to Google Colab and how it handles multiprocessing. It might also be related to using the synced folder, as in #14. One way to prevent this error is to run with --num-workers 0, but that will slow down the data-loading part.
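
For reference, the reported command with the suggested workaround applied would look like the following. This is only a sketch based on the command in the report: the single change is --num-workers 0, and all paths and other flags are assumed to stay the same.

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 1 \
--step-size 0.1 \
--batch-size 4 \
--num-batches 16 \
--num-epochs 50 \
--num-workers 0 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

With num_workers set to 0, data loading runs in the main process, so the worker-feeder pipes that raise BrokenPipeError are never created, at the cost of slower batch preparation.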

tristandeleu · Jan 03 '21, 15:01