
RuntimeError: DataLoader worker (pid(s) 197) exited

Open ladyxuxu opened this issue 2 years ago • 2 comments

Hi, I set self.data_num_workers = 4 and ran the following training command: python ${workspace}/train.py -f ${train_data_dir}/yolox_voc_s.py -d 1 -b 8 -c ${weights_data_dir}/yolox_s.pth

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

2022-07-06 10:48:08 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (36), thread 'MainThread' (139940411148096):

Traceback (most recent call last):

............

File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in next

data = self._next_data()

│ └ <function _MultiProcessingDataLoaderIter._next_data at 0x7f457ba23b70>

└ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f4574422550>

File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data

idx, data = self._get_data()

│ │ └ <function _MultiProcessingDataLoaderIter._get_data at 0x7f457ba23ae8>

│ └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f4574422550>

└ 5

File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1142, in _get_data

success, data = self._try_get_data()

│ │ └ <function _MultiProcessingDataLoaderIter._try_get_data at 0x7f457ba23a60>

│ └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f4574422550>

└ False

File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data

raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e

└ '197'

RuntimeError: DataLoader worker (pid(s) 197) exited unexpectedly

When I change it from 4 to 0, training proceeds, but it is very slow: 2022-07-06 11:00:13 | INFO | yolox.core.trainer:261 - epoch: 8/500, iter: 8730/13381, mem: 1960Mb, iter_time: 1.418s, data_time: 1.261s, total_loss: 4.1, iou_loss: 2.0, l1_loss: 0.0, conf_loss: 1.2, cls_loss: 0.8, lr: 2.499e-03, size: 512, ETA: 94 days, 20:58:37

Versions: Python 3.6.9, PyTorch 1.10.1, CUDA 10.2, NVIDIA driver 460.56, GPU 2080 Ti, YOLOX 0.3.0

ladyxuxu avatar Jul 06 '22 03:07 ladyxuxu

Your log says: ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). Reducing data_num_workers in your exp might help. 2 is suggested, and 0 should be your last choice.

FateScript avatar Jul 06 '22 03:07 FateScript
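
The suggestion above amounts to a one-line override in the exp file. A minimal sketch, assuming the exp file subclasses YOLOX's base `Exp` from `yolox.exp`; the stub `BaseExp` fallback is hypothetical and only there so the snippet runs without YOLOX installed:

```python
try:
    from yolox.exp import Exp as BaseExp  # YOLOX base experiment class
except ImportError:
    # Fallback stub (an assumption for illustration only) so the sketch
    # also runs on a machine without YOLOX installed.
    class BaseExp:
        def __init__(self):
            self.data_num_workers = 4


class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        # Fewer DataLoader workers means fewer processes passing
        # batches through shared memory at the same time.
        self.data_num_workers = 2
```

If 2 workers still trigger the bus error, the shared memory limit itself is probably too small and lowering the worker count further only trades the crash for slow data loading.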

Are you running the code in Docker? If so, you may need to increase the container's shared memory (shm) setting.

laborer123 avatar Aug 05 '22 01:08 laborer123
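
To confirm whether shared memory is the bottleneck, it may help to check how large `/dev/shm` is inside the environment where training runs. The sketch below uses only the Python standard library; the helper name `shm_usage_gib` is made up for this example. Docker containers get a 64 MB `/dev/shm` by default, which can be raised when starting the container with `docker run --shm-size=<size>` or bypassed with `--ipc=host`.

```python
import os
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return (total, free) of the shared-memory mount in GiB.

    Returns None if the path does not exist (for example on
    platforms without a /dev/shm mount).
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


if __name__ == "__main__":
    result = shm_usage_gib()
    if result is None:
        print("/dev/shm not found on this system")
    else:
        total, free = result
        print(f"/dev/shm total: {total:.2f} GiB, free: {free:.2f} GiB")
```

A total of roughly 0.06 GiB would indicate the Docker default, which is easily exhausted by several DataLoader workers.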