YOLOX
I get the following error when training on multiple GPUs; training on a single GPU works without problems. The error occurred once the first epoch had finished training and validation was about to start.
2023-02-17 19:56:58.588 | INFO | yolox.core.trainer:after_iter:256 - epoch: 1/300, iter: 7390/7393, gpu mem: 21324Mb, mem: 229.9Gb, iter_time: 0.305s, data_time: 0.038s, total_loss: 11.5, iou_loss: 3.5, l1_loss: 0.0, conf_loss: 5.2, cls_loss: 2.8, lr: 9.992e-05, size: 672, ETA: 4 days, 13:20:30
2023-02-17 19:56:58.842 | INFO | yolox.core.trainer:after_train:196 - Training of experiment is done and the best AP is 0.00
2023-02-17 19:56:58.843 | ERROR | yolox.core.launch:_distributed_worker:147 - An error has been caught in function '_distributed_worker', process 'ForkProcess-1' (1556755), thread 'MainThread' (139905071462208):
Traceback (most recent call last):
File "/home/anaconda3/envs/yoloxx/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'__name__': '__main__', '__doc__': None, '__package__': 'yolox.tools', '__loader__': <_frozen_importlib_external.SourceFileL...
│ └ <code object <module> at 0x7f93570dc3a0, file "/data/private/codes/YOLOX/tools/train.py", line 5>
└ <function _run_code at 0x7f935712fe50>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
│ └ {'__name__': '__main__', '__doc__': None, '__package__': 'yolox.tools', '__loader__': <_frozen_importlib_external.SourceFileL...
└ <code object <module> at 0x7f93570dc3a0, file "/data/private/codes/YOLOX/tools/train.py", line 5>
File "/data/private/codes/YOLOX/tools/train.py", line 137, in <module>
launch(
└ <function launch at 0x7f923bf79ca0>
File "/data/private/codes/YOLOX/yolox/core/launch.py", line 82, in launch
mp.start_processes(
│ └ <function start_processes at 0x7f923cd0fd30>
└ <module 'torch.multiprocessing' from '/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/multiprocessing...
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
process.start()
│ └ <function BaseProcess.start at 0x7f9356eb5310>
└ <ForkProcess name='ForkProcess-1' parent=1689143 started>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
│ │ │ │ └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
│ │ │ └ <staticmethod object at 0x7f9356f21c40>
│ │ └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
│ └ None
└ <ForkProcess name='ForkProcess-1' parent=1689143 started>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
│ └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
└ <class 'multiprocessing.popen_fork.Popen'>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
│ │ └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
│ └ <function Popen._launch at 0x7f9189a48940>
└ <multiprocessing.popen_fork.Popen object at 0x7f922e7f2730>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/popen_fork.py", line 71, in _launch
code = process_obj._bootstrap(parent_sentinel=child_r)
│ │ └ 8
│ └ <function BaseProcess._bootstrap at 0x7f9356eb5c10>
└ <ForkProcess name='ForkProcess-1' parent=1689143 started>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
│ └ <function BaseProcess.run at 0x7f9356eb5280>
└ <ForkProcess name='ForkProcess-1' parent=1689143 started>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
│ │ │ └ (<function _distributed_worker at 0x7f923bf79dc0>, 0, (<function main at 0x7f922f47f1f0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:5...
│ │ └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
│ └ <function _wrap at 0x7f923ccfa5e0>
└ <ForkProcess name='ForkProcess-1' parent=1689143 started>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
│ │ └ (<function main at 0x7f922f47f1f0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:57425', (╒═══════════════════╤═════════════════════════...
│ └ 0
└ <function _distributed_worker at 0x7f923bf79dc0>
> File "/data/private/codes/YOLOX/yolox/core/launch.py", line 147, in _distributed_worker
main_func(*args)
│ └ (╒═══════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════...
└ <function main at 0x7f922f47f1f0>
File "/data/private/codes/YOLOX/tools/train.py", line 118, in main
trainer.train()
│ └ <function Trainer.train at 0x7f922e7f4310>
└ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>
File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 77, in train
self.train_in_epoch()
│ └ <function Trainer.train_in_epoch at 0x7f922e7f4af0>
└ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>
File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 86, in train_in_epoch
self.train_in_iter()
│ └ <function Trainer.train_in_iter at 0x7f922e7f4b80>
└ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>
File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 92, in train_in_iter
self.train_one_iter()
│ └ <function Trainer.train_one_iter at 0x7f922e7f4c10>
└ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>
File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 98, in train_one_iter
inps, targets = self.prefetcher.next()
│ │ └ <function DataPrefetcher.next at 0x7f922e8a61f0>
│ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
└ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>
File "/data/private/codes/YOLOX/yolox/data/data_prefetcher.py", line 43, in next
self.preload()
│ └ <function DataPrefetcher.preload at 0x7f922e8a6160>
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
File "/data/private/codes/YOLOX/yolox/data/data_prefetcher.py", line 25, in preload
self.next_input, self.next_target, _, _ = next(self.loader)
│ │ │ │ │ └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f9192a07df0>
│ │ │ │ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
│ │ │ └ tensor([[[ 43.0000, 416.3666, 92.1444, 59.7350, 63.5518],
│ │ │ [ 47.0000, 484.5535, 46.3023, 67.1408, 61.3815],
│ │ │ ...
│ │ └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
│ └ tensor([[[[ 57., 57., 57., ..., 109., 118., 147.],
│ [ 57., 57., 57., ..., 109., 124., 148.],
│ [ 57., ...
└ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
│ └ <function _MultiProcessingDataLoaderIter._next_data at 0x7f923c5f4c10>
└ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f9192a07df0>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
│ │ └ <torch._utils.ExceptionWrapper object at 0x7f918bad0940>
│ └ <function _MultiProcessingDataLoaderIter._process_data at 0x7f923c5f4d30>
└ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f9192a07df0>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
│ └ <function ExceptionWrapper.reraise at 0x7f93564ad0d0>
└ <torch._utils.ExceptionWrapper object at 0x7f918bad0940>
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
raise exception
└ IndexError('Caught IndexError in DataLoader worker process 0.\nOriginal Traceback (most recent call last):\n File "/home/lix...
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/private/codes/YOLOX/yolox/data/datasets/datasets_wrapper.py", line 110, in wrapper
ret_val = getitem_fn(self, index)
File "/data/private//codes/YOLOX/yolox/data/datasets/mosaicdetection.py", line 93, in __getitem__
img, _labels, _, img_id = self._dataset.pull_item(index)
File "/data/private/codes/YOLOX/yolox/data/datasets/coco.py", line 225, in pull_item
id_ = self.ids[index]
IndexError: list index out of range
From your log:
File "/data/private/codes/YOLOX/yolox/data/datasets/coco.py", line 225, in pull_item
id_ = self.ids[index]
IndexError: list index out of range
It might be caused by a wrong data format: an index is being requested that is outside the range of `self.ids`.
Code to check your dataloader (prototype):
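# Prototype: `exp` is your experiment object; pass whatever arguments your loader method requires.
dataloader = exp.get_dataloader()
dataloader_iter = iter(dataloader)
# Keep pulling batches until the bad sample raises the IndexError.
while True:
    next(dataloader_iter)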
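If iterating the dataloader is slow or the worker traceback is hard to read, a narrower check is to call the dataset's item getter directly for every index it claims to have, in the main process. The following is only a minimal sketch under assumptions: `find_bad_indices` and `ToyDataset` are hypothetical names made up for illustration, and the toy class only mimics the `self.ids[index]` lookup seen in the traceback. With a real dataset you would pass its length and its `pull_item` method instead.

```python
from typing import Callable, List, Tuple


def find_bad_indices(
    num_items: int, fetch: Callable[[int], object]
) -> List[Tuple[int, str]]:
    """Call fetch(i) for every i in range(num_items) and collect the failures."""
    failures = []
    for i in range(num_items):
        try:
            fetch(i)
        except Exception as exc:  # keep going so every bad index is reported
            failures.append((i, f"{type(exc).__name__}: {exc}"))
    return failures


class ToyDataset:
    """Hypothetical stand-in: reports more items than its id list holds."""

    def __init__(self) -> None:
        self.ids = [10, 11, 12]

    def __len__(self) -> int:
        return 5  # deliberately wrong, to reproduce the IndexError

    def pull_item(self, index: int) -> int:
        return self.ids[index]  # same kind of lookup as in the traceback


toy = ToyDataset()
bad = find_bad_indices(len(toy), toy.pull_item)
print(bad)
```

Any index reported here points at a mismatch between the length the dataset reports and the number of entries it can actually serve, which is one way to narrow down a data format problem.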
@FateScript @buzhiqimeiliuqiangdong @natelowry @nihui How do I train with multiple GPUs? Is there a parameter for it?