YOLOX

I get the following error when I use multiple GPUs, but there is no problem with a single GPU. The error occurred after the first epoch finished training and validation was about to start.

Open LxxxxK opened this issue 2 years ago • 2 comments

2023-02-17 19:56:58.588 | INFO     | yolox.core.trainer:after_iter:256 - epoch: 1/300, iter: 7390/7393, gpu mem: 21324Mb, mem: 229.9Gb, iter_time: 0.305s, data_time: 0.038s, total_loss: 11.5, iou_loss: 3.5, l1_loss: 0.0, conf_loss: 5.2, cls_loss: 2.8, lr: 9.992e-05, size: 672, ETA: 4 days, 13:20:30
2023-02-17 19:56:58.842 | INFO     | yolox.core.trainer:after_train:196 - Training of experiment is done and the best AP is 0.00
2023-02-17 19:56:58.843 | ERROR    | yolox.core.launch:_distributed_worker:147 - An error has been caught in function '_distributed_worker', process 'ForkProcess-1' (1556755), thread 'MainThread' (139905071462208):
Traceback (most recent call last):

File "/home/anaconda3/envs/yoloxx/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': None, '__package__': 'yolox.tools', '__loader__': <_frozen_importlib_external.SourceFileL...
           │         └ <code object <module> at 0x7f93570dc3a0, file "/data/private/codes/YOLOX/tools/train.py", line 5>
           └ <function _run_code at 0x7f935712fe50>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': None, '__package__': 'yolox.tools', '__loader__': <_frozen_importlib_external.SourceFileL...
         └ <code object <module> at 0x7f93570dc3a0, file "/data/private/codes/YOLOX/tools/train.py", line 5>

  File "/data/private/codes/YOLOX/tools/train.py", line 137, in <module>
    launch(
    └ <function launch at 0x7f923bf79ca0>

  File "/data/private/codes/YOLOX/yolox/core/launch.py", line 82, in launch
    mp.start_processes(
    │  └ <function start_processes at 0x7f923cd0fd30>
    └ <module 'torch.multiprocessing' from '/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/multiprocessing...

  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
    process.start()
    │       └ <function BaseProcess.start at 0x7f9356eb5310>
    └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
    │    │        │    │      └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
    │    │        │    └ <staticmethod object at 0x7f9356f21c40>
    │    │        └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
    │    └ None
    └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
           │     └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
           └ <class 'multiprocessing.popen_fork.Popen'>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
    │    │       └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
    │    └ <function Popen._launch at 0x7f9189a48940>
    └ <multiprocessing.popen_fork.Popen object at 0x7f922e7f2730>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
           │           │                          └ 8
           │           └ <function BaseProcess._bootstrap at 0x7f9356eb5c10>
           └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f9356eb5280>
    └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
    │    │        │    └ (<function _distributed_worker at 0x7f923bf79dc0>, 0, (<function main at 0x7f922f47f1f0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:5...
    │    │        └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
    │    └ <function _wrap at 0x7f923ccfa5e0>
    └ <ForkProcess name='ForkProcess-1' parent=1689143 started>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
    │  │   └ (<function main at 0x7f922f47f1f0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:57425', (╒═══════════════════╤═════════════════════════...
    │  └ 0
    └ <function _distributed_worker at 0x7f923bf79dc0>

> File "/data/private/codes/YOLOX/yolox/core/launch.py", line 147, in _distributed_worker
    main_func(*args)
    │          └ (╒═══════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════...
    └ <function main at 0x7f922f47f1f0>

  File "/data/private/codes/YOLOX/tools/train.py", line 118, in main
    trainer.train()
    │       └ <function Trainer.train at 0x7f922e7f4310>
    └ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>

  File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 77, in train
    self.train_in_epoch()
    │    └ <function Trainer.train_in_epoch at 0x7f922e7f4af0>
    └ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>

  File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 86, in train_in_epoch
    self.train_in_iter()
    │    └ <function Trainer.train_in_iter at 0x7f922e7f4b80>
    └ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>

  File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 92, in train_in_iter
    self.train_one_iter()
    │    └ <function Trainer.train_one_iter at 0x7f922e7f4c10>
    └ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>

  File "/data/private/codes/YOLOX/yolox/core/trainer.py", line 98, in train_one_iter
    inps, targets = self.prefetcher.next()
                    │    │          └ <function DataPrefetcher.next at 0x7f922e8a61f0>
                    │    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
                    └ <yolox.core.trainer.Trainer object at 0x7f9189a72dc0>

  File "/data/private/codes/YOLOX/yolox/data/data_prefetcher.py", line 43, in next
    self.preload()
    │    └ <function DataPrefetcher.preload at 0x7f922e8a6160>
    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>

  File "/data/private/codes/YOLOX/yolox/data/data_prefetcher.py", line 25, in preload
    self.next_input, self.next_target, _, _ = next(self.loader)
    │    │           │    │                        │    └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f9192a07df0>
    │    │           │    │                        └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
    │    │           │    └ tensor([[[ 43.0000, 416.3666,  92.1444,  59.7350,  63.5518],
    │    │           │               [ 47.0000, 484.5535,  46.3023,  67.1408,  61.3815],
    │    │           │         ...
    │    │           └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>
    │    └ tensor([[[[ 57.,  57.,  57.,  ..., 109., 118., 147.],
    │                [ 57.,  57.,  57.,  ..., 109., 124., 148.],
    │                [ 57., ...
    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0x7f9197b46250>

  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
           │    └ <function _MultiProcessingDataLoaderIter._next_data at 0x7f923c5f4c10>
           └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f9192a07df0>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
           │    │             └ <torch._utils.ExceptionWrapper object at 0x7f918bad0940>
           │    └ <function _MultiProcessingDataLoaderIter._process_data at 0x7f923c5f4d30>
           └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f9192a07df0>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
    │    └ <function ExceptionWrapper.reraise at 0x7f93564ad0d0>
    └ <torch._utils.ExceptionWrapper object at 0x7f918bad0940>
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
          └ IndexError('Caught IndexError in DataLoader worker process 0.\nOriginal Traceback (most recent call last):\n  File "/home/lix...

IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/anaconda3/envs/yoloxx/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/private/codes/YOLOX/yolox/data/datasets/datasets_wrapper.py", line 110, in wrapper
    ret_val = getitem_fn(self, index)
  File "/data/private//codes/YOLOX/yolox/data/datasets/mosaicdetection.py", line 93, in __getitem__
    img, _labels, _, img_id = self._dataset.pull_item(index)
  File "/data/private/codes/YOLOX/yolox/data/datasets/coco.py", line 225, in pull_item
    id_ = self.ids[index]
IndexError: list index out of range

LxxxxK, Feb 17 '23 13:02

From your log:

  File "/data/private/codes/YOLOX/yolox/data/datasets/coco.py", line 225, in pull_item
    id_ = self.ids[index]
IndexError: list index out of range

This might be caused by incorrectly formatted data: the requested index is past the end of self.ids.

Code to check your dataloader (prototype):

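from yolox.exp import get_exp

# Placeholders: substitute your own exp file and batch size.
exp = get_exp("path/to/your_exp.py", None)
dataloader = exp.get_data_loader(batch_size=8, is_distributed=False)  # YOLOX names this get_data_loader
dataloader_iter = iter(dataloader)
while True:  # raises on the first sample that cannot be loaded
    next(dataloader_iter)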
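
A variant that reports which indices fail, rather than stopping at the first exception, could look like the sketch below. It is only an illustration: find_bad_indices and ToyDataset are hypothetical names, not part of YOLOX, and the toy class merely imitates the self.ids[index] lookup shown in the traceback by claiming more items than its ids list holds.

```python
from typing import Callable, Iterable, List


def find_bad_indices(get_item: Callable[[int], object], indices: Iterable[int]) -> List[int]:
    """Return every index for which get_item raises IndexError."""
    bad = []
    for i in indices:
        try:
            get_item(i)
        except IndexError:
            bad.append(i)
    return bad


class ToyDataset:
    """Hypothetical stand-in: reports more items than its ids list holds."""

    def __init__(self, ids, reported_len):
        self.ids = ids
        self.reported_len = reported_len

    def __len__(self):
        return self.reported_len

    def pull_item(self, index):
        # Same lookup pattern as the failing line in the traceback.
        return self.ids[index]


toy = ToyDataset(ids=[101, 102, 103], reported_len=5)
bad = find_bad_indices(toy.pull_item, range(len(toy)))
print(bad)  # [3, 4]
```

On a real dataset object the equivalent call would be find_bad_indices(dataset.pull_item, range(len(dataset))). An empty result would suggest that the out-of-range index is produced somewhere else, for example by the sampler, rather than by the annotation data itself.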

FateScript, Feb 28 '23 08:02

@FateScript @buzhiqimeiliuqiangdong @natelowry @nihui How do I train with multiple GPUs? Is there a parameter for it?

jaideep11061982, Jul 19 '23 09:07