
ERROR | yolox.core.launch:90 - An error has been caught in function 'launch', process 'MainProcess'

lida2003 opened this issue 1 year ago • 4 comments

I got the error below on a Jetson Orin Nano board. Has anyone run into this before?

daniel@daniel-nvidia:~/Work/ByteTrack$ python3 tools/train.py -f exps/example/mot/yolox_x_ablation.py -d 1 -b 48 --fp16 -o -c pretrained/yolox_x.pth
2024-12-11 11:37:59 | INFO     | yolox.core.trainer:126 - args: Namespace(batch_size=48, ckpt='pretrained/yolox_x.pth', devices=1, dist_backend='nccl', dist_url=None, exp_file='exps/example/mot/yolox_x_ablation.py', experiment_name='yolox_x_ablation', fp16=True, local_rank=0, machine_rank=0, name=None, num_machines=1, occupy=True, opts=[], resume=False, start_epoch=None)
2024-12-11 11:37:59 | INFO     | yolox.core.trainer:127 - exp value:
╒══════════════════╤════════════════════╕
│ keys             │ values             │
╞══════════════════╪════════════════════╡
│ seed             │ None               │
├──────────────────┼────────────────────┤
│ output_dir       │ './YOLOX_outputs'  │
├──────────────────┼────────────────────┤
│ print_interval   │ 20                 │
├──────────────────┼────────────────────┤
│ eval_interval    │ 5                  │
├──────────────────┼────────────────────┤
│ num_classes      │ 1                  │
├──────────────────┼────────────────────┤
│ depth            │ 1.33               │
├──────────────────┼────────────────────┤
│ width            │ 1.25               │
├──────────────────┼────────────────────┤
│ data_num_workers │ 4                  │
├──────────────────┼────────────────────┤
│ input_size       │ (800, 1440)        │
├──────────────────┼────────────────────┤
│ random_size      │ (18, 32)           │
├──────────────────┼────────────────────┤
│ train_ann        │ 'train.json'       │
├──────────────────┼────────────────────┤
│ val_ann          │ 'val_half.json'    │
├──────────────────┼────────────────────┤
│ degrees          │ 10.0               │
├──────────────────┼────────────────────┤
│ translate        │ 0.1                │
├──────────────────┼────────────────────┤
│ scale            │ (0.1, 2)           │
├──────────────────┼────────────────────┤
│ mscale           │ (0.8, 1.6)         │
├──────────────────┼────────────────────┤
│ shear            │ 2.0                │
├──────────────────┼────────────────────┤
│ perspective      │ 0.0                │
├──────────────────┼────────────────────┤
│ enable_mixup     │ True               │
├──────────────────┼────────────────────┤
│ warmup_epochs    │ 1                  │
├──────────────────┼────────────────────┤
│ max_epoch        │ 80                 │
├──────────────────┼────────────────────┤
│ warmup_lr        │ 0                  │
├──────────────────┼────────────────────┤
│ basic_lr_per_img │ 1.5625e-05         │
├──────────────────┼────────────────────┤
│ scheduler        │ 'yoloxwarmcos'     │
├──────────────────┼────────────────────┤
│ no_aug_epochs    │ 10                 │
├──────────────────┼────────────────────┤
│ min_lr_ratio     │ 0.05               │
├──────────────────┼────────────────────┤
│ ema              │ True               │
├──────────────────┼────────────────────┤
│ weight_decay     │ 0.0005             │
├──────────────────┼────────────────────┤
│ momentum         │ 0.9                │
├──────────────────┼────────────────────┤
│ exp_name         │ 'yolox_x_ablation' │
├──────────────────┼────────────────────┤
│ test_size        │ (800, 1440)        │
├──────────────────┼────────────────────┤
│ test_conf        │ 0.1                │
├──────────────────┼────────────────────┤
│ nmsthre          │ 0.7                │
╘══════════════════╧════════════════════╛
/home/daniel/.local/lib/python3.8/site-packages/torch/functional.py:505: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3490.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2024-12-11 11:38:02 | INFO     | yolox.core.trainer:132 - Model Summary: Params: 99.00M, Gflops: 793.21
2024-12-11 11:38:06 | INFO     | yolox.core.trainer:291 - loading checkpoint for fine tuning
2024-12-11 11:38:09 | WARNING  | yolox.utils.checkpoint:25 - Shape of head.cls_preds.0.weight in checkpoint is torch.Size([80, 320, 1, 1]), while shape of head.cls_preds.0.weight in model is torch.Size([1, 320, 1, 1]).
2024-12-11 11:38:09 | WARNING  | yolox.utils.checkpoint:25 - Shape of head.cls_preds.0.bias in checkpoint is torch.Size([80]), while shape of head.cls_preds.0.bias in model is torch.Size([1]).
2024-12-11 11:38:09 | WARNING  | yolox.utils.checkpoint:25 - Shape of head.cls_preds.1.weight in checkpoint is torch.Size([80, 320, 1, 1]), while shape of head.cls_preds.1.weight in model is torch.Size([1, 320, 1, 1]).
2024-12-11 11:38:09 | WARNING  | yolox.utils.checkpoint:25 - Shape of head.cls_preds.1.bias in checkpoint is torch.Size([80]), while shape of head.cls_preds.1.bias in model is torch.Size([1]).
2024-12-11 11:38:09 | WARNING  | yolox.utils.checkpoint:25 - Shape of head.cls_preds.2.weight in checkpoint is torch.Size([80, 320, 1, 1]), while shape of head.cls_preds.2.weight in model is torch.Size([1, 320, 1, 1]).
2024-12-11 11:38:09 | WARNING  | yolox.utils.checkpoint:25 - Shape of head.cls_preds.2.bias in checkpoint is torch.Size([80]), while shape of head.cls_preds.2.bias in model is torch.Size([1]).
2024-12-11 11:38:09 | INFO     | yolox.data.datasets.mot:39 - loading annotations into memory...
2024-12-11 11:38:16 | INFO     | yolox.data.datasets.mot:39 - Done (t=6.44s)
2024-12-11 11:38:16 | INFO     | pycocotools.coco:88 - creating index...
2024-12-11 11:38:16 | INFO     | pycocotools.coco:88 - index created!
2024-12-11 11:38:20 | INFO     | yolox.core.trainer:150 - init prefetcher, this might take one minute or less...
2024-12-11 11:38:51 | ERROR    | yolox.core.launch:90 - An error has been caught in function 'launch', process 'MainProcess' (78890), thread 'MainThread' (281473143934992):
Traceback (most recent call last):

  File "/home/daniel/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1135, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
           │    │           │           └ 5.0
           │    │           └ <function Queue.get at 0xffff2415faf0>
           │    └ <queue.Queue object at 0xfffea80c62e0>
           └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0xfffea8091d30>
  File "/usr/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
    │    │         │    └ 4.999996639999154
    │    │         └ <function Condition.wait at 0xffff91dd10d0>
    │    └ <Condition(<unlocked _thread.lock object at 0xfffea80c6420>, 0)>
    └ <queue.Queue object at 0xfffea80c62e0>
  File "/usr/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
            │      │             └ 4.999996639999154
            │      └ <method 'acquire' of '_thread.lock' objects>
            └ <locked _thread.lock object at 0xfffed003a390>
  File "/home/daniel/.local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
    └ <built-in function _error_if_any_worker_fails>

RuntimeError: DataLoader worker (pid 79134) is killed by signal: Killed.


The above exception was the direct cause of the following exception:


Traceback (most recent call last):

  File "tools/train.py", line 114, in <module>
    launch(
    └ <function launch at 0xffff21e639d0>

> File "/home/daniel/Work/ByteTrack/yolox/core/launch.py", line 90, in launch
    main_func(*args)
    │          └ (╒══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...
    └ <function main at 0xffff22f4d160>

  File "tools/train.py", line 100, in main
    trainer.train()
    │       └ <function Trainer.train at 0xfffed70d8940>
    └ <yolox.core.trainer.Trainer object at 0xffff22f4efa0>

  File "/home/daniel/Work/ByteTrack/yolox/core/trainer.py", line 70, in train
    self.before_train()
    │    └ <function Trainer.before_train at 0xffff22f21e50>
    └ <yolox.core.trainer.Trainer object at 0xffff22f4efa0>

  File "/home/daniel/Work/ByteTrack/yolox/core/trainer.py", line 151, in before_train
    self.prefetcher = DataPrefetcher(self.train_loader)
    │                 │              │    └ <yolox.data.dataloading.DataLoader object at 0xfffea8091dc0>
    │                 │              └ <yolox.core.trainer.Trainer object at 0xffff22f4efa0>
    │                 └ <class 'yolox.data.data_prefetcher.DataPrefetcher'>
    └ <yolox.core.trainer.Trainer object at 0xffff22f4efa0>

  File "/home/daniel/Work/ByteTrack/yolox/data/data_prefetcher.py", line 26, in __init__
    self.preload()
    │    └ <function DataPrefetcher.preload at 0xfffed70d1e50>
    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0xfffea80917f0>

  File "/home/daniel/Work/ByteTrack/yolox/data/data_prefetcher.py", line 30, in preload
    self.next_input, self.next_target, _, _ = next(self.loader)
    │                │                             │    └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0xfffea8091d30>
    │                │                             └ <yolox.data.data_prefetcher.DataPrefetcher object at 0xfffea80917f0>
    │                └ <yolox.data.data_prefetcher.DataPrefetcher object at 0xfffea80917f0>
    └ <yolox.data.data_prefetcher.DataPrefetcher object at 0xfffea80917f0>

  File "/home/daniel/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
           │    └ <function _MultiProcessingDataLoaderIter._next_data at 0xffff240741f0>
           └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0xfffea8091d30>
  File "/home/daniel/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1331, in _next_data
    idx, data = self._get_data()
                │    └ <function _MultiProcessingDataLoaderIter._get_data at 0xffff24074160>
                └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0xfffea8091d30>
  File "/home/daniel/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1287, in _get_data
    success, data = self._try_get_data()
    │               │    └ <function _MultiProcessingDataLoaderIter._try_get_data at 0xffff240740d0>
    │               └ <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0xfffea8091d30>
    └ False
  File "/home/daniel/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1148, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
                                                                                  └ '79134'

RuntimeError: DataLoader worker (pid(s) 79134) exited unexpectedly

lida2003 avatar Dec 11 '24 03:12 lida2003
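
For context: a DataLoader worker dying with "killed by signal: Killed" almost always means the Linux OOM killer terminated it, which is plausible on an Orin Nano (4 or 8 GB shared between CPU and GPU) given -b 48, data_num_workers 4, 800x1440 mosaic augmentation, and -o pre-occupying GPU memory. A quick way to confirm is to check the kernel log right after the crash. Below is a small illustrative Python helper (not part of ByteTrack); it only shells out to dmesg and may need sudo on JetPack images where dmesg is restricted:

#!/usr/bin/env python3
"""Check whether the kernel OOM killer terminated the DataLoader worker."""
import subprocess

def oom_events():
    # dmesg -T prints human-readable timestamps; keep only OOM-killer lines.
    out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
    return [line for line in out.splitlines()
            if "Out of memory" in line or "oom-killer" in line]

if __name__ == "__main__":
    hits = oom_events()
    if hits:
        print("OOM killer activity found:")
        print("\n".join(hits[-10:]))  # show only the most recent events
    else:
        print("No OOM entries in dmesg; also try `journalctl -k | grep -i oom`.")

If OOM is confirmed, the usual mitigations are a smaller batch size (-b), fewer DataLoader workers (data_num_workers in the exp file), dropping -o, and/or adding swap as suggested further down in this thread.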

I've encountered the same error message: the program suddenly terminates partway through execution. Have you managed to resolve it? If so, could you share how you fixed it?

Eenchanted avatar Feb 21 '25 08:02 Eenchanted
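
Since the crash can also appear partway through training rather than only at prefetcher init, it helps to confirm memory pressure before changing anything. A minimal watcher sketch (illustrative, not part of ByteTrack; requires psutil, installable with pip): run it in a second terminal during training and see whether available memory trends toward zero just before the worker dies:

#!/usr/bin/env python3
"""Periodically log system memory while training runs in another terminal."""
import time
import psutil

INTERVAL_S = 5  # illustrative sampling period

while True:
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"available={vm.available / 2**30:.2f} GiB  "
          f"ram_used={vm.percent:.0f}%  swap_used={swap.percent:.0f}%",
          flush=True)
    time.sleep(INTERVAL_S)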

@Eenchanted The repo no longer seems to be maintained. We have solved some issues, but not this one on JP5.1.4; we are now trying JP6.2.

https://github.com/SnapDragonfly/ByteTrack FYI

lida2003 avatar Feb 21 '25 10:02 lida2003

I have the same issue. Has anyone solved it?

panagiotamoraiti avatar Apr 16 '25 15:04 panagiotamoraiti

@panagiotamoraiti Please check the links above, which might solve your issue. BTW, increasing your swap memory can also help.

lida2003 avatar Apr 17 '25 05:04 lida2003
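
For readers landing here: below is a minimal sketch of the swap workaround suggested above, assuming a Jetson running Ubuntu with an ext4 rootfs. The file path and size are illustrative, it must run as root, and it simply wraps the usual fallocate/mkswap/swapon sequence in Python:

#!/usr/bin/env python3
"""Create and enable a disk-backed swap file (run with sudo)."""
import subprocess

SWAPFILE = "/swapfile8g"  # illustrative path
SIZE_GB = 8               # illustrative size

for cmd in (
    ["fallocate", "-l", f"{SIZE_GB}G", SWAPFILE],  # reserve the file
    ["chmod", "600", SWAPFILE],                    # swap must not be world-readable
    ["mkswap", SWAPFILE],                          # format it as swap
    ["swapon", SWAPFILE],                          # enable it immediately
):
    subprocess.run(cmd, check=True)

print("Swap enabled; add an /etc/fstab entry to make it persistent across reboots.")

Note that many JetPack images enable zram by default, but zram is backed by the same physical RAM, so a disk-backed swap file like this is usually what actually relieves pressure during training.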