YOLOv6
Training Failure
Before Asking
- [X] I have read the README carefully.
- [X] I want to train my custom dataset, and I have read the tutorials for training your custom data carefully and organized my dataset correctly. (FYI: We recommend applying the xx_finetune.py config files.)
- [X] I have pulled the latest code of the main branch and run again, and the problem still exists.
Search before asking
- [X] I have searched the YOLOv6 issues and found no similar questions.
Question
When I ran training for 10 epochs, I got this error at epoch 5:
Epoch iou_loss dfl_loss cls_loss
0%| | 0/1566 [00:00<?, ?it/s]
5/9 0.5002 0 1.839: 0%| | 0/1566 [00:00<?, ?it/
5/9 0.5002 0 1.839: 0%| | 1/1566 [00:00<10:14,
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:115: block: [29,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:115: block: [29,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
5/9 0.5002 0 1.839: 0%| | 1/1566 [00:01<26:59,
ERROR in training steps. ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 142, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\models\loss.py", line 112, in __call__
    loss_iou, loss_dfl = self.bbox_loss(pred_distri, pred_bboxes, anchor_points_s, target_bboxes,
  File "C:\Users\MohamedIMAM\anaconda3\envs\pytorch_env\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\models\loss.py", line 167, in forward
    if num_pos > 0:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\tools\train.py", line 126, in
Additional
No response
Sorry for the inconvenience, we will try to reproduce this problem and fix it.
I have the same problem.
(act): SiLU(inplace=True)
)
)
(cls_preds): ModuleList(
(0): Conv2d(32, 80, kernel_size=(1, 1), stride=(1, 1))
(1): Conv2d(64, 80, kernel_size=(1, 1), stride=(1, 1))
(2): Conv2d(128, 80, kernel_size=(1, 1), stride=(1, 1))
)
(reg_preds): ModuleList(
(0): Conv2d(32, 4, kernel_size=(1, 1), stride=(1, 1))
(1): Conv2d(64, 4, kernel_size=(1, 1), stride=(1, 1))
(2): Conv2d(128, 4, kernel_size=(1, 1), stride=(1, 1))
)
)
)
Training start...
Epoch iou_loss dfl_loss cls_loss
0%| | 0/14786 [00:00<?, ?it/s] /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
../aten/src/ATen/native/cuda/Loss.cu:118: operator(): block: [4,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:118: operator(): block: [4,0,0], thread: [33,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[... the same assertion repeats for threads [34,0,0] through [63,0,0] ...]
0%| | 0/14786 [00:01<?, ?it/s]
ERROR in training steps. ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 142, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num)
  File "/home/shuhong/work/YOLOv6/yolov6/models/loss.py", line 106, in __call__
    loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
  File "/home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shuhong/work/YOLOv6/yolov6/models/loss.py", line 149, in forward
    loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
  File "/home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/nn/functional.py", line 3083, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/train.py", line 126, in
@MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?
python tools/train.py --batch 16 --conf configs/yolov6n.py --data data/coco.yaml --epoch 400 --name yolov6n_coco
> @MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?

I just tried to train COCO with a single GPU: python tools/train.py --batch 16 --conf configs/yolov6n.py --data data/coco.yaml --epoch 400 --name yolov6n_coco
> @MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?

Suddenly, this problem disappeared. I don't know why.
> @MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?

!python tools/train.py --batch 1 --epochs 10 --conf configs/yolov6n_finetune.py --data data/dataset.yaml --device 0
This is the training command that I'm using: !python tools/train.py --batch 1 --epochs 10 --conf configs/yolov6n_finetune.py --data data/dataset.yaml --device 0
@MedImam It may be because the batch size you use is so small that it doesn't match the learning rate. Can you try increasing your batch size as much as possible and train again with the latest code from the main branch?
I have 7500 pictures as the dataset. I set batch_size to 16, workers to 2, and conf_file to ../configs/yolov6m_finetune.py. After training 400 epochs I found that mAP@0.5 only reached 0.45, so I changed lr0 to 0.12 and lrf to 0.0032 in the config. As before, once the epoch reached 6, the above error reliably occurred. It is worth mentioning that I wanted fast convergence at the beginning of training, so I swapped the values of lr0 and lrf, and the error reappeared after retraining.
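For reference, a minimal, hypothetical sketch of the learning-rate portion of such a config; the solver dict and the lr0/lrf fields are assumed from the comment above and from the repository's xx_finetune.py configs, and the remaining fields are omitted, so treat this as illustrative rather than the official file:

```python
# Hypothetical excerpt of a finetune config (e.g. configs/yolov6m_finetune.py).
# lr0 is the initial learning rate; lrf is the final-LR factor used by the
# cosine scheduler. Swapping them (lr0=0.12, lrf=0.0032) raises the starting
# LR by roughly 37x, which can make training diverge and push predicted
# scores out of [0, 1], triggering the binary_cross_entropy assert above.
solver = dict(
    optim='SGD',
    lr_scheduler='Cosine',
    lr0=0.0032,  # small initial LR intended for finetuning
    lrf=0.12,    # cosine-schedule final-LR factor
    # ... momentum, weight_decay, warmup settings, etc.
)
```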
This error also happens when I start the QAT distill training.
- Number of classes = 1
- Config: similar to configs/repopt/yolov6s_opt_qat.py
- Dataset:
  - Train: Final numbers of valid images: 64094 / labels: 64094.
  - Val: Final numbers of valid images: 2693 / labels: 2693.
- Command:
  CUDA_LAUNCH_BLOCKING=1 PYTHONWARNINGS="ignore" python tools/train.py --data data/custom-data.yaml --name yolov6s-repopt-custom-data-qat --conf configs/repopt/yolov6s_opt_qat-custom-data.py --quant --distill --distill_feat --batch 32 --workers 14 --epochs 10 --teacher_model_path runs/train/yolov6s-repopt-custom-data/weights/best_ckpt.pt --device 0 --check-images --check-labels
- Log:
0%| | 0/2003 [00:00<?, ?it/s] /opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[... the same assertion repeats for threads [2,0,0] through [31,0,0] ...]
0%| | 0/2003 [00:01<?, ?it/s]
WARNING: Logging before flag parsing goes to stderr.
E1020 02:01:00.629523 139757667850048 engine.py:116] ERROR in training steps.
E1020 02:01:00.629653 139757667850048 engine.py:103] ERROR in training loop or eval/save model.
Traceback (most recent call last):
File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 99, in train
self.train_in_loop(self.epoch)
File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 113, in train_in_loop
self.train_in_steps(epoch_num, self.step)
File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 139, in train_in_steps
total_loss, loss_items = self.compute_loss_distill(preds, t_preds, s_featmaps, t_featmaps, targets, \
File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/models/loss_distill.py", line 124, in __call__
loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/models/loss_distill.py", line 219, in forward
loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 3030, in binary_cross_entropy
return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/train.py", line 126, in <module>
main(args)
File "tools/train.py", line 116, in main
trainer.train()
File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 106, in train
self.train_after_loop()
File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 297, in train_after_loop
torch.cuda.empty_cache()
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/memory.py", line 114, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from createEvent at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f1b7319e1dc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe379dd (0x7f1b740649dd in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe3b426 (0x7f1b74068426 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x43d67c (0x7f1ba7d1367c in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f1b73187035 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x33a6c9 (0x7f1ba7c106c9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x646f92 (0x7f1ba7f1cf92 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2f5 (0x7f1ba7f1d315 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: __libc_start_main + 0xf3 (0x7f1bde2800b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
Quick update: the preds output in preds, s_featmaps = self.model(images) is a tensor of NaNs. Any idea how this might happen?
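In case it helps with debugging, here is a minimal sketch (assuming the training step calls preds, s_featmaps = self.model(images) as above; the helper name is hypothetical, not part of YOLOv6) that fails fast when the model output is no longer finite:

```python
import torch

def assert_finite(preds, step):
    """Fail fast with a readable error if the model emits NaN/Inf outputs."""
    # preds may be a single tensor or a tuple/list of tensors depending on the head.
    tensors = preds if isinstance(preds, (list, tuple)) else [preds]
    for i, t in enumerate(tensors):
        if torch.is_tensor(t) and not torch.isfinite(t).all():
            raise RuntimeError(f"Non-finite values in model output {i} at step {step}")

# Hypothetical placement inside the training step:
#   preds, s_featmaps = self.model(images)
#   assert_finite(preds, self.step)
```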
Did anyone solve this problem? I am getting the same error, which seems to come from the loss function: with CUDA_LAUNCH_BLOCKING=1 it shows the error coming from return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum), RuntimeError: CUDA error: device-side assert triggered. The weird thing is that if I use the CPU I do not get this error, but with the GPU I do. My training command is: python train.py --batch 8 --conf configs/yolov6s_finetune.py --data-path data/dataset.yaml --device 0 --epochs 2 --eval-interval 2. Note that I used different batch sizes and the error is still there.
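For what it's worth, the assertion in the logs (input_val >= zero && input_val <= one) comes from F.binary_cross_entropy, which requires its inputs (and targets) to lie in [0, 1]. A minimal, standalone repro of the same failure mode, plain PyTorch only and not YOLOv6 code:

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0.0, 1.0, 0.5])

# Valid: predictions inside [0, 1] compute a loss without complaint.
print(F.binary_cross_entropy(torch.tensor([0.1, 0.9, 0.5]), target))

# Invalid: a value outside [0, 1] (e.g. produced by a diverging model, or a NaN
# that propagated into the scores) is rejected. On CPU this raises a clear
# RuntimeError; on CUDA the same check surfaces as the device-side assert above.
try:
    F.binary_cross_entropy(torch.tensor([1.2, 0.9, 0.5]), target)
except RuntimeError as e:
    print("binary_cross_entropy rejected out-of-range input:", e)
```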
@AhmedShahhatAl I'm not sure about it, but I was able to solve it by using an extra class. My model should detect 1 object, but I added an extra class to make it 2-class object detection (see my related issue above). How many classes do you have?
@haritsahm I have 2 classes. Did you just modify the number of classes without adding a label for it, or how exactly did you do that?
@AhmedShahhatAl I didn't modify the code. I use real labels, since that affects the detection result.
I'm running a dual-gpu setup. I encountered this issue on my 1650 super but was able to successfully train on my 1070.
I'm following this guide: https://github.com/meituan/YOLOv6/blob/main/docs/tutorial_repopt.md
Maybe you set the wrong number of classes in dataset.yaml, or the length of the class names list does not equal the number of classes. After fixing that, my training script ran successfully.
Encountered the same issue, and I finally fixed it by adding target_scores = torch.clamp(target_scores, min=0, max=1) after line 81 in yolov6/assigners/tal_assigner.py. Took me a while to figure it out, but somehow the target_scores values were out of bounds for my training dataset.
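As an illustration (this is a workaround, not the repository's official fix), the idea is to clamp the assigner's output before it reaches the BCE-based varifocal loss. A hedged sketch, assuming target_scores is the score tensor returned by the task-aligned assigner; the helper name is hypothetical:

```python
import torch

def sanitize_target_scores(target_scores: torch.Tensor) -> torch.Tensor:
    """Force assigner scores into [0, 1] so F.binary_cross_entropy cannot hit
    the `input_val >= zero && input_val <= one` device-side assert.
    Note: this hides the symptom; bad labels or a too-large learning rate are
    the usual root cause and should still be checked."""
    return torch.clamp(torch.nan_to_num(target_scores, nan=0.0), min=0.0, max=1.0)

# Hypothetical use right after the assigner call in the loss computation:
#   target_labels, target_bboxes, target_scores, fg_mask = self.assigner(...)
#   target_scores = sanitize_target_scores(target_scores)
```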
I encounter the same error while training a YOLOv6l on 10k images, batch 20 or 32, on GPU. At the very same epoch (24) training fails and shows this error: Assertion `target_val >= zero && target_val <= one` failed.
24/499 0.001991 0.1656 0.2771 0.4584: 16%|█▌ | 81/500 [00:19<01:41, 4.11it/s
ERROR in training steps.
ERROR in training loop or eval/save model.
I thought it was due to some non-normalized bounding boxes, but I checked all of them with a script and they were all fine.
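In case it helps others rule this out, here is a small standalone checker along the same lines; it assumes YOLO-format .txt labels (class_id cx cy w h, coordinates normalized to [0, 1]) and is only a sketch, not part of YOLOv6:

```python
import glob
import sys

def check_yolo_labels(label_dir: str) -> int:
    """Print every label line whose box coordinates fall outside [0, 1]."""
    bad = 0
    for path in glob.glob(f"{label_dir}/*.txt"):
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                parts = line.split()
                if len(parts) != 5:
                    print(f"{path}:{lineno}: malformed line: {line.strip()}")
                    bad += 1
                    continue
                coords = [float(v) for v in parts[1:]]
                if any(c < 0.0 or c > 1.0 for c in coords):
                    print(f"{path}:{lineno}: coordinate out of [0, 1]: {line.strip()}")
                    bad += 1
    return bad

if __name__ == "__main__":
    # Usage: python check_labels.py path/to/labels/train
    sys.exit(1 if check_yolo_labels(sys.argv[1]) else 0)
```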
I encountered the same error. I SOLVED this problem. The following information may be helpful for you.
Problem Reproduction
When I used CUDA to train, I got this stack trace:
Traceback (most recent call last):
File "tools/train.py", line 145, in <module>
main(args)
File "tools/train.py", line 135, in main
trainer.train()
File "......\yolov6\core\engine.py", line 129, in train
self.train_after_loop()
File "......\yolov6\core\engine.py", line 358, in train_after_loop
torch.cuda.empty_cache()
File "......\lib\site-packages\torch\cuda\memory.py", line 125, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
Every time my training reached a particular epoch (say, epoch 4), this error occurred.
Then I tried training on the CPU to see the detailed error message. This way I learned that an IndexError caused the failure: one label in my dataset has class ID 33 (IDs start from 0, so the maximum should be 32) while I only have 33 classes.
My Solution
- Check your labels and classes. Make sure that you have the right number of classes and that every class ID is in range (a quick check is sketched after this list).
- Correct the mistakes you found above in your dataset.
- Clear the training cache completely.
- Restart your training.
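As a rough illustration of the first step (not part of the repository), assuming YOLO-format label files where the first field of each line is the class ID and nc is the class count from dataset.yaml:

```python
import glob

def find_bad_class_ids(label_dir: str, nc: int):
    """Yield (file, line_no, class_id) for any label whose class ID
    is outside the valid range 0 .. nc-1."""
    for path in glob.glob(f"{label_dir}/*.txt"):
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                fields = line.split()
                if not fields:
                    continue  # skip empty lines
                cls_id = int(float(fields[0]))
                if not 0 <= cls_id < nc:
                    yield path, lineno, cls_id

# Example: nc = 33 means valid IDs are 0..32; a stray ID of 33 is reported here.
for path, lineno, cls_id in find_bad_class_ids("labels/train", nc=33):
    print(f"{path}:{lineno}: class ID {cls_id} out of range")
```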