NaN reported in box_loss, cls_loss and dfl_loss when training a custom dataset
Search before asking
- [X] I have searched the YOLOv8 issues and found no similar bug report.
YOLOv8 Component
Training
Bug
Hello, I am a newbie in computer vision. I just started trying the new YOLOv8 and I get some errors in the training results. It seems like something is wrong, but I don't know how to fix it. Can you give me some suggestions?
Environment
- YOLOv8n
- CUDA: 11.6
- Ultralytics YOLOv8.0.4
- OS: Windows 10
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?
- [ ] Yes I'd like to help by submitting a PR!
@classico09 can you share your training command?
Here is my command: yolo task=detect mode=train model=yolov8n.yaml data="./data/dataset.yaml" epochs=100 batch=20 device='0' workers=4
@classico09 hi. Is your performance fine with YOLOv5? Can you run the same command with the v5loader=True flag to check? If that doesn't work, then your dataset might have a problem.
Thank you. I tried it but it still doesn't work. I tried with the YOLOv5 model from the yolov5 repository and it works, so I don't think it is because of the dataset.
Hello, I ran into the same issue. I successfully completed training in my YOLOv5 environment (mAP is 0.907). On top of that environment, I installed YOLOv8 with pip install ultralytics following the documentation (I want to train YOLOv8 and compare it against YOLOv5). https://github.com/ultralytics/ultralytics/issues/283
yolo task=init --config-name helmethyp.yaml --config-path /nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/
yolo task=detect mode=train model=yolov8n.yaml data=/nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/helmet640.yaml device=0 batch=20 workers=0 --config-name=helmethyp.yaml --config-path=/nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8
I also encountered the same problem. I found that it could be worked around by turning down the batch size, but I don't know why that is so. Training is then very slow and GPU utilization is very low.
I have the same/similar problem. When I run the same command with v5loader=True I get: KeyError: 'masks'. However, I can run the same dataset with v5loader=False, but I get very bad results (high loss, no predictions). I ran the same dataset with the yolov5 repository and I get good results.
Hi all. Your issue might've been solved in a PR by @Laughing-q and will be available in the package in a few hours.
@pepijnob @duynguyen1907 @jiyuwangbupt hey guys, can you try replacing the following line with self.optimizer.step() and restart training to check whether the losses are correct? Thanks!
https://github.com/ultralytics/ultralytics/blob/2bc36d97ce7f0bdc0018a783ba56d3de7f0c0518/ultralytics/yolo/engine/trainer.py#L410
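For anyone unsure what that edit looks like, here is a rough sketch, assuming (as the link suggests) that the referenced line is the GradScaler step inside the trainer's optimizer_step(); the surrounding method is reproduced from memory and may differ slightly between versions:

# ultralytics/yolo/engine/trainer.py -- sketch of the suggested debug edit (not verbatim)
def optimizer_step(self):
    self.scaler.unscale_(self.optimizer)  # unscale AMP-scaled gradients
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=10.0)  # clip gradients
    # self.scaler.step(self.optimizer)    # original AMP step (the line referenced above)
    self.optimizer.step()                 # suggested replacement: plain optimizer step, bypassing AMP scaling
    self.scaler.update()
    self.optimizer.zero_grad()
    if self.ema:
        self.ema.update(self.model)

If the losses stay finite with this change, that points at AMP/GradScaler rather than the data pipeline.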
@Laughing-q I tried your suggestion in version 8.0.4 and in 8.0.5, and both times my loss went to nan in the first epoch. When I just updated to 8.0.5 without your suggestion, I got the same as before, with the loss not going down (on the same dataset where YOLOv5 did work).
I am getting the same logs with the default command:
yolo task=detect mode=train model=yolov8s.pt batch=4
I mean, with coco128.yaml, just to do some testing, and the same results come out:
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/100 1.51G nan nan nan 71 640: 100%|██████████| 32/32 [00:36<00:00, 1.13s/it]
/home/henry/.local/bin/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 16/16 [00:08<00:00, 1.94it/s]
all 128 929 0 0 0 0
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
2/100 2.13G nan nan nan 51 640: 100%|██████████| 32/32 [00:34<00:00, 1.07s/it]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 16/16 [00:08<00:00, 1.92it/s]
all 128 929 0 0 0 0
I also met the same problem; cls_loss suddenly becomes NaN, and then all losses are NaN.
@nikbobo @hdnh2006 hi guys, may I ask if there are corrupt labels in your datasets?
Actually, I just found that we have a mismatch issue when there are corrupt labels in a dataset, so if you have corrupt labels then your issue should be solved by this PR: https://github.com/ultralytics/ultralytics/pull/460
The nan loss issue has been solved in PR #490, which we'll merge later today. :)
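If you are unsure whether your dataset contains corrupt labels, a quick standalone check is sketched below (assuming standard YOLO txt labels with one class x_center y_center width height row per object, coordinates normalized to 0-1; the labels_dir path is a placeholder for your own layout):

# check_labels.py -- rough sketch to flag obviously corrupt YOLO txt labels
from pathlib import Path

labels_dir = Path("./data/labels/train")  # placeholder: point this at your label folder

for label_file in sorted(labels_dir.glob("*.txt")):
    for line_no, line in enumerate(label_file.read_text().splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            print(f"{label_file}:{line_no}: expected 5 values, got {len(parts)}")
            continue
        try:
            cls = int(float(parts[0]))
            coords = [float(v) for v in parts[1:]]
        except ValueError:
            print(f"{label_file}:{line_no}: non-numeric value in {line!r}")
            continue
        if cls < 0 or any(v < 0 or v > 1 for v in coords):
            print(f"{label_file}:{line_no}: out-of-range value in {line!r}")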
@Laughing-q I am still getting nan in my training. It seems it is solved for validation:
After running pip install --upgrade ultralytics I get the following:
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: ultralytics in ********/.virtualenvs/ultralytics/lib/python3.8/site-packages (8.0.10)
My dataset is coco128.yaml, so I am just using the default parameters for testing.
@hdnh2006 it's not merged yet. The update will be available later today
@AyushExel Thanks Ayush, you are awesome as always!!
I have this issue even on the newest update:
$ yolo detect train data=coco128.yaml model=yolov8n.pt epochs=100 imgsz=640 batch=4
Ultralytics YOLOv8.0.11 🚀 Python-3.10.6 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650 Ti, 3912MiB)
yolo/engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=coco128.yaml, epochs=100, patience=50, batch=4, imgsz=640, save=True, cache=False, device=, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=False, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, retina_masks=False, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=17, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, hydra={'output_subdir': None, 'run': {'dir': '.'}}, v5loader=False, save_dir=runs/detect/train47
from n params module arguments
0 -1 1 464 ultralytics.nn.modules.Conv [3, 16, 3, 2]
1 -1 1 4672 ultralytics.nn.modules.Conv [16, 32, 3, 2]
2 -1 1 7360 ultralytics.nn.modules.C2f [32, 32, 1, True]
3 -1 1 18560 ultralytics.nn.modules.Conv [32, 64, 3, 2]
4 -1 2 49664 ultralytics.nn.modules.C2f [64, 64, 2, True]
5 -1 1 73984 ultralytics.nn.modules.Conv [64, 128, 3, 2]
6 -1 2 197632 ultralytics.nn.modules.C2f [128, 128, 2, True]
7 -1 1 295424 ultralytics.nn.modules.Conv [128, 256, 3, 2]
8 -1 1 460288 ultralytics.nn.modules.C2f [256, 256, 1, True]
9 -1 1 164608 ultralytics.nn.modules.SPPF [256, 256, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.Concat [1]
12 -1 1 148224 ultralytics.nn.modules.C2f [384, 128, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.Concat [1]
15 -1 1 37248 ultralytics.nn.modules.C2f [192, 64, 1]
16 -1 1 36992 ultralytics.nn.modules.Conv [64, 64, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.Concat [1]
18 -1 1 123648 ultralytics.nn.modules.C2f [192, 128, 1]
19 -1 1 147712 ultralytics.nn.modules.Conv [128, 128, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.Concat [1]
21 -1 1 493056 ultralytics.nn.modules.C2f [384, 256, 1]
22 [15, 18, 21] 1 897664 ultralytics.nn.modules.Detect [80, [64, 128, 256]]
Model summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs
Transferred 355/355 items from pretrained weights
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias
train: Scanning /home/karol/Projekty/yolov8/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning /home/karol/Projekty/yolov8/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs/detect/train47
Starting training for 100 epochs...
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/100 1.07G nan nan nan 71 640: 100%|██████████| 32/32 [00:10<00:00, 3.04it/s]
/home/karol/Projekty/yolov8/yolov8-venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 16/16 [00:02<00:00, 5.95it/s]
all 128 929 0.697 0.0679 0.0821 0.0503
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
2/100 1.69G nan nan nan 51 640: 100%|██████████| 32/32 [00:09<00:00, 3.47it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 16/16 [00:02<00:00, 6.10it/s]
all 128 929 0.672 0.0729 0.082 0.051
I tried NVIDIA drivers 525 (CUDA 12) and 470 (CUDA 11.4), the ultralytics Docker image, etc. My card is a GTX 1650 Ti Mobile on Ubuntu 22. I tried the generic command from the CLI and from Python (only lowering the batch size, because the card has only 4 GB of memory). It seems better when I lower the batch size to 1-3 or switch to the CPU.
Same problem with the latest version (8.0.17).
When lowering the batch size the losses seem to be working, but the model is not learning properly.
In my experience, it is sometimes fixed by replacing the dataset, updating PyTorch to the latest version, or changing the GPU type. It seems to be not a model problem but an AMP problem. If you don't want to try the above, you can disable AMP and use only FP32 by forcing everything to FP32 and disabling autocast.
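If you want to try that without editing the package, one workaround sketch is to patch PyTorch's autocast and GradScaler to be disabled before importing ultralytics. This is a blunt debugging hack, not an official option, and it assumes the trainer reaches AMP only through torch.cuda.amp:

# force_fp32_train.py -- workaround sketch: run training with AMP effectively disabled
import torch

_orig_autocast = torch.cuda.amp.autocast
_orig_scaler = torch.cuda.amp.GradScaler

def _fp32_autocast(*args, **kwargs):
    return _orig_autocast(enabled=False)  # always return a disabled autocast context

def _fp32_scaler(*args, **kwargs):
    return _orig_scaler(enabled=False)    # gradient scaling off, plain FP32 optimizer steps

torch.cuda.amp.autocast = _fp32_autocast
torch.cuda.amp.GradScaler = _fp32_scaler

from ultralytics import YOLO  # import after patching so the trainer sees the patched symbols

model = YOLO("yolov8n.yaml")
model.train(data="./data/dataset.yaml", epochs=100, batch=20, device=0, workers=4)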
You are right, the problem is with the GPU. It works on the 3090 and doesn't on the 1650.
Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24265MiB) -- Working
Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- Not working
This is totally true. I have an RTX 2060 Super and I get the following logs:
Meanwhile, on my laptop with a GTX 1650, the logs look like the following (with an important warning):
I have the same versions of PyTorch in both computers:
RTX2060:
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.13.1+cu117'
GTX1650:
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.13.1+cu117'
New important EDIT:
If I try the training on my laptop with the GTX 1650 but using the CPU, I don't get any nan values.
So clearly, there's a compatibility problem with this GPU.
I think this is the same problem as in yolov5: https://github.com/ultralytics/yolov5/issues/7908. I should check with CUDA 10, but that requires an older system.
I had the same errors in both YOLOv8 and YOLOv5, but I found a similar bug report for YOLOv5 that suggested disabling AMP with amp=False in train.py, which fixes box_loss and obj_loss becoming nan.
The other suggested fix, for validation not working, is that in train.py the in-training validation runs in half precision (half=amp in the validator call, val in that thread); force-assigning half=False fixed my problem for training on YOLOv5, and training has resumed as usual using CUDA 11.7 with an NVIDIA T1200 Laptop GPU (compute capability 7+).
It could be a problem with AMP and CUDA, since users in this thread have had issues with AMP on CUDA 11.x and solved them by reverting to CUDA 10.x.
Perhaps mirroring the fix found in that thread might help? I can't really find the equivalent variables to change in train.py, and was wondering where they were moved to in v8.
Thread for reference: https://github.com/ultralytics/yolov5/issues/7908
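For reference, the YOLOv5 edits described above look roughly like this in yolov5/train.py. This is a sketch from memory of that era's code; exact line positions and the surrounding arguments vary by version:

# yolov5/train.py -- sketch of the two AMP workarounds described above (not verbatim)

# 1) Force FP32 training instead of auto-detected AMP:
# amp = check_amp(model)   # original: enable AMP if the GPU passes the check
amp = False                # workaround: disable AMP entirely

# 2) Run the in-training validation in FP32 as well:
results, maps, _ = validate.run(data_dict,
                                batch_size=batch_size // WORLD_SIZE * 2,
                                imgsz=imgsz,
                                half=False,  # was: half=amp
                                model=ema.ema,
                                # ... remaining keyword arguments left as in the original call
                                )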
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
- Docs: https://docs.ultralytics.com
- HUB: https://hub.ultralytics.com
- Community: https://community.ultralytics.com
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
Is there any solution to run YOLOv8 or YOLOv5 on an NVIDIA GTX 1650? I am still facing the same error.
@mithilnettyfy hey there! Thank you for reaching out to us. We apologize for the inconvenience you have faced while training YOLOv8 on your NVIDIA GTX 1650 GPU.
The issue you are facing could be related to a compatibility issue with your GPU or with the use of Automatic Mixed Precision (AMP). We recommend trying the following solutions:
- Disable AMP by setting amp=False in train.py while training the YOLOv8 model.
- Force all variables to run in FP32 instead of using both FP16 and FP32 by disabling autocast. You can do this by setting autocast=False in train.py while training the model.
If neither of these solutions work, we recommend checking the compatibility of your GTX 1650 GPU with the CUDA version you are using. Some users have reported issues with AMP in CUDA 11.x and have solved the problem by reverting back to CUDA 10.x.
Please let us know if this helps resolve your issue or if you have any further questions.
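As a concrete reference for the first point, here is a minimal sketch using the Python API, assuming your installed ultralytics version exposes the amp training argument (check the Train settings in the docs for your version):

# sketch: train with Automatic Mixed Precision disabled
# CLI equivalent: yolo detect train data=coco128.yaml model=yolov8n.pt epochs=100 imgsz=640 batch=4 amp=False
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="coco128.yaml", epochs=100, imgsz=640, batch=4, amp=False)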
Hey @glenn-jocher, thank you so much for helping to resolve this issue. My program is working perfectly, but your second solution is not working. Could you please describe what exactly the second point means? There is no autocast argument:
https://docs.ultralytics.com/modes/train/#arguments
- Force all variables to run in FP32 instead of using both FP16 and FP32 by disabling autocast. You can do this by setting autocast=False in train.py while training the model.
Thank you in advance for your help. I appreciate it.
Still getting 0 values for Box(P R mAP50 mAP50-95).
Hello, I just faced the same problem on the AutoDL platform using YOLOv5. I solved it by cloning the latest version of YOLOv5 rather than using the YOLOv5 provided by the platform. I hope this tip helps.