
NaN reported in box_loss, cls_loss and dfl_loss when training on a custom dataset

Open duynguyen1907 opened this issue 2 years ago • 19 comments

Search before asking

  • [X] I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

Training

Bug

Hello, I am a newbie in computer vision. I just started trying the new YOLOv8 release and I get some errors when checking the results. It seems like something is wrong, but I don't know how to fix it. Can you give me some suggestions?


Environment

  • YOLOv8n
  • CUDA: 11.6
  • Ultralytics YOLOv8.0.4
  • OS: Windows 10

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • [ ] Yes I'd like to help by submitting a PR!

duynguyen1907 avatar Jan 12 '23 06:01 duynguyen1907

@classico09 can you share your training command?

Laughing-q avatar Jan 12 '23 06:01 Laughing-q

@classico09 can you share your training command?

Here is my command: yolo task=detect mode=train model=yolov8n.yaml data="./data/dataset.yaml" epochs=100 batch=20 device='0' workers=4
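
For reference, a roughly equivalent call through the Python API (a sketch only; it assumes the ultralytics YOLO class and mirrors the CLI flags above):

from ultralytics import YOLO

model = YOLO("yolov8n.yaml")  # build the model from the yaml, as in the CLI command
model.train(data="./data/dataset.yaml", epochs=100, batch=20, device=0, workers=4)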

duynguyen1907 avatar Jan 12 '23 06:01 duynguyen1907

@classico09 hi. Is your performance fine with YOLOv5? Can you run the same command with the v5loader=True flag to check? If that doesn't work, your dataset might have a problem.
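
For example, appending the flag to the training command shared above (same arguments, just with the v5loader flag added):

yolo task=detect mode=train model=yolov8n.yaml data="./data/dataset.yaml" epochs=100 batch=20 device='0' workers=4 v5loader=True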

AyushExel avatar Jan 12 '23 07:01 AyushExel

@classico09 hi. Is your performance fine with YOLOv5? Can you run the same command with the v5loader=True flag to check? If that doesn't work, your dataset might have a problem.

Thank you. I tried it but it still doesn't work. I also tried the YOLOv5 model from the yolov5 repository and it worked, so I don't think the dataset is the cause.

duynguyen1907 avatar Jan 12 '23 08:01 duynguyen1907

@classico09 hi. Is your performance fine with YOLOv5? Can you run the same command with the v5loader=True flag to check? If that doesn't work, your dataset might have a problem.

Hello, I have the same problem. I successfully completed training in the YOLOv5 environment (mAP is 0.907). On top of that YOLOv5 environment, I quickly installed the package with pip install ultralytics following the documentation (I want to train YOLOv8 and compare it against YOLOv5). https://github.com/ultralytics/ultralytics/issues/283

yolo task=init --config-name helmethyp.yaml --config-path /nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/
yolo task=detect mode=train model=yolov8n.yaml data=/nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/helmet640.yaml device=0 batch=20 workers=0 --config-name=helmethyp.yaml --config-path=/nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8

jiyuwangbupt avatar Jan 12 '23 09:01 jiyuwangbupt

I also encountered the same problem. I found that it could be worked around by lowering the batch size, but I don't know why. Training then becomes very slow and GPU utilization is very low.

M15-3080 avatar Jan 12 '23 09:01 M15-3080

@classico09 hi. Is your performance fine with YOLOv5? Can you run the same command with the v5loader=True flag to check? If that doesn't work, your dataset might have a problem.

I have the same/similar problem. When I run the same command with v5loader=True I get: KeyError: 'masks'. However, I can run the same dataset with v5loader=False but get very bad results (high loss/no predictions). I ran the same dataset with the YOLOv5 repository and got good results.

pepijnob avatar Jan 12 '23 13:01 pepijnob

Hi all. Your issue might've been solved in a PR by @Laughing-q and will be available in the package in a few hours.

AyushExel avatar Jan 12 '23 13:01 AyushExel

@pepijnob @duynguyen1907 @jiyuwangbupt hey guys, can you try replacing the following line with self.optimizer.step() and restart training to check whether the losses look correct? Thanks! https://github.com/ultralytics/ultralytics/blob/2bc36d97ce7f0bdc0018a783ba56d3de7f0c0518/ultralytics/yolo/engine/trainer.py#L410
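
For anyone trying this, here is a minimal PyTorch sketch of the two calls being compared; it is not the actual contents of ultralytics/yolo/engine/trainer.py, just an illustration of what the swap does:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 1).to(device)                   # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

loss = model(torch.randn(4, 8, device=device)).sum()
scaler.scale(loss).backward()

# AMP path: GradScaler.step() silently skips the optimizer step whenever it
# detects inf/NaN gradients, which is also what triggers the
# "lr_scheduler.step() before optimizer.step()" warning seen later in this thread.
scaler.step(optimizer)
scaler.update()

# Suggested experiment: call the optimizer directly instead, bypassing AMP scaling.
# optimizer.step()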

Laughing-q avatar Jan 13 '23 06:01 Laughing-q

@Laughing-q I tried your suggestion in version 8.0.4 and in 8.0.5, and both times my loss went to nan in the first epoch. When I just updated to 8.0.5 without your suggestion, I got the same result as before, with the loss not going down (on the same dataset where YOLOv5 did work).

pepijnob avatar Jan 13 '23 10:01 pepijnob

I am getting the same logs with the default command: yolo task=detect mode=train model=yolov8s.pt batch=4

I mean, with coco128.yaml, just to do some testing, and I get the same results:

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      1.51G        nan        nan        nan         71        640: 100%|██████████| 32/32 [00:36<00:00,  1.13s/it]
/home/henry/.local/bin/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:08<00:00,  1.94it/s]
                   all        128        929          0          0          0          0

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100      2.13G        nan        nan        nan         51        640: 100%|██████████| 32/32 [00:34<00:00,  1.07s/it]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:08<00:00,  1.92it/s]
                   all        128        929          0          0          0          0

hdnh2006 avatar Jan 13 '23 12:01 hdnh2006

I also have the same problem: cls_loss suddenly becomes NaN, and then all losses are NaN.

nikbobo avatar Jan 15 '23 07:01 nikbobo

@nikbobo @hdnh2006 hi guys, may I ask whether there are corrupt labels in your datasets? I just found that we get a mismatch issue if there are corrupt labels in the dataset, so if you have corrupt labels, your issue could be solved by this PR: https://github.com/ultralytics/ultralytics/pull/460

Laughing-q avatar Jan 19 '23 06:01 Laughing-q

The NaN loss issue has been solved in PR #490, which we'll merge later today. :)

Laughing-q avatar Jan 19 '23 07:01 Laughing-q

@Laughing-q I am still getting nan in my training. It seems it is only solved for validation (screenshot attached).

After running pip install --upgrade ultralytics I get the following:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: ultralytics in ********/.virtualenvs/ultralytics/lib/python3.8/site-packages (8.0.10)

hdnh2006 avatar Jan 19 '23 09:01 hdnh2006

@nikbobo @hdnh2006 hi guys, may I ask whether there are corrupt labels in your datasets? I just found that we get a mismatch issue if there are corrupt labels in the dataset, so if you have corrupt labels, your issue could be solved by this PR: #460

My dataset is coco128.yaml, so I am just using the default parameters for testing.

hdnh2006 avatar Jan 19 '23 09:01 hdnh2006

@hdnh2006 it's not merged yet. The update will be available later today

AyushExel avatar Jan 19 '23 09:01 AyushExel

@AyushExel Thanks Ayush, you are awesome as always!!

hdnh2006 avatar Jan 19 '23 09:01 hdnh2006

I have this issue even on the newest update.

$ yolo detect train data=coco128.yaml model=yolov8n.pt epochs=100 imgsz=640 batch=4
Ultralytics YOLOv8.0.11 🚀 Python-3.10.6 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650 Ti, 3912MiB)
yolo/engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=coco128.yaml, epochs=100, patience=50, batch=4, imgsz=640, save=True, cache=False, device=, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=False, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, retina_masks=False, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=17, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, hydra={'output_subdir': None, 'run': {'dir': '.'}}, v5loader=False, save_dir=runs/detect/train47

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.Conv                  [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.Conv                  [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.C2f                   [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.Conv                  [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.C2f                   [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.Conv                  [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.C2f                   [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.Conv                  [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.C2f                   [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.SPPF                  [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.Concat                [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.C2f                   [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.Concat                [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.C2f                   [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.Conv                  [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.Concat                [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.C2f                   [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.Conv                  [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.Concat                [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.C2f                   [384, 256, 1]                 
 22        [15, 18, 21]  1    897664  ultralytics.nn.modules.Detect                [80, [64, 128, 256]]          
Model summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs

Transferred 355/355 items from pretrained weights
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias
train: Scanning /home/karol/Projekty/yolov8/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning /home/karol/Projekty/yolov8/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs/detect/train47
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      1.07G        nan        nan        nan         71        640: 100%|██████████| 32/32 [00:10<00:00,  3.04it/s]
/home/karol/Projekty/yolov8/yolov8-venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:02<00:00,  5.95it/s]
                   all        128        929      0.697     0.0679     0.0821     0.0503

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100      1.69G        nan        nan        nan         51        640: 100%|██████████| 32/32 [00:09<00:00,  3.47it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:02<00:00,  6.10it/s]
                   all        128        929      0.672     0.0729      0.082      0.051

I tried NVIDIA drivers 525 (CUDA 12) and 470 (CUDA 11.4), the ultralytics Docker image, etc. My card is a GTX 1650 Ti Mobile on Ubuntu 22. I tried the generic command from the CLI and from Python (only lowering the batch size, because the card has only 4 GB of memory). It seems better when I lower the batch size to 1-3 or switch to the CPU.

kosmicznemuchomory123pl avatar Jan 20 '23 21:01 kosmicznemuchomory123pl

Same problem with the latest version (8.0.17)

When lowering the batch size the losses seem to work, but the model is not learning properly.

Batch 16

Batch 8

xyrod6 avatar Jan 23 '23 08:01 xyrod6

Same problem with the latest version (8.0.17)

When lowering the batch size the losses seem to work, but the model is not learning properly.

Batch 16

Batch 8

In my experiments, this was sometimes fixed by replacing the dataset, updating PyTorch to the latest version, or changing the GPU type. It seems to be not a model problem but an AMP problem. If you don't want to try the above, you can disable AMP and use only FP32 by forcing everything to FP32 and disabling autocast.
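
To make "force FP32 and disable autocast" concrete, here is a generic PyTorch sketch (not the actual Ultralytics trainer code): with use_amp=False both the autocast context and the GradScaler become no-ops, so everything runs in float32.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 1).to(device)                  # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

use_amp = False                                            # False -> pure FP32
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 16, device=device)                      # dummy batch
y = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):             # no-op context when disabled
    loss = torch.nn.functional.mse_loss(model(x), y)       # stays in float32
scaler.scale(loss).backward()                              # plain backward when disabled
scaler.step(optimizer)                                     # plain optimizer.step() when disabled
scaler.update()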

nikbobo avatar Jan 23 '23 09:01 nikbobo

Same problem with the latest version (8.0.17) When lowering the batch size the losses seem to work, but the model is not learning properly. Batch 16 Batch 8

In my experiments, this was sometimes fixed by replacing the dataset, updating PyTorch to the latest version, or changing the GPU type. It seems to be not a model problem but an AMP problem. If you don't want to try the above, you can disable AMP and use only FP32 by forcing everything to FP32 and disabling autocast.

You are right, the problem is with the GPU. It works on the 3090 and doesn't on the 1650. Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24265MiB) -- Working

Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- Not working

xyrod6 avatar Jan 23 '23 13:01 xyrod6

Same problem with the latest version (8.0.17) When lowering the batch size the losses seem to work, but the model is not learning properly. Batch 16 Batch 8

In my experiments, this was sometimes fixed by replacing the dataset, updating PyTorch to the latest version, or changing the GPU type. It seems to be not a model problem but an AMP problem. If you don't want to try the above, you can disable AMP and use only FP32 by forcing everything to FP32 and disabling autocast.

You are right, the problem is with the GPU. It works on the 3090 and doesn't on the 1650. Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24265MiB) -- Working

Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- Not working

This is totally true. I have an RTX 2060 Super and I get the following logs (screenshot attached).

Meanwhile, on my laptop with a GTX 1650, the logs look like the following, with an important warning (screenshot attached).

I have the same versions of PyTorch in both computers:

RTX2060:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
>>> torch.__version__
'1.13.1+cu117'

GTX1650:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
>>> torch.__version__
'1.13.1+cu117'

New important EDIT:

If I run the training on my laptop with the GTX 1650 but using the CPU, I don't get any NaN values (screenshot attached).

So clearly, there's a compatibility problem with this GPU.

hdnh2006 avatar Jan 23 '23 17:01 hdnh2006

I think this is the same problem as in YOLOv5: https://github.com/ultralytics/yolov5/issues/7908. It should be checked with CUDA 10, but that requires an older system.

kosmicznemuchomory123pl avatar Jan 23 '23 20:01 kosmicznemuchomory123pl

I had the same errors propagate to YOLOv8 and YOLOv5, but I found a similar bug report for YOLOv5 that suggested disabling AMP with amp=False in train.py, which fixes box_loss and obj_loss becoming nan.

The other suggested fix, for validation not working, was that in train.py validation uses half precision (half=amp is passed to the validator, val in that thread); force-assigning half=False fixed my problem for training on YOLOv5, and training has resumed as usual using CUDA 11.7 with an NVIDIA T1200 Laptop GPU (Compute Capability 7+).
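
A hedged sketch of the two train.py edits described above (exact line contents differ between YOLOv5 versions, so treat this as illustrative only):

# (1) Disable AMP for the training loop.
# Original (roughly):  amp = check_amp(model)   # auto-detect AMP support
amp = False  # force AMP off so the forward/backward passes stay in FP32

# (2) Run the in-training validation pass in FP32 as well.
# Original (roughly):  results, maps, _ = validate.run(..., half=amp, ...)
# Edit:                results, maps, _ = validate.run(..., half=False, ...)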

It could perhaps be a problem with AMP under CUDA, since I even saw users in this thread have an issue with AMP on CUDA 11.x that was solved when they reverted to CUDA 10.x.

Perhaps mirroring the fix found in that thread might help? I can't really find the equivalent variables to change in train.py, and I was wondering where they were moved to in v8.

Thread for reference: https://github.com/ultralytics/yolov5/issues/7908

Hridh0y avatar Jan 25 '23 18:01 Hridh0y

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

  • Docs: https://docs.ultralytics.com
  • HUB: https://hub.ultralytics.com
  • Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions[bot] avatar Mar 23 '23 00:03 github-actions[bot]

Is there any solution to run YOLOv8 or YOLOv5 on an NVIDIA GTX 1650?

I am still facing the same error.


mithilnettyfy avatar Jun 09 '23 05:06 mithilnettyfy

@mithilnettyfy hey there! Thank you for reaching out to us. We apologize for the inconvenience you have faced while training YOLOv8 on your NVIDIA GTX 1650 GPU.

The issue you are facing could be related to a compatibility issue with your GPU or with the use of Automatic Mixed Precision (AMP). We recommend trying the following solutions:

  1. Disable AMP by setting amp=False in train.py while training the YOLOv8 model.

  2. Force all variables to run in FP32 instead of using both FP16 and FP32 by disabling autocast. You can do this by setting autocast=False in train.py while training the model.

If neither of these solutions work, we recommend checking the compatibility of your GTX 1650 GPU with the CUDA version you are using. Some users have reported issues with AMP in CUDA 11.x and have solved the problem by reverting back to CUDA 10.x.
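
A hedged sketch of option 1 via the ultralytics Python API, assuming your installed version exposes the amp training argument (recent releases do; if the argument is rejected, your version predates it). The dataset and model names below are just placeholders from earlier in this thread.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="coco128.yaml", epochs=100, imgsz=640, batch=4, amp=False)  # amp=False -> pure FP32 training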

Please let us know if this helps resolve your issue or if you have any further questions.

glenn-jocher avatar Jun 10 '23 03:06 glenn-jocher

Hey @glenn-jocher, thank you so much for helping to resolve this issue. My program is working perfectly, but your second solution is not working. Could you please describe what exactly the second point means?

There is no autocast=False argument: https://docs.ultralytics.com/modes/train/#arguments

  1. Force all variables to run in FP32 instead of using both FP16 and FP32 by disabling autocast. You can do this by setting autocast=False in train.py while training the model.

Thank you in advance for your help. I appreciate it.


I am still getting 0 values for Box(P, R, mAP50, mAP50-95).

mithilnettyfy avatar Jun 12 '23 05:06 mithilnettyfy

Hello, I just faced the same problem on the AutoDL platform using YOLOv5. I solved it by cloning the latest version of YOLOv5 rather than using the YOLOv5 provided by the platform. I hope this tip helps.
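
For anyone who wants to do the same, the standard clone-and-install steps (as in the YOLOv5 README quickstart, not AutoDL-specific):

git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -r requirements.txt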

Chi-XU-Sean avatar Jun 29 '23 11:06 Chi-XU-Sean