
CUDA error when training on rectangular inputs at full resolution

Open eqsmy opened this issue 3 years ago • 1 comments

Search before asking

  • [X] I have searched the YOLOv8 issues and discussions and found no similar questions.

Question

The Component

Training

The Issue

I keep running into a CUDA error when trying to train on rectangular images. My images are 1920x1080, and I was able to train fine at the default image size of 640. I then tried specifying imgsz as [1920,1080] and hit the same error as #785, so following that thread I changed imgsz to 1280 with rect=True. That still wouldn't be full resolution, but better than 640.

I tried a few resolutions, but the largest I was able to successfully start training on was imgsz=1056, which is the largest multiple of 32 less than the short side of my images (1080). I'm training on that now, so I'll see how it goes. But I would like to use as close to full resolution as possible, because some of the objects I'm trying to detect are quite small.
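For reference, the largest stride-aligned size for a given dimension can be computed directly rather than by trial and error. This is a quick sketch; `largest_valid_imgsz` is a hypothetical helper, not part of the ultralytics API, and 32 is the usual YOLOv8 maximum stride:

```python
def largest_valid_imgsz(dim: int, stride: int = 32) -> int:
    """Return the largest multiple of `stride` that is <= dim."""
    # Integer division floors toward zero for positive dims,
    # so this snaps down to the nearest stride boundary.
    return (dim // stride) * stride

print(largest_valid_imgsz(1080))  # -> 1056, the value found above by hand
print(largest_valid_imgsz(1920))  # -> 1920, already stride-aligned
```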

The Call

results = model.train(data=f"{path2}", epochs=100, device=0, imgsz=1280, rect=True, cache=False)

Backtrace

Traceback (most recent call last):
  File "train.py", line 32, in <module>
    train()
  File "train.py", line 13, in train
    results = model.train(data=f"{path2}", epochs=100, device=0, imgsz=1280, rect=True, cache=False)  # train the model
  File "/home/elysiaqs/.local/lib/python3.8/site-packages/ultralytics/yolo/engine/model.py", line 201, in train
    self.trainer.train()
  File "/home/elysiaqs/.local/lib/python3.8/site-packages/ultralytics/yolo/engine/trainer.py", line 183, in train
    self._do_train(int(os.getenv("RANK", -1)), world_size)
  File "/home/elysiaqs/.local/lib/python3.8/site-packages/ultralytics/yolo/engine/trainer.py", line 309, in _do_train
    self.scaler.scale(self.loss).backward()
  File "/home/elysiaqs/.local/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/elysiaqs/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Additional

No response

eqsmy avatar Feb 03 '23 18:02 eqsmy

I see the same error today. I added some new images with a new class on Roboflow and generated a new dataset from it, and then YOLOv8 threw this error. When I use the old dataset, it works fine, so I'm not sure whether the problem is with YOLOv8 or with the Roboflow dataset. Searching this error online suggests the problem is the number of classes.
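One quick way to check the class-count theory is to scan the new dataset's label files for a class index at or above the `nc` declared in data.yaml; out-of-range class indices are a common cause of opaque CUDA errors during training. This is a rough sketch under the assumption of YOLO-format .txt labels (class index first on each line); `check_label_classes` is a hypothetical helper, not part of ultralytics:

```python
from pathlib import Path

def check_label_classes(labels_dir: str, nc: int) -> list:
    """Return (filename, class_id) pairs whose class index exceeds nc - 1."""
    bad = []
    for txt in Path(labels_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            if line.strip():
                # YOLO label format: "<class> <x> <y> <w> <h>" per object
                cls = int(line.split()[0])
                if cls >= nc:
                    bad.append((txt.name, cls))
    return bad

# Example: check_label_classes("datasets/my_data/labels/train", nc=3)
# Any non-empty result means the labels disagree with data.yaml.
```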

darouwan avatar Feb 11 '23 17:02 darouwan

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

  • Docs: https://docs.ultralytics.com
  • HUB: https://hub.ultralytics.com
  • Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions[bot] avatar Mar 14 '23 00:03 github-actions[bot]

@darouwan it's unfortunate that you're encountering this CUDA error. The error message points to a CUDA-level failure rather than a specific class problem, but it is still worth checking whether changes in class distribution or dataset properties between the old and new datasets are triggering it.

In general, optimizing the choice of input resolution via imgsz and the rect flag helps balance speed and accuracy during training. It's commendable that you're striving for the highest feasible resolution to detect small objects. Keep in mind that excessively large images may cause GPU memory issues, so finding the right trade-off between resolution and memory usage is vital.

To address the CUDA error, try updating your GPU drivers, using the latest PyTorch version, and ensuring that your CUDA toolkit is compatible with PyTorch. Additionally, setting CUDA_LAUNCH_BLOCKING=1 can help diagnose asynchronous CUDA kernel errors.
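As one way to apply that suggestion, the flag can be set from Python at the very top of the training script, before any CUDA work happens. A minimal sketch (diagnostic only, since synchronous launches slow training down):

```python
import os

# Must be set before torch initializes CUDA: forces synchronous kernel
# launches so errors surface at the offending call instead of a later one.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ... then import torch / ultralytics and run training as usual.
```

Setting the variable in the shell (`CUDA_LAUNCH_BLOCKING=1 python train.py`) works equally well and avoids editing the script.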

Lastly, if the issue persists, consider raising an issue on the YOLOv8 repo with additional details about your setup, including GPU type, PyTorch version, and any relevant environment configuration, so the problem can be reproduced and resolved more quickly.

I hope these suggestions lead to a successful resolution!

pderrenger avatar Nov 16 '23 06:11 pderrenger