Resolving MPS Incompatibility for Yolo V8 Training on MacBook Pro M3 with GPU

Open RaulArtigues opened this issue 1 year ago • 17 comments

Search before asking

  • [X] I have searched the HUB issues and found no similar bug report.

HUB Component

No response

Bug

Hello,

I am encountering a similar issue when attempting to conduct Yolo V8 training using the Ultralytics library. I am faced with an MPS incompatibility issue on my MacBook Pro M3. It is crucial to emphasize that I am aiming to utilize the computer's GPU and NOT the CPU. Could you provide guidance on how to overcome this compatibility challenge so that I can efficiently use the GPU for my training?

Thank you very much for your assistance.

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

RaulArtigues avatar Feb 28 '24 13:02 RaulArtigues

👋 Hello @RaulArtigues, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

github-actions[bot] avatar Feb 28 '24 13:02 github-actions[bot]

@RaulArtigues Thanks for raising this. I am looking into a solution at the moment after I saw your response on the existing thread. I will get back when we find an answer.

kalenmike avatar Feb 28 '24 13:02 kalenmike

@RaulArtigues We need to add support for MPS to HUB. I will try to get this out today.

kalenmike avatar Feb 28 '24 13:02 kalenmike

OK. Thank you

RaulArtigues avatar Feb 28 '24 14:02 RaulArtigues

If you need me to provide more information about the error or screenshot, tell me.

RaulArtigues avatar Feb 28 '24 14:02 RaulArtigues

@RaulArtigues thank you for offering additional details! For now, we should have enough information to investigate the MPS incompatibility issue. If we require any further information or screenshots, we'll reach out. Your cooperation is greatly appreciated! 🙌

UltralyticsAssistant avatar Feb 28 '24 22:02 UltralyticsAssistant

Hi there,

I think I hit a similar issue when trying to train on an M3 MacBook Pro. Using model.train(data='data.yaml',epochs=1,imgsz=100,device='mps') I got: RuntimeError: Trying to create tensor with negative dimension -1: [1, -1, 5]

Could it also be related to the GitHub issue "Runtime Error when Training While Using MPS" (#7807)?

I tried re-running with the batch parameter set to the default fixed value, batch=16: model.train(data='data.yaml',epochs=1,imgsz=100,device='mps',batch=16) -> got the same error

I tried re-running with the dynamic value, batch=-1: model.train(data='data.yaml',epochs=1,imgsz=100,device='mps',batch=-1) -> got AssertionError: Torch not compiled with CUDA enabled

It looks like training runs, but it fails when updating the model and cfg after training.
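For anyone reproducing this, here is a minimal sketch of how the device and batch arguments could be chosen before calling model.train. It assumes, based only on the AssertionError reported above, that batch=-1 went down a CUDA-only path in the version used here. pick_device, train_args and fallback_batch are hypothetical names for illustration, not part of the Ultralytics API.

```python
def pick_device():
    """Return 'mps', 'cuda' or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is no accelerator to use
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


def train_args(device, batch=16, fallback_batch=16):
    """Build keyword arguments like the ones used in the calls above.

    Assumption: batch=-1 (automatic batch size) needs CUDA, so on any other
    device it is replaced by a fixed value.
    """
    if batch == -1 and device != "cuda":
        batch = fallback_batch
    return {"data": "data.yaml", "epochs": 1, "imgsz": 100,
            "device": device, "batch": batch}


if __name__ == "__main__":
    args = train_args(pick_device(), batch=-1)
    print(args)
    # With Ultralytics installed, these would be passed as model.train(**args).
```

This only avoids the CUDA assertion. It does not address the negative dimension error, which also appears with a fixed batch size.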

Thanks for your help

cl3m3nt avatar Mar 02 '24 17:03 cl3m3nt

@cl3m3nt hello,

It sounds like you're encountering a complex issue related to MPS support and tensor dimensions. The error message you're seeing, "Trying to create tensor with negative dimension -1," typically indicates a problem with the input data or model configuration rather than a direct issue with MPS compatibility. However, the subsequent errors and your troubleshooting steps suggest there might be deeper compatibility concerns with MPS on the M3 MacBook Pro.

For now, please ensure your dataset is correctly formatted and accessible. Also, double-check your model configuration for any potential issues. Since you've already attempted adjusting the batch size without success, it might not be a straightforward configuration issue.

Given the complexity and the potential relation to MPS support, we'll need to investigate this further. Your detailed report is very helpful, and we appreciate your patience as we work through these challenges. We're committed to improving MPS support and will keep the community updated on our progress.

Thank you for your understanding and cooperation.

UltralyticsAssistant avatar Mar 02 '24 20:03 UltralyticsAssistant

Having this issue too. M2. But only on certain datasets.

I am exporting my datasets from Roboflow.

Setting batch=2 solves it!

fosteman avatar Mar 02 '24 20:03 fosteman

Hi UltralyticsAssistant,

Thanks for your prompt reply. The dataset I am using is the same one that works without the device='mps' option. It is exported from Roboflow and uses the YOLO format. The model is also the same one that works without device='mps'. Training works fine when the device option is not set.

I appreciate your help, no hurry ;)

cl3m3nt avatar Mar 03 '24 12:03 cl3m3nt

@cl3m3nt, thank you for the additional details and for your patience. It's insightful to know that your dataset and model work fine without specifying device='mps'. This indeed suggests that the issue is more closely related to MPS support rather than the dataset or model configuration itself.

Given that the training operates as expected on other devices but encounters issues on MPS, this reinforces the need for a deeper dive into how MPS compatibility is handled, especially with the nuances introduced by different MacBook models and their GPU capabilities.

Your understanding and cooperation are much appreciated as we navigate through these challenges. Rest assured, we are actively working on enhancing MPS support to ensure a smoother experience across all compatible devices. Your case provides valuable insight into the specific scenarios we need to address.

Thank you once again for your contribution and for being part of the Ultralytics community. We'll keep you updated on our progress.

UltralyticsAssistant avatar Mar 03 '24 15:03 UltralyticsAssistant

The problem appears in ultralytics/utils/loss.py. I'm not sure how this change will affect the model later on, but I solved it this way:

image

NikyParfenov avatar Sep 26 '24 15:09 NikyParfenov

Hello,

Thank you for sharing your solution and insights! It's great to see community members actively troubleshooting and contributing to potential fixes.

To ensure the issue is fully addressed, please check whether the problem persists in the latest version of the Ultralytics packages. We regularly update our codebase, and your feedback is invaluable in helping us improve.

If you encounter any further issues or have additional insights, feel free to share them here. Your contributions are highly appreciated by the entire YOLO community! 😊

Thank you for being an active part of our community!

glenn-jocher avatar Sep 26 '24 18:09 glenn-jocher

Hi! Yep, I tried to upgrade ultralytics but the problem still exists. So, I solved it again, using max(0, counts.max()) :)
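A minimal, library-free sketch of what the max(0, counts.max()) clamp does. It assumes the -1 in the reported shape [1, -1, 5] comes from the maximum of the per-image box counts. padded_target_shape is a hypothetical helper for illustration, not the code in ultralytics/utils/loss.py.

```python
def padded_target_shape(batch_size, counts):
    """Return the (batch, max_boxes, 5) shape of a padded target buffer.

    `counts` holds the number of boxes per image. The clamp mirrors the
    max(0, counts.max()) workaround from this thread.
    """
    raw_max = max(counts) if len(counts) else 0
    max_boxes = max(0, raw_max)  # a negative value must never become a dimension
    return (batch_size, max_boxes, 5)


# A bogus maximum of -1 is clamped to an empty box dimension.
print(padded_target_shape(1, [-1]))    # (1, 0, 5)
# Valid counts pass through unchanged.
print(padded_target_shape(2, [3, 1]))  # (2, 3, 5)
```

The clamp only keeps the allocation valid. If the count really is wrong on MPS, the boxes for that batch would be dropped, so the effect on training results is worth checking.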

NikyParfenov avatar Sep 27 '24 12:09 NikyParfenov

Hi there!

Thanks for confirming that the issue persists in the latest version and for sharing your workaround using max(0, counts.max()). It's great to see your proactive approach! 😊

We'll continue to investigate this on our end to ensure a more robust solution. If you have any more insights or encounter further issues, please feel free to share them. Your contributions are incredibly valuable to us and the entire YOLO community.

Thank you for your support and collaboration!

glenn-jocher avatar Sep 27 '24 19:09 glenn-jocher