Model training not getting completed/ Disconnected. Stuck at 100%

Open · Sudhir1609 opened this issue 1 year ago • 18 comments

Search before asking

  • [X] I have searched the HUB issues and found no similar bug report.

HUB Component

Models

Bug

It's constantly getting stuck at 100% and not completing.

Model12

Environment

  • Ultralytics HUB Version: v0.1.79
  • Client User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
  • Operating System: Linux x86_64
  • Browser Window Size: 1848 x 932
  • Server Timestamp: 1734061093

Minimal Reproducible Example

No response

Additional

No response

Sudhir1609 · Dec 13 '24 03:12

👋 Hello @Sudhir1609, thank you for reporting an issue about Ultralytics HUB 🚀! Please check out our HUB Docs for more information:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

It looks like you've reported a 🐛 bug where the model gets stuck at 100% completion and doesn't finalize. To help us investigate and resolve this, could you please provide a minimum reproducible example (MRE)? This includes:

  1. Detailed steps to reproduce the issue you're encountering.
  2. Screenshots or relevant logs that might give us more context.
  3. Information about any specific datasets, tasks, or customized configurations involved.

For guidance on creating an MRE, visit our Minimum Reproducible Example guide. 🛠️
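If you started the run from the Python client rather than the HUB UI, a minimal sketch of such a script is shown below. It follows the documented `hub.login` / `YOLO(model_url)` pattern; the API key and model ID are placeholders, not values from this issue:

    from ultralytics import YOLO, hub

    # Authenticate against Ultralytics HUB (placeholder API key).
    hub.login("YOUR_API_KEY")

    # Load the HUB model by its URL; training arguments come from the
    # model configuration stored in HUB.
    model = YOLO("https://hub.ultralytics.com/models/MODEL_ID")

    # Start (or resume) training from the last saved checkpoint.
    results = model.train()

Including the exact script or UI steps you used, plus the console output around the point where the run stops, makes the issue much easier to reproduce.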

An Ultralytics engineer will also review your issue and assist you shortly. Thank you for bringing this to our attention and for your patience! 😊

UltralyticsAssistant · Dec 13 '24 03:12

@Sudhir1609 Hello! Can you share your model ID? You can find it on the URL of your model's page.

sergiuwaxmann · Dec 13 '24 07:12

@sergiuwaxmann Is this the correct URL? https://hub.ultralytics.com/models/dUv8hW4dgKL3dJu6TdaV

Sudhir1609 · Dec 13 '24 08:12

@Sudhir1609 Yes, this URL points to your model. I can see your model is disconnected and the last epoch is 95. You can try resuming the training while we investigate this issue further.

sergiuwaxmann · Dec 13 '24 09:12

I've tried 'Resume Training' 5-6 times now, and every time it gets disconnected around the same epoch. I'm worried about losing my funds too.

Sudhir1609 · Dec 13 '24 09:12

Thank you for sharing the update, @Sudhir1609! I understand how frustrating this must be, especially with the concern about funds.

To address this, please try the following steps:

  1. Check Your Internet Stability: Cloud training sessions can sometimes disconnect if there are interruptions in your network stability, so ensure you're on a reliable connection.

  2. Inspect the Logs: From the model page, review the training logs to see if there's any specific error or indication of what's causing the disconnection.

  3. Resume Training: Since the issue persists around the same epoch, try reducing your batch size or tweaking your dataset settings to see if that resolves any potential resource constraints. You can adjust these settings when resuming training.

  4. Funds and Billing: Rest assured, the HUB deducts funds only for completed epochs. If the session disconnects before completing an epoch, the balance for that epoch should not be affected. You can verify this via the Billing tab in the HUB.

If the issue persists and you've already tried the above steps, please let us know. You can also share with us any specific error messages or logs that appear before the disconnection. We'll investigate further to ensure this gets resolved for you.

Thank you for your patience! 😊

pderrenger · Dec 13 '24 10:12

I'm not able to change any configuration; only the Resume Training option is enabled, along with the option to change the Instance. How can I reduce the number of epochs or tweak my dataset settings? @pderrenger

Sudhir1609 · Dec 13 '24 10:12

@Sudhir1609 Unfortunately, the number of epochs can't be changed after the model has started training. Apologies for the inconvenience; we will refund the account balance you used so far for this training, as we can see you tried resuming several times. Once we do this (you should see the balance back in your account in about 30 minutes), maybe you can try creating a new model and starting a fresh training?

sergiuwaxmann · Dec 13 '24 10:12

@sergiuwaxmann Thanks, I was facing the same problem and tried the same steps for this model too. https://hub.ultralytics.com/models/MD72j92nP9uX9fwShDIS

Thanks for your help!

Sudhir1609 · Dec 13 '24 11:12

@Sudhir1609 You should have your account balance back. Maybe you can try choosing a different GPU? Which GPU did you use for the trainings that failed?

sergiuwaxmann · Dec 13 '24 11:12

I believe the size of your dataset causes an OOM (out-of-memory) issue, but we are still investigating this.
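If the OOM hypothesis turns out to be right, one way to probe it is a local run with reduced memory settings, since a cloud training's configuration can't be changed once the run has started. A rough sketch, with a placeholder model, placeholder dataset YAML, and hypothetical values:

    from ultralytics import YOLO

    # Hypothetical local reproduction with reduced GPU-memory pressure.
    # "my_dataset.yaml" is a placeholder; point `data` at your own dataset.
    model = YOLO("yolo11s.pt")
    model.train(
        data="my_dataset.yaml",
        epochs=100,
        imgsz=640,    # smaller image size lowers GPU memory use
        batch=8,      # smaller batch size, or batch=-1 for AutoBatch
        cache=False,  # avoid caching a very large dataset in RAM
    )

If the same dataset survives the problematic epoch locally with a smaller batch, that would support the OOM theory.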

sergiuwaxmann · Dec 13 '24 11:12

@sergiuwaxmann I tried changing the instance between NVIDIA GeForce RTX 4090 and NVIDIA L40.

Thanks for the update. I'll try changing my dataset and try again.

Sudhir1609 · Dec 14 '24 04:12

@sergiuwaxmann I changed the dataset size and tried training the model again, but faced the same problem: https://hub.ultralytics.com/models/zEDjZlwIbNiMnrD1qVtT

Can you please let me know about this?

Sudhir1609 · Dec 16 '24 06:12

@Sudhir1609 Thank you for your patience as we continue to investigate this issue. We're currently working to identify the root cause, but reproducing the problem has been challenging due to the large size of the dataset involved.

Please rest assured that we're actively working on this and will keep you updated as soon as we have more information. Apologies for the inconvenience, and thank you for your understanding! 🙏

yogendrasinghx · Dec 16 '24 10:12

@Sudhir1609

Thank you for your patience and understanding as we looked into this issue. We have successfully reproduced the issue on our end and identified the root cause. The development team has been informed and is actively working on a fix.

We appreciate your cooperation and will update you as soon as the fix is deployed.

Thank you! 😊

yogendrasinghx · Dec 18 '24 07:12

@yogendrasinghx Sure, thanks for the update. Please let me know once the problem is fixed. I hope I can train the model seamlessly soon.

Sudhir1609 · Dec 19 '24 04:12

Probably the same issue. I tried to train yolo11s on the hand keypoints dataset (https://docs.ultralytics.com/ru/datasets/pose/hand-keypoints/). The model was trained on my own hardware, and in the server console, after the last epoch, I get a message that the finished model was uploaded to the HUB:

    Speed: 0.1ms preprocess, 2.6ms inference, 0.0ms loss, 0.5ms postprocess per image
    Results saved to runs/pose/train2
    Ultralytics HUB: Syncing final model... 100%|██████████| 19.5M/19.5M [00:01<00:00, 11.9MB/s]
    Ultralytics HUB: Done
    Ultralytics HUB: View model at https://hub.ultralytics.com/models/gZakA8v871qsfruVApnw

In the HUB itself it says "0 epochs remaining" and also shows plots for all 100 epochs, but the status is "Disconnected. Checkpoint saved for epoch 11." It seems that after some point the checkpoints just stop being recorded, for some unknown reason.

[Screenshot of the model page attached]

The Preview and Deploy tabs are not available, with the message "Model Not Trained!". I didn't attach the entire output of the training script for all 100 epochs since it's quite long. I can only say that there was not a single traceback about loss of connection or anything similar, and no errors seemed to occur on my server's side at all. When I try to run the training script on the server again, training tries to start from epoch 11 but ends with this traceback:

    Ultralytics HUB: New authentication successful ✅
    Ultralytics HUB: View model at https://hub.ultralytics.com/models/gZakA8v871qsfruVApnw
    Found https://storage.googleapis.com/ultralytics-hub.appspot.com/users/XMxHsmLiXfWigzc4Pp3jV3shn0C2/models/gZakA8v871qsfruVApnw/epoch-11.pt locally at weights/epoch-11.pt
    Traceback (most recent call last):
      File "/home/paulo/sources/yolo11_hand_pose/train.py", line 6, in <module>
        model = YOLO('https://hub.ultralytics.com/models/gZakA8v871qsfruVApnw')
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/ultralytics/models/yolo/model.py", line 23, in __init__
        super().__init__(model=model, task=task, verbose=verbose)
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/ultralytics/engine/model.py", line 148, in __init__
        self._load(model, task=task)
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/ultralytics/engine/model.py", line 290, in _load
        self.model, self.ckpt = attempt_load_one_weight(weights)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/ultralytics/nn/tasks.py", line 1039, in attempt_load_one_weight
        ckpt, weight = torch_safe_load(weight)  # load ckpt
                       ^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/ultralytics/nn/tasks.py", line 944, in torch_safe_load
        ckpt = torch.load(file, map_location="cpu")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/ultralytics/utils/patches.py", line 86, in torch_load
        return _torch_load(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/torch/serialization.py", line 1495, in load
        return _legacy_load(
               ^^^^^^^^^^^^^
      File "/home/paulo/sources/yolo11_hand_pose/yolo11_hand_pose_env/lib64/python3/site-packages/torch/serialization.py", line 1744, in _legacy_load
        magic_number = pickle_module.load(f, **pickle_load_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    _pickle.UnpicklingError: invalid load key, '<'.

I have tried creating a project and training a model before and ran into a similar problem. There I was able to run the training again, but the checkpoints still would not record beyond the point where they got stuck.

antoune-trash · Mar 21 '25 00:03

Thank you for the detailed report and for including the traceback! 🛠️ This appears to be a checkpoint syncing issue that we're actively investigating. Here's what we recommend:

  1. New Model Instance: Since the existing checkpoint file (epoch-11.pt) appears corrupted (as indicated by the UnpicklingError), please create a fresh model in the Ultralytics HUB and start new training. You can use the same hand-keypoints dataset configuration from the Ultralytics Hand Keypoints documentation.

  2. Epoch Validation: When starting fresh training, you might want to:

    model.train(..., epochs=100, patience=50)  # Adjust patience relative to epochs
    

    Here, patience controls early stopping: training halts if no improvement is seen for that many epochs, so keeping it proportional to the total epoch count avoids stopping a run prematurely.

  3. Checkpoint Monitoring: After starting new training, watch the HUB Training Logs in real-time to catch any mid-training sync issues.

  4. Network Stability: Ensure stable internet connection during training, as checkpoints sync automatically after each epoch. Our Cloud Training solution can help avoid local network issues.

We're working on improving checkpoint resilience in these edge cases. For immediate needs, the fresh training approach should unblock you. Let us know if the issue persists in new trainings, and we'll prioritize further investigation.
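As a side note on the UnpicklingError itself: "invalid load key, '<'" typically means the file on disk starts with a '<' character, i.e. an HTML/XML error page was saved in place of a PyTorch checkpoint. A small, hypothetical check you could run on the cached file from the traceback before resuming:

    from pathlib import Path

    # Path taken from the traceback above; adjust if your weights live elsewhere.
    ckpt = Path("weights/epoch-11.pt")
    with ckpt.open("rb") as f:
        head = f.read(64)

    if head.lstrip().startswith(b"<"):
        # The "checkpoint" is actually an HTML/XML error page, not a torch file.
        print("Not a valid checkpoint - delete weights/epoch-11.pt and let it re-download.")
    else:
        print("Header looks binary; the checkpoint may still be loadable.")
        # Optional deeper check:
        # import torch; torch.load(ckpt, map_location="cpu")

Deleting the corrupted local file before re-running should at least force a fresh download attempt of the checkpoint.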

Note: If you're using this for internal/proprietary purposes, please ensure compliance with our AGPL-3.0 license requirements.

pderrenger · Mar 21 '25 07:03

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

  • Docs: https://docs.ultralytics.com
  • HUB: https://hub.ultralytics.com
  • Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions[bot] · Nov 23 '25 00:11