Problem resuming training in Google Colab

Open sebasmej opened this issue 9 months ago • 2 comments

Search before asking

  • [X] I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I am training a model in Google Colab (this is not the first model I have trained this way). When I try to resume training by running these commands:

%pip install ultralytics  # install
from ultralytics import YOLO, checks, hub
checks()  # checks

hub.login('my_API_KEY')
model = YOLO('my_MODEL_ID')
results = model.train()

the following error message appears:

requirements: Ultralytics requirement ['hub-sdk>=0.0.6'] not found, attempting AutoUpdate...
Collecting hub-sdk>=0.0.6
  Downloading hub_sdk-0.0.8-py3-none-any.whl (40 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.9/40.9 kB 2.4 MB/s eta 0:00:00
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from hub-sdk>=0.0.6) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.6) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.6) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.6) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.6) (2024.2.2)
Installing collected packages: hub-sdk
Successfully installed hub-sdk-0.0.8

requirements: AutoUpdate success ✅ 6.0s, installed 1 package: ['hub-sdk>=0.0.6']
requirements: ⚠️ Restart runtime or rerun command for updates to take effect

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp 🚀
Downloading https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt to 'epoch-32.pt'...
⚠️ Download failure, retrying 1/3 https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=firebase-adminsdk-jsjt9%40ultralytics-hub.iam.gserviceaccount.com%2F20240430%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240430T070930Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=5bc8969abb8a1e1ee6b7518609a6a883f6276e7f9ee851ce01edf394f85b58a1e595a775690399a9f569b14cbfe7b3fd299049484b41a34e9cdc8002ed711d399bd0d2b61c01776b258a87ba3bf78b786a522e601f1413e508d8e3d61c6f0d89e76fe6cdc64a0b8e726cf24b0c701c9a6a679cce954bd385cd4714d92ba336c9bb6faea48f3bcb3eecfecdaa7e1fb7b4316bc34d042a31c79f79c4ea764d54e3632132246cbe6e9f37d494f87f9361d0251673517fbd03a6522650f9c3cfedfaf96526ef8f4a64a1da97e8d7904493489c484e339b72390012ad33d5faf7c172e810364057072fd535abb3f4368e96b0a8aa132eb2402b5b93959369719b5e59...
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-2-600de03de6f2> in <cell line: 3>()
      1 hub.login('67f19bbd86bcc04db7747d501c4e11246ac092e81a')
      2 
----> 3 model = YOLO('https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp')
      4 results = model.train()

6 frames
/usr/local/lib/python3.10/dist-packages/ultralytics/models/yolo/model.py in __init__(self, model, task, verbose)
     21         else:
     22             # Continue with default YOLO initialization
---> 23             super().__init__(model=model, task=task, verbose=verbose)
     24 
     25     @property

/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py in __init__(self, model, task, verbose)
    149             self._new(model, task=task, verbose=verbose)
    150         else:
--> 151             self._load(model, task=task)
    152 
    153     def __call__(

/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py in _load(self, weights, task)
    238 
    239         if Path(weights).suffix == ".pt":
--> 240             self.model, self.ckpt = attempt_load_one_weight(weights)
    241             self.task = self.model.args["task"]
    242             self.overrides = self.model.args = self._reset_ckpt_args(self.model.args)

/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py in attempt_load_one_weight(weight, device, inplace, fuse)
    804 def attempt_load_one_weight(weight, device=None, inplace=True, fuse=False):
    805     """Loads a single model weights."""
--> 806     ckpt, weight = torch_safe_load(weight)  # load ckpt
    807     args = {**DEFAULT_CFG_DICT, **(ckpt.get("train_args", {}))}  # combine model and default args, preferring model args
    808     model = (ckpt.get("ema") or ckpt["model"]).to(device).float()  # FP32 model

/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py in torch_safe_load(weight)
    730             }
    731         ):  # for legacy 8.0 Classify and Pose models
--> 732             ckpt = torch.load(file, map_location="cpu")
    733 
    734     except ModuleNotFoundError as e:  # e.name is missing module name

/usr/local/lib/python3.10/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, weights_only, mmap, **pickle_load_args)
   1038             except RuntimeError as e:
   1039                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
-> 1040         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
   1041 
   1042 

/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1256             "functionality.")
   1257 
-> 1258     magic_number = pickle_module.load(f, **pickle_load_args)
   1259     if magic_number != MAGIC_NUMBER:
   1260         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

Environment

Google Colab

Minimal Reproducible Example

  1. Log in to HUB
  2. Find the model to train
  3. Click to copy the Colab code
  4. Follow the steps in the Google Colab notebook
  5. The error appears

Additional

No response

sebasmej, Apr 30 '24 14:04

👋 Hello @sebasmej, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

github-actions[bot], Apr 30 '24 14:04

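Hello! It looks like the download of your model weights failed (the log shows a "Download failure, retrying 1/3" warning), which left a corrupted epoch-32.pt file on disk. The "invalid load key, '<'" error means the file that was loaded begins with the character '<', which usually indicates that an HTML or XML response was saved in place of the checkpoint. This can occasionally happen because of network connectivity problems or server-side issues.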

Here's a quick checklist to try and resolve this problem:

  1. Rerun the Training Cell: Sometimes, simply rerunning the command can resolve the issue as it might have been a temporary connectivity problem.
  2. Check Internet Connection: Ensure your Colab notebook has a stable internet connection. Changing network environments can sometimes help.
  3. Clear Colab Environment: Restart your Colab runtime and clear any cached data. Also delete any corrupted weight file that was already downloaded (here, epoch-32.pt) so that it is fetched again on the next run.
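
As a rough illustration of the last step, the sketch below inspects the first bytes of a downloaded file to guess whether it is a checkpoint or a saved error page, and deletes it in the latter case. It is a heuristic, not part of the Ultralytics API: the function names `classify_download` and `remove_if_not_checkpoint` are hypothetical, and it assumes the checkpoint was written in the zip-based format that `torch.save` uses by default in recent PyTorch versions.

```python
from pathlib import Path

# Assumption: checkpoints written by torch.save in its default zip-based
# format start with the standard zip signature.
ZIP_MAGIC = b"PK\x03\x04"


def classify_download(path):
    """Guess what a downloaded file contains from its first bytes (heuristic)."""
    with open(path, "rb") as f:
        head = f.read(64)
    if not head:
        return "empty"
    if head.startswith(ZIP_MAGIC):
        return "zip"  # plausibly a PyTorch checkpoint
    if head.lstrip().startswith(b"<"):
        return "markup"  # looks like an HTML/XML response, not weights
    return "unknown"  # e.g. a legacy-format checkpoint or something else


def remove_if_not_checkpoint(path):
    """Delete the file if it is empty or markup; return True if it was deleted."""
    p = Path(path)
    if p.is_file() and classify_download(p) in {"empty", "markup"}:
        p.unlink()
        return True
    return False
```

In a Colab cell, calling `remove_if_not_checkpoint("epoch-32.pt")` before rerunning the training cell would remove a file like the one in this report. A file classified as "unknown" is left alone, since the check cannot tell whether it is valid.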

Should the issue persist after these steps, please open a new issue that includes the full error output from the rerun so we can investigate further. Some failures are tied to transient conditions on the server or network, and fresh output helps us tell whether there is a new problem.
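
One detail that may be worth including in such a report is the validity window of the download link. The retry URL in the log above carries `X-Goog-Date` and `X-Goog-Expires` query parameters, which in Google Cloud Storage signed URLs give the signing time and the lifetime in seconds. The sketch below computes the expiry from those two parameters; the helper name `signed_url_expiry` is hypothetical, and nothing in this thread establishes that expiry caused the failure.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the UTC expiry time encoded in a signed URL's query, or None."""
    query = parse_qs(urlsplit(url).query)
    try:
        # X-Goog-Date is the signing time, formatted like 20240430T070930Z.
        issued = datetime.strptime(
            query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ"
        ).replace(tzinfo=timezone.utc)
        # X-Goog-Expires is the lifetime of the URL in seconds.
        lifetime = timedelta(seconds=int(query["X-Goog-Expires"][0]))
    except (KeyError, ValueError):
        return None  # parameters missing or malformed
    return issued + lifetime
```

Comparing the returned value with `datetime.now(timezone.utc)` at the time of the failed download shows whether the link was still valid when it was used.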

Thank you for reaching out! Your contributions help the community and the development of our platform. 🚀

pderrenger, Apr 30 '24 22:04

Closing this issue as a duplicate of #674.

sergiuwaxmann, May 06 '24 12:05