Training not working when dataset is defined as clearml://xxxxxxxx
Search before asking
- [X] I have searched the YOLOv5 issues and found no similar bug report.
YOLOv5 Component
Training
Bug
Using the following command line:
python3 train.py --img 512 --rect --batch 250 --epochs 300 --workers 64 --data clearml://6a85a2b1ee014e0e9de0b53a8c957df9
The dataset is downloaded into the correct folder, but we get the following error when train.py starts to load the dataset:
2023-01-05 11:00:27,352 - clearml - INFO - Dataset.get() did not specify alias. Dataset information will not be automatically logged in ClearML Server.
ClearML: WARNING ⚠️ ClearML is installed but not configured, skipping ClearML logging. See https://github.com/ultralytics/yolov5/tree/master/utils/loggers/clearml#readme
Traceback (most recent call last):
File "train.py", line 635, in <module>
main(opt)
File "train.py", line 529, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 112, in train
data_dict = data_dict or check_dataset(data) # check if None
File "/home/paco/AI/aistage3/yolov5/utils/general.py", line 595, in yaml_load
with open(file, errors='ignore') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'clearml://6a85a2b1ee014e0e9de0b53a8c957df9'
Environment
Ubuntu 22.04, latest YOLOv5
GitPython == 3.1.29
Pillow == 9.2.0
PyYAML == 6.0
clearml == 1.9.0
ipython == 8.4.0
matplotlib == 3.5.2
numpy == 1.23.1
opencv_python == 4.6.0.66
pandas == 1.4.3
psutil == 5.9.1
requests == 2.28.1
scipy == 1.8.1
seaborn == 0.11.2
setuptools == 44.0.0
tensorboard == 2.9.1
thop == 0.1.1.post2207130030
torch == 1.13.0
torchvision == 0.14.0
tqdm == 4.64.0
Minimal Reproducible Example
Any ClearML dataset ID will be enough:
python3 train.py --img 512 --rect --batch 250 --epochs 300 --workers 64 --data clearml://6a85a2b1ee014e0e9de0b53a8c957df9
Additional
No response
Are you willing to submit a PR?
- [ ] Yes I'd like to help by submitting a PR!
With the ClearML integration, YOLOv5 tries to download a local copy of the dataset. You need to make sure that the `file` argument you're passing to the `yaml_load` function is indeed the correct location of the YAML file (`data_config.yaml`, for example) inside the dataset folder that gets downloaded onto your machine. Something like: `clearml://6a85a2b1ee014e0e9de0b53a8c957df9/data_config.yaml`
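For reference, here is a minimal sketch of how a `clearml://` dataset ID can be resolved to a local data YAML. It roughly mirrors what the integration's `construct_dataset()` helper in utils/loggers/clearml/clearml_utils.py does, but the function name and lookup logic below are illustrative assumptions, not the actual implementation:

```python
# Sketch: resolve a clearml://<dataset_id> string to a local data YAML path.
# Illustrative only; the real logic lives in utils/loggers/clearml/clearml_utils.py.
from pathlib import Path

from clearml import Dataset


def resolve_clearml_dataset(data_arg: str) -> str:
    """Download a ClearML dataset locally and return the path to its data YAML."""
    dataset_id = data_arg.replace('clearml://', '')
    dataset = Dataset.get(dataset_id=dataset_id)  # requires a configured clearml.conf
    local_root = Path(dataset.get_local_copy())   # cached local copy of the dataset files
    yaml_files = list(local_root.glob('*.yaml')) + list(local_root.glob('*.yml'))
    if not yaml_files:
        raise FileNotFoundError(f'No data YAML found in ClearML dataset {dataset_id}')
    return str(yaml_files[0])


# Hypothetical usage:
# data_yaml = resolve_clearml_dataset('clearml://6a85a2b1ee014e0e9de0b53a8c957df9')
```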
@bzisl Thanks for reporting!
What seems to have happened is that there was an error when initializing ClearML. This is the line that tells you this:
ClearML: WARNING ⚠️ ClearML is installed but not configured, skipping ClearML logging. See https://docs.ultralytics.com/yolov5/tutorials/clearml_logging_integration#readme
But this does not stop the execution flow, so YOLOv5 then tries to interpret the ClearML dataset ID as if it were a local file, which leads to the error.
First of all, can you please check that you are properly logged into ClearML and that you have run `clearml-init`? Without this, ClearML cannot initialize, and so it cannot handle the `clearml://` link.
If you are properly logged in but still getting the error, would you mind adding these 2 changes to the file at `utils/loggers/__init__.py` at line 123 (link)? They should print the error that is stopping ClearML from initializing, giving us more info to diagnose the issue :)
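The suggested changes aren't reproduced in this thread, but the idea is simply to surface the exception that is currently swallowed when the ClearML logger fails to initialize. A rough sketch of that pattern, assuming the logger is constructed as `ClearmlLogger(opt, hyp)` as in the current loggers code (not a verbatim patch):

```python
# Sketch: log the real exception instead of only the generic 'not configured' warning.
import logging

LOGGER = logging.getLogger('yolov5')


def init_clearml_logger(opt, hyp):
    try:
        from utils.loggers.clearml.clearml_utils import ClearmlLogger
        return ClearmlLogger(opt, hyp)  # succeeds once clearml-init has been run
    except Exception as e:
        # New: print the actual cause so we can see why ClearML failed to initialize
        LOGGER.warning(f'ClearML: WARNING ⚠️ init failed, skipping ClearML logging. Cause: {e}')
        return None
```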
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Access additional YOLOv5 🚀 resources:
- Wiki – https://github.com/ultralytics/yolov5/wiki
- Tutorials – https://docs.ultralytics.com/yolov5
- Docs – https://docs.ultralytics.com
Access additional Ultralytics ⚡ resources:
- Ultralytics HUB – https://ultralytics.com/hub
- Vision API – https://ultralytics.com/yolov5
- About Us – https://ultralytics.com/about
- Join Our Team – https://ultralytics.com/work
- Contact Us – https://ultralytics.com/contact
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Hi, I met a similar problem while working in the ultralytics/yolov5:v7.0 Docker image environment. Everything worked fine when training with 1 GPU, but a problem came up during DDP training. ClearML seemed to be working fine and the dataset had been downloaded locally successfully, but data_dict was not parsed correctly. I suspect this is related to the following part of train.py:
The line `data_dict = data_dict or check_dataset(data)` inside the `with torch_distributed_zero_first(LOCAL_RANK)` block didn't work well for a non-zero LOCAL_RANK: `data_dict` was still None, so in `data_dict = data_dict or check_dataset(data)` the `clearml://<id>` string was fed to `check_dataset()`, causing the FileNotFoundError.
I'm not sure how `with torch_distributed_zero_first(LOCAL_RANK)` works; does anyone have any clues? Your help will be much appreciated.
@zoidburg hi there! It looks like the issue you're experiencing has to do with using YOLOv5 in a distributed data parallel (DDP) training setup. Specifically, the `torch_distributed_zero_first` context manager makes all processes except the local master wait at a barrier, so that local rank 0 can perform setup work (such as preparing the dataset) before the other ranks proceed. When training with multiple GPUs using DDP, the `LOCAL_RANK` environment variable is set by PyTorch to specify the rank of the current GPU among all GPUs being used.
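For reference, the context manager is roughly the following (paraphrased; see utils/torch_utils.py for the exact current version):

```python
# Rough sketch of torch_distributed_zero_first (paraphrased from utils/torch_utils.py).
from contextlib import contextmanager

import torch.distributed as dist


@contextmanager
def torch_distributed_zero_first(local_rank: int):
    # All ranks except the local master wait at a barrier before the block runs...
    if local_rank not in (-1, 0):
        dist.barrier(device_ids=[local_rank])
    yield  # ...so local rank 0 executes the block body first
    if local_rank == 0:
        dist.barrier(device_ids=[0])  # rank 0 releases the waiting ranks once it is done
```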
It sounds like you're seeing `None` values for `data_dict` when running the code with `LOCAL_RANK` greater than 0. Without more context or information about the dataset you're using, it's difficult to say for certain what might be causing this issue. However, based on the error message you posted, it appears that the `check_dataset` function is looking for a local file at the path specified by the `file` variable. Since you're using ClearML as a provider for your dataset, the path specified by `file` should be a remote URI that ClearML can handle, like `clearml://<dataset_id>/<filename>`.
If you're encountering this error even with the correct URI format, the issue could potentially be related to the way that YOLOv5 handles DDP training. Unfortunately, without more information it's difficult to say for certain what the issue might be. If you could provide more context or details about the problem you're experiencing, I'd be happy to try to help you troubleshoot further.
@glenn-jocher, thanks for the reply! I was following this document to test the ClearML integration. I had ClearML installed and initialized successfully.
I reproduced this problem with the example coco128 dataset. I uploaded the dataset to ClearML as the tutorial describes, with the config coco128.yaml inside (you need to complete the missing keys/values in the original YAML). Then the ClearML dataset was used for training.
Training with 1 GPU worked just fine and the dataset was successfully loaded by the dataloader, so ClearML was properly configured; when training with 2 GPUs in DDP mode, the FileNotFoundError came up:
python -m torch.distributed.run --nproc_per_node 2 train.py --img 640 --batch 16 --epochs 3 --data clearml://3bf8443be3504d8ebc2d02c5bb3e225c --weights yolov5s.pt --cache
I tried to pass a parameter like `clearml://<dataset_id>/<yaml>` to the `check_dataset` function and it didn't work, but the path of the local copy of the ClearML dataset could be processed properly (something like `/opt/yolov5/.clearml/cache/storage_manager/datasets/xxxxx/yaml`). And I found that the ClearML dataset is handled in the `construct_dataset` function in utils/loggers/clearml/clearml_utils.py, using clearml.Dataset.
In `torch_distributed_zero_first`, torch.distributed.barrier() is used. The non-zero RANKs reach the barrier first and wait there until RANK 0 reaches the barrier; RANK 0 first yields and runs the code in the block, then reaches the barrier. This makes sure that RANK 0 processes `data_dict = data_dict or check_dataset(data)` while the other non-zero RANKs wait at the barrier. In RANK 0, as set by `data_dict = loggers.remote_dataset` in the previous `if RANK in {-1, 0}` block, `data_dict` has already been properly parsed, so `check_dataset` won't actually run. After processing this line, RANK 0 reaches the barrier and all RANKs carry on.
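To summarize that flow, the relevant train.py logic is roughly the following condensed outline (paraphrased from the lines cited above, not a verbatim or runnable excerpt):

```python
# Condensed outline of the train.py flow discussed here (paraphrased, not verbatim).
data_dict = None
if RANK in {-1, 0}:
    loggers = Loggers(save_dir, weights, opt, hyp, LOGGER)  # ClearML logger resolves clearml://<id>
    data_dict = loggers.remote_dataset                      # parsed dict on rank 0 only

with torch_distributed_zero_first(LOCAL_RANK):              # rank 0 runs the block body first
    data_dict = data_dict or check_dataset(data)            # non-zero ranks: data_dict is still None,
                                                            # so check_dataset('clearml://<id>') raises
```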
I printed the `data_dict` value in the `torch_distributed_zero_first` block, and the result shows that in the non-zero RANKs `data_dict` is still None, even though they waited until RANK 0 updated it. Shouldn't it be synced with the correct value from RANK 0?
Thank you as always!
@zoidburg Thanks for providing such a detailed and thorough analysis! After reviewing your findings, it seems the issue stems from how data is shared across the different ranks during DDP training. Based on your explanation, `data_dict` is not being propagated to all ranks, resulting in `None` values on non-zero ranks even after they wait for rank 0 to finish.
In a DDP setup, `torch.distributed.barrier()` is indeed used to synchronize processes at a specific point, ensuring that all processes reach the same state before proceeding. However, the barrier only synchronizes timing: each DDP rank is a separate Python process, so a variable like `data_dict` that is assigned on rank 0 is never copied to the other ranks unless it is explicitly communicated (for example, broadcast from rank 0). That is why the non-zero ranks still see `None` and fall through to `check_dataset(data)` with the raw `clearml://` string.
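To make the gap concrete, here is a small self-contained toy example (assumptions: CPU-only `gloo` backend, two processes, a free local port; none of this code comes from YOLOv5). It first shows a barrier leaving the variable unsynchronized, then one possible way to share it from rank 0 using `torch.distributed.broadcast_object_list`:

```python
# Toy demo: dist.barrier() synchronizes timing only; Python objects are not copied
# between ranks. broadcast_object_list() is one way to explicitly share them.
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'  # assumed free port
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    data_dict = None
    if rank == 0:
        data_dict = {'train': 'images/train', 'val': 'images/val', 'nc': 80}  # only rank 0 fills it

    dist.barrier()  # all ranks meet here, but data_dict is still None on ranks > 0
    print(f'after barrier,   rank {rank}: data_dict = {data_dict}')

    # One possible fix: explicitly broadcast the object from rank 0 to all ranks.
    obj = [data_dict]                       # same-length list on every rank
    dist.broadcast_object_list(obj, src=0)  # rank 0 sends, the others receive
    data_dict = obj[0]
    print(f'after broadcast, rank {rank}: data_dict = {data_dict}')

    dist.destroy_process_group()


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)
```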
I would recommend checking the PyTorch DDP documentation and potentially reaching out to the PyTorch community or forums for further insights into DDP synchronization intricacies. Additionally, if you believe this issue to be YOLOv5-specific, you might consider opening a discussion or issue in the YOLOv5 repository for the Ultralytics team to weigh in on the matter.
Thank you for your diligence in investigating this issue, and I hope this additional information helps in troubleshooting the synchronization problem you've encountered.