Dataset upload structuring
Search before asking
- [X] I have searched the HUB issues and found no similar bug report.
HUB Component
Datasets
Bug
Dataset structure shown on HUB
Tested working dataset structure and YAML file
data
└───data-seg20
    ├───data.yaml
    ├───train
    │   ├───images
    │   └───labels
    └───valid
        ├───images
        └───labels
path: ../data-seg20
train: train/images
val: valid/images
test: null
names:
  0: crack
Also mentioned on https://github.com/ultralytics/hub/issues/569#issuecomment-1953054027
Environment
OS Windows-10-10.0.19045-SP0
Environment Windows
Python 3.11.6
Install git
RAM 31.86 GB
CPU Intel Core(TM) i5-10600K 4.10GHz
CUDA 12.1
matplotlib ✅ 3.8.2>=3.3.0
numpy ✅ 1.26.3>=1.22.2
opencv-python ✅ 4.9.0.80>=4.6.0
pillow ✅ 10.2.0>=7.1.2
pyyaml ✅ 6.0.1>=5.3.1
requests ✅ 2.31.0>=2.23.0
scipy ✅ 1.11.4>=1.4.1
torch ✅ 2.1.1+cu121>=1.8.0
torchvision ✅ 0.16.1+cu121>=0.9.0
tqdm ✅ 4.66.1>=4.64.0
psutil ✅ 5.9.7
py-cpuinfo ✅ 9.0.0
thop ✅ 0.1.1-2209072238>=0.1.1
pandas ✅ 2.1.4>=1.1.4
seaborn ✅ 0.13.1>=0.11.0
Minimal Reproducible Example
No response
Additional
I tried again with the tiger pose dataset, which uploaded without issue but then failed with a Timeout error. Retrying immediately raised another timeout error.
NOTE: the tiger pose dataset eventually showed no errors, but I was not observing (or timing) when this occurred.
@Burhan-Q We have error handling in place to manage multiple different formats, but we only suggest the correct one. I am not clear if you are stating that you are not able to upload a dataset format like the example or only that you can upload a dataset formatted differently?
The timeout error suggests that there was an issue connecting to the server; we offer a retry option in the dropdown for those cases.
@kalenmike I was only able to upload a dataset with the structure mentioned in the opening comment. It is not possible to upload a dataset using the layout shown on HUB. I am not the only one who hit this issue; other users have experienced it too, which is how it was brought to my attention.
With respect to the timeout error, I did attempt a retry and it immediately failed again, though I may not have waited long enough before retrying. The timeout error seemingly "resolved itself", as the dataset showed as correctly uploaded some time later without any further action.
One thing that was frustrating about the dataset uploading errors is that there is no indication as to what the error is or what the problem might be. This means that if an upload fails, as a user I have no clue why or what to change/fix. Having some kind of report of what errors occurred would be helpful.
@Burhan-Q There is error reporting; it sounds like you just had the same issue every time. A timeout means there was no response from the server. We also have:
- "YAML Not Found."
- "Multiple YAMLs Found."
- "Zip Formatted Incorrectly."
- "Dataset Empty."
- "YAML Formatting Error."
- "Processing Error."
- "Unable to Reach Server."
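For anyone who wants to catch some of these problems before uploading, here is a minimal local pre-check, written as a sketch only. The function name `precheck_zip`, the image suffix list, and the checks themselves are assumptions for illustration; they reuse the message strings above but are not HUB's actual validation logic.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def precheck_zip(zip_file):
    """Return a list of likely problems in a dataset zip (path or file-like).

    The messages reuse HUB's wording, but the checks are guesses.
    """
    try:
        with zipfile.ZipFile(zip_file) as zf:
            # Skip explicit directory entries, keep only files.
            names = [n for n in zf.namelist() if not n.endswith("/")]
    except zipfile.BadZipFile:
        return ["Zip Formatted Incorrectly."]

    problems = []
    yamls = [n for n in names if PurePosixPath(n).suffix.lower() in {".yaml", ".yml"}]
    if not yamls:
        problems.append("YAML Not Found.")
    elif len(yamls) > 1:
        problems.append("Multiple YAMLs Found.")

    images = [n for n in names if PurePosixPath(n).suffix.lower() in IMAGE_SUFFIXES]
    if not images:
        problems.append("Dataset Empty.")
    return problems


# Usage: an in-memory zip that has an image but no YAML file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data-seg20/train/images/0001.jpg", b"")
print(precheck_zip(buf))  # ['YAML Not Found.']
```

An empty list only means none of these three checks fired; server-side errors such as "Processing Error." or a timeout cannot be predicted locally.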
I may need to run through your issue with you tomorrow.
Also it looks like your dataset did not work because your YAML is not correct. Your YAML is telling us to look back a directory which is why you had to add another directory for it to work.
If you see the example YAML in HUB you will see there is no path key.
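To illustrate the "look back a directory" point, the sketch below resolves a split entry against the `path` key, assuming the key is interpreted relative to the directory that contains the YAML file inside the zip. That assumption and the helper names `resolve_split` and `escapes_archive` are hypothetical; the thread does not show how HUB actually resolves paths.

```python
import posixpath


def resolve_split(yaml_dir, path_key, split):
    """Join the YAML's directory, the optional path key, and a split entry."""
    base = posixpath.join(yaml_dir, path_key or "")
    return posixpath.normpath(posixpath.join(base, split))


def escapes_archive(resolved):
    """True if the resolved path points outside the zip root."""
    return resolved == ".." or resolved.startswith("../")


# YAML at the zip root with path: ../data-seg20 points outside the archive.
at_root = resolve_split("", "../data-seg20", "train/images")
print(at_root, escapes_archive(at_root))

# With an extra data-seg20/ directory, the same key lands back inside it.
nested = resolve_split("data-seg20", "../data-seg20", "train/images")
print(nested, escapes_archive(nested))

# Without a path key, splits resolve next to the YAML file.
print(resolve_split("", None, "train/images"))
```

Under this reading, the extra directory is what keeps `../data-seg20` inside the archive, which matches the explanation above.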
@kalenmike that's the crazy part, the YAML with path: ../data-seg20 did work for me yesterday.
I decided to do some testing and I'm wondering if something was strange in particular in the last few days because all of the iterations I tested below worked without error. I tested changing the directory structure by varying the presence of a subdirectory in the .zip and by changing the directory layout (I call them out as HUB vs YOLO formats) as well as by varying the use of path: ../VisDrone20 vs path: VisDrone20 with the different dataset layouts.
Retesting 2024-02-20
Test 1
- [x] Successfully uploaded to HUB without errors
- use path: ../VisDrone20 in YAML
- includes subdirectory in .zip
- use HUB dataset example structure
Details
VisDrone20.yaml
path: ../VisDrone20
train: images/train
val: images/val
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone20.zip
└───VisDrone20
    ├───visdrone20.yaml
    ├───images
    │   ├───train
    │   └───val
    └───labels
        ├───train
        └───val
Test 2
- [x] Successfully uploaded to HUB without errors
- use path: ../VisDrone20 in YAML
- no subdirectory in .zip
- use HUB dataset example structure
Details
VisDrone20.yaml
path: ../VisDrone20
train: images/train
val: images/val
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone20.zip
├───visdrone20.yaml
├───images
│   ├───train
│   └───val
└───labels
    ├───train
    └───val
Test 3
- [x] Successfully uploaded to HUB without errors
- use path: VisDrone20 in YAML
- no subdirectory in .zip
- use HUB dataset example structure
Details
VisDrone20.yaml
path: VisDrone20
train: images/train
val: images/val
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone20.zip
├───visdrone20.yaml
├───images
│   ├───train
│   └───val
└───labels
    ├───train
    └───val
Test 4
- [x] Successfully uploaded to HUB without errors
- use path: VisDrone20 in YAML
- includes subdirectory in .zip
- use HUB dataset example structure
Details
VisDrone20.yaml
path: VisDrone20
train: images/train
val: images/val
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone20.zip
└───VisDrone20
    ├───visdrone20.yaml
    ├───images
    │   ├───train
    │   └───val
    └───labels
        ├───train
        └───val
Test 5
- [x] Successfully uploaded to HUB without errors
- use path: VisDrone20 in YAML
- includes subdirectory in .zip
- use Ultralytics YOLO dataset structure
Details
VisDrone20.yaml
path: VisDrone20
train: train/images
val: val/images
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone20.zip
└───VisDrone20
    ├───visdrone20.yaml
    ├───train
    │   ├───images
    │   └───labels
    └───val
        ├───images
        └───labels
Test 6
- [x] Successfully uploaded to HUB without errors
- use path: ../VisDrone20 in YAML
- includes subdirectory in .zip
- use Ultralytics YOLO dataset structure
Details
VisDrone20.yaml
path: ../VisDrone20
train: train/images
val: val/images
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone20.zip
└───VisDrone20
    ├───visdrone20.yaml
    ├───train
    │   ├───images
    │   └───labels
    └───val
        ├───images
        └───labels
Test 7
- [x] Successfully uploaded to HUB without errors
- use path: ../VisDrone20 in YAML
- no subdirectory in .zip
- use Ultralytics YOLO dataset structure
Details
VisDrone20.yaml
path: ../VisDrone20
train: train/images
val: val/images
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone20.zip
├───visdrone20.yaml
├───train
│   ├───images
│   └───labels
└───val
    ├───images
    └───labels
Test 8
- [x] Successfully uploaded to HUB without errors
- use path: VisDrone_20 in YAML
- includes subdirectory in .zip
- use Ultralytics YOLO dataset structure
Details
VisDrone20.yaml
path: VisDrone_20
train: train/images
val: val/images
test: null
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
VisDrone20.zip structure
VisDrone_20.zip
├───visdrone20.yaml
├───train
│   ├───images
│   └───labels
└───val
    ├───images
    └───labels
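For repeating this kind of matrix, the variations can be generated rather than assembled by hand. The sketch below is an illustration only: `build_variant` and `LAYOUTS` are hypothetical names, the image and label files are empty placeholders that would need to be replaced with real data before uploading, and the YAML lists a single class for brevity.

```python
import io
import zipfile

# Directory layouts used in the tests: HUB example vs Ultralytics YOLO.
LAYOUTS = {
    "hub": ["images/train", "images/val", "labels/train", "labels/val"],
    "yolo": ["train/images", "train/labels", "val/images", "val/labels"],
}


def build_variant(name, layout, path_key, subdirectory):
    """Return zip bytes for one combination of layout, path key, and subdirectory."""
    root = f"{name}/" if subdirectory else ""
    dirs = LAYOUTS[layout]
    image_dirs = [d for d in dirs if "images" in d]
    train = next(d for d in image_dirs if "train" in d)
    val = next(d for d in image_dirs if "val" in d)
    yaml_text = (
        f"path: {path_key}\ntrain: {train}\nval: {val}\ntest: null\n"
        "names:\n  0: pedestrian\n"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(f"{root}{name.lower()}.yaml", yaml_text)
        for d in dirs:
            ext = "jpg" if "images" in d else "txt"
            zf.writestr(f"{root}{d}/0001.{ext}", b"")  # empty placeholder file
    return buf.getvalue()


# Every combination of layout, path key, and subdirectory.
variants = {
    (layout, path_key, sub): build_variant("VisDrone20", layout, path_key, sub)
    for layout in LAYOUTS
    for path_key in ("../VisDrone20", "VisDrone20")
    for sub in (True, False)
}
print(len(variants))  # 8
```

Keying each archive by its parameters makes it easier to record which combination was uploaded and what HUB reported for it.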
@Burhan-Q To confirm, you are no longer seeing any errors?
We have an example of what a dataset should look like, but we also fix datasets with very common and obvious mistakes. The dataset processing happens after it is requested, so sometimes it can fail without an apparent reason or crash due to excess memory usage. We are constantly optimizing this.
Yeah, I was unable to get an error in testing any of the examples above. I did not document yesterday's attempts as thoroughly, which makes it more difficult to pin down the issue. I think these tests cover most variations, and all were successful.
@kalenmike is it possible for you to enable verbose logging for my HUB account? Something like "log every action for N hours" so there's a more traceable history for testing? To be clear, I'm asking whether it's possible, not requesting a new feature.
@Burhan-Q No, that's not possible.