Dataset upload structuring

Open Burhan-Q opened this issue 1 year ago • 10 comments

Search before asking

  • [X] I have searched the HUB issues and found no similar bug report.

HUB Component

Datasets

Bug

Dataset structure shown on HUB

image

Tested working dataset structure and YAML file

data
└───data-seg20
        ├───data.yaml
        ├───train
        │     ├───images
        │     └───labels
        └───valid
              ├───images
              └───labels

data.yaml

path: ../data-seg20
train: train/images
val: valid/images
test: null
names:
  0: crack

Also mentioned on https://github.com/ultralytics/hub/issues/569#issuecomment-1953054027

Environment

OS                  Windows-10-10.0.19045-SP0
Environment         Windows
Python              3.11.6
Install             git
RAM                 31.86 GB
CPU                 Intel Core(TM) i5-10600K 4.10GHz
CUDA                12.1

matplotlib          ✅ 3.8.2>=3.3.0
numpy               ✅ 1.26.3>=1.22.2
opencv-python       ✅ 4.9.0.80>=4.6.0
pillow              ✅ 10.2.0>=7.1.2
pyyaml              ✅ 6.0.1>=5.3.1
requests            ✅ 2.31.0>=2.23.0
scipy               ✅ 1.11.4>=1.4.1
torch               ✅ 2.1.1+cu121>=1.8.0
torchvision         ✅ 0.16.1+cu121>=0.9.0
tqdm                ✅ 4.66.1>=4.64.0
psutil              ✅ 5.9.7
py-cpuinfo          ✅ 9.0.0
thop                ✅ 0.1.1-2209072238>=0.1.1
pandas              ✅ 2.1.4>=1.1.4
seaborn             ✅ 0.13.1>=0.11.0

Minimal Reproducible Example

No response

Additional

I tried again with the tiger pose dataset. The upload itself went through without issue, but the dataset then failed with a Timeout error. Retrying immediately raised another timeout error.

image

Burhan-Q avatar Feb 19 '24 20:02 Burhan-Q
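
For reference, the working layout above keeps each images directory next to a labels directory. A minimal sketch of how a label file can be located from an image file under that convention, by swapping the last "images" path component for "labels". The function name is hypothetical, and whether HUB applies exactly this rule is an assumption; it mirrors the pairing convention used by Ultralytics YOLO datasets.

```python
from pathlib import PurePosixPath


def image_to_label_path(image_path: str) -> str:
    """Map an image path to its expected label path (hypothetical helper).

    Swaps the LAST path component named "images" for "labels" and changes
    the file suffix to ".txt". Raises ValueError if no "images" component exists.
    """
    parts = list(PurePosixPath(image_path).parts)
    # Index of the last component that is exactly "images".
    idx = len(parts) - 1 - parts[::-1].index("images")
    parts[idx] = "labels"
    return str(PurePosixPath(*parts).with_suffix(".txt"))


# Both layouts discussed in this thread pair up the same way.
print(image_to_label_path("data-seg20/train/images/0001.jpg"))
print(image_to_label_path("VisDrone20/images/train/0001.jpg"))
```

Under this rule, train/images with train/labels and images/train with labels/train are equivalent, which may be why both layouts can be accepted.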

NOTE: the tiger pose dataset eventually showed no errors, but I was not observing (or timing) when this occurred.

image

Burhan-Q avatar Feb 19 '24 20:02 Burhan-Q

@Burhan-Q We have error handling in place to manage multiple different formats, but we only suggest the correct one. Are you saying that you cannot upload a dataset formatted like the example, or only that you can also upload a dataset formatted differently?

The timeout error suggests that there was an issue connecting to the server. In those cases we offer a retry option from the dropdown.

kalenmike avatar Feb 20 '24 07:02 kalenmike

@kalenmike I was only able to upload a dataset with the structure mentioned in the opening comment. It is not possible to upload a dataset using the layout shown on HUB. I am not the only one with this issue; other users have experienced it too, which is how it was brought to my attention.

With respect to the timeout error, I did attempt a retry and it immediately failed again, though I may not have waited long enough before retrying. The timeout error seemingly "resolved itself": some time after uploading, the dataset showed as correctly uploaded without any further action from me.

Burhan-Q avatar Feb 20 '24 10:02 Burhan-Q
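
Since an immediate retry hit the same timeout, spacing retries out may help when scripting this kind of check. A minimal sketch of retrying with exponential backoff; `retry_with_backoff` and its parameters are hypothetical names, and this is not how the HUB retry option in the dropdown is implemented.

```python
import time


def retry_with_backoff(action, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each timeout.

    `action` is any zero-argument callable that raises TimeoutError on failure.
    `sleep` is injectable so the waits can be observed or skipped in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last timeout
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")


# Demo with a stand-in action that times out twice, then succeeds.
state = {"calls": 0}


def flaky_upload():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("no response from server")
    return "uploaded"


waits = []
print(retry_with_backoff(flaky_upload, sleep=waits.append), waits)
```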

One thing that was frustrating about the dataset uploading errors is that there is no indication as to what the error is or what the problem might be. This means that if an upload fails, as a user I have no clue why or what to change/fix. Having some kind of report of what errors occurred would be helpful.

Burhan-Q avatar Feb 20 '24 10:02 Burhan-Q

@Burhan-Q There is error reporting; it sounds like you just hit the same issue every time. A timeout means there was no response from the server. We also have:

  • "YAML Not Found."
  • "Multiple YAMLs Found."
  • "Zip Formatted Incorrectly."
  • "Dataset Empty."
  • "YAML Formatting Error."
  • "Processing Error."
  • "Unable to Reach Server."

I may need to run through your issue with you tomorrow.

kalenmike avatar Feb 20 '24 12:02 kalenmike
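
Some of the errors listed above can be caught locally before uploading. A rough sketch of such a pre-check follows. The message strings are taken from the list above, but the conditions that trigger them here are guesses, not HUB's actual validation logic; `precheck_zip` and `IMAGE_SUFFIXES` are hypothetical names.

```python
import zipfile
from typing import Optional

# Assumed set of image extensions; the real accepted set may differ.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".bmp")


def precheck_zip(zip_source) -> Optional[str]:
    """Return a HUB-style error message for an obvious problem, else None.

    `zip_source` is a path or a binary file-like object.
    """
    try:
        with zipfile.ZipFile(zip_source) as zf:
            # Keep file entries only; directory entries end with "/".
            names = [n for n in zf.namelist() if not n.endswith("/")]
    except zipfile.BadZipFile:
        return "Zip Formatted Incorrectly."

    yamls = [n for n in names if n.lower().endswith((".yaml", ".yml"))]
    if not yamls:
        return "YAML Not Found."
    if len(yamls) > 1:
        return "Multiple YAMLs Found."
    if not any(n.lower().endswith(IMAGE_SUFFIXES) for n in names):
        return "Dataset Empty."
    return None
```

Calling `precheck_zip("VisDrone20.zip")` before uploading would return one of the messages, or None if none of these simple checks fail. It does not parse the YAML, so it cannot detect a "YAML Formatting Error."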

Also it looks like your dataset did not work because your YAML is not correct. Your YAML is telling us to look back a directory which is why you had to add another directory for it to work.

If you see the example YAML in HUB you will see there is no path key.

image

kalenmike avatar Feb 20 '24 12:02 kalenmike
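
To illustrate the point about the path key, here is a small sketch of how a relative path value changes where the train directory is looked up. It assumes that path is resolved relative to the directory holding the YAML file inside the archive; that is an assumption made for illustration, not confirmed HUB behavior, and `resolve_split` is a hypothetical helper.

```python
import posixpath
from typing import Optional


def resolve_split(yaml_location: str, path_value: Optional[str], split_value: str) -> str:
    """Resolve a split directory (e.g. train) inside the archive.

    yaml_location: where the YAML sits in the zip, e.g. "data-seg20/data.yaml"
    path_value:    the YAML's `path` key, or None when the key is absent
    split_value:   the YAML's `train` / `val` value
    """
    base = posixpath.dirname(yaml_location)
    root = posixpath.join(base, path_value) if path_value else base
    return posixpath.normpath(posixpath.join(root, split_value))


# YAML inside a subdirectory: "../data-seg20" steps out and back in again.
print(resolve_split("data-seg20/data.yaml", "../data-seg20", "train/images"))
# YAML at the zip root: the same value points outside the archive.
print(resolve_split("data.yaml", "../data-seg20", "train/images"))
# No path key: splits are found next to the YAML.
print(resolve_split("data.yaml", None, "train/images"))
```

Under this assumption, a leading ".." only resolves to something inside the archive when the zip has an extra top-level directory, which matches the explanation above.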

@kalenmike that's the crazy part, the YAML with path: ../data-seg20 did work for me yesterday.

I decided to do some testing, and I'm wondering if something in particular was off over the last few days, because every iteration I tested below worked without error. I varied three things: whether the .zip contains a subdirectory, the directory layout (I call these the HUB and YOLO formats), and whether the YAML uses path: ../VisDrone20 or path: VisDrone20.

Retesting 2024-02-20

Test 1

  • [x] Successfully uploaded to HUB without errors
  • use path: ../VisDrone20 in YAML
  • includes subdirectory in .zip
  • use HUB dataset example structure
Details

VisDrone20.yaml

path: ../VisDrone20
train: images/train
val: images/val
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone20.zip
└───VisDrone20
        ├───visdrone20.yaml
        ├───images
        │     ├───train
        │     └───val
        └───labels
              ├───train
              └───val

Test 2

  • [x] Successfully uploaded to HUB without errors
  • use path: ../VisDrone20 in YAML
  • no subdirectory in .zip
  • use HUB dataset example structure
Details

VisDrone20.yaml

path: ../VisDrone20
train: images/train
val: images/val
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone20.zip
    ├───visdrone20.yaml
    ├───images
    │     ├───train
    │     └───val
    └───labels
          ├───train
          └───val

Test 3

  • [x] Successfully uploaded to HUB without errors
  • use path: VisDrone20 in YAML
  • no subdirectory in .zip
  • use HUB dataset example structure
Details

VisDrone20.yaml

path: VisDrone20
train: images/train
val: images/val
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone20.zip
    ├───visdrone20.yaml
    ├───images
    │     ├───train
    │     └───val
    └───labels
          ├───train
          └───val

Test 4

  • [x] Successfully uploaded to HUB without errors
  • use path: VisDrone20 in YAML
  • includes subdirectory in .zip
  • use HUB dataset example structure
Details

VisDrone20.yaml

path: VisDrone20
train: images/train
val: images/val
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone20.zip
└───VisDrone20
        ├───visdrone20.yaml
        ├───images
        │     ├───train
        │     └───val
        └───labels
              ├───train
              └───val

Test 5

  • [x] Successfully uploaded to HUB without errors
  • use path: VisDrone20 in YAML
  • includes subdirectory in .zip
  • use Ultralytics YOLO dataset structure
Details

VisDrone20.yaml

path: VisDrone20
train: train/images
val: val/images
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone20.zip
└───VisDrone20
        ├───visdrone20.yaml
        ├───train
        │     ├───images
        │     └───labels
        └───val
              ├───images
              └───labels

Test 6

  • [x] Successfully uploaded to HUB without errors
  • use path: ../VisDrone20 in YAML
  • includes subdirectory in .zip
  • use Ultralytics YOLO dataset structure
Details

VisDrone20.yaml

path: ../VisDrone20
train: train/images
val: val/images
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone20.zip
└───VisDrone20
        ├───visdrone20.yaml
        ├───train
        │     ├───images
        │     └───labels
        └───val
              ├───images
              └───labels

Test 7

  • [x] Successfully uploaded to HUB without errors
  • use path: ../VisDrone20 in YAML
  • no subdirectory in .zip
  • use Ultralytics YOLO dataset structure
Details

VisDrone20.yaml

path: ../VisDrone20
train: train/images
val: val/images
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone20.zip
        ├───visdrone20.yaml
        ├───train
        │     ├───images
        │     └───labels
        └───val
              ├───images
              └───labels

Test 8

  • [x] Successfully uploaded to HUB without errors
  • use path: VisDrone_20 in YAML
  • includes subdirectory in .zip
  • use Ultralytics YOLO dataset structure
Details

VisDrone20.yaml

path: VisDrone_20
train: train/images
val: val/images
test: null

names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

VisDrone20.zip structure

VisDrone_20.zip
        ├───visdrone20.yaml
        ├───train
        │     ├───images
        │     └───labels
        └───val
              ├───images
              └───labels

Burhan-Q avatar Feb 20 '24 14:02 Burhan-Q
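
The variations tested above combine three factors: the path value, whether the .zip has a top-level subdirectory, and the directory layout. A sketch of generating every combination as an in-memory zip so a matrix like this can be rebuilt consistently. The function names are hypothetical, the class list is cut to one entry, and the image and label files are empty placeholders, so real files would be needed for an actual upload.

```python
import io
import itertools
import zipfile


def build_variant(path_value: str, subdir: bool, layout: str, name: str = "VisDrone20") -> bytes:
    """Build one test archive and return its bytes.

    layout: "hub"  -> images/train, labels/train
            "yolo" -> train/images, train/labels
    """
    prefix = f"{name}/" if subdir else ""
    if layout == "hub":
        splits = {"train": "images/train", "val": "images/val"}
    else:
        splits = {"train": "train/images", "val": "val/images"}
    yaml_text = (
        f"path: {path_value}\n"
        f"train: {splits['train']}\n"
        f"val: {splits['val']}\n"
        "test: null\n"
        "names:\n  0: pedestrian\n"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(f"{prefix}{name}.yaml", yaml_text)
        for image_dir in splits.values():
            label_dir = image_dir.replace("images", "labels")
            zf.writestr(f"{prefix}{image_dir}/0001.jpg", b"")  # placeholder image
            zf.writestr(f"{prefix}{label_dir}/0001.txt", b"")  # placeholder label
    return buf.getvalue()


def all_variants(name: str = "VisDrone20"):
    """Yield ((path_value, subdir, layout), zip_bytes) for every combination."""
    factors = itertools.product([f"../{name}", name], [True, False], ["hub", "yolo"])
    for path_value, subdir, layout in factors:
        yield (path_value, subdir, layout), build_variant(path_value, subdir, layout, name)


for combo, data in all_variants():
    print(combo, len(data), "bytes")
```

Two path values, two subdirectory options, and two layouts give eight combinations, so recording which generated archive was uploaded makes a failed attempt easier to reproduce later.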

@Burhan-Q To confirm, you are no longer seeing any errors?

We have an example of what a dataset should look like, but we also fix datasets with very common and obvious mistakes. The dataset processing happens after it is requested, so it can sometimes fail for no apparent reason or crash due to excess memory usage. We are constantly optimizing this.

kalenmike avatar Feb 20 '24 14:02 kalenmike

Yeah, I was unable to get an error in testing any of the examples above. I did not document yesterday's attempts as thoroughly, which makes it more difficult to pin down the issue. I think these tests cover most variations, and all were successful.

@kalenmike is it possible for you to enable verbose logging for my HUB account? Something like "log every action for N hours" so there's a more traceable history for testing? To be clear, I'm asking whether it's possible, not requesting a new feature.

Burhan-Q avatar Feb 20 '24 14:02 Burhan-Q

@Burhan-Q No, that's not possible.

kalenmike avatar Feb 22 '24 20:02 kalenmike