
Add note about loading image / audio files to docs

Open lewtun opened this issue 1 year ago • 7 comments

This PR adds a small note about how to load image / audio datasets that have multiple splits in their dataset structure.

Related forum thread: https://discuss.huggingface.co/t/loading-train-and-test-splits-with-audiofolder/22447

cc @NielsRogge

lewtun avatar Sep 02 '22 10:09 lewtun

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Thanks for the feedback @polinaeterna ! I've reworded the docs a bit to integrate your comments and this should be ready for another review :)

lewtun avatar Sep 05 '22 09:09 lewtun

I've just realized that there is another open PR about the audio documentation, #4872, and there the more detailed description of how to use audiofolder is moved to another section ("Create an audio dataset").

Ah yes, let's add a comment to #4872 - that will be simpler than the alternatives :)

lewtun avatar Sep 07 '22 10:09 lewtun

@polinaeterna @lhoestq What do you think about adding support for the metadata format from Kaggle (one metadata file for each split with the name equal to the split name) to ImageFolder/AudioFolder? I also think we can relax some requirements a bit by:

  • allowing filename as the name of the main metadata column (currently, only file_path is allowed)
  • not requiring that the features of all the given metadata files are equal. Instead, we can have a soft check by using _check_if_features_can_be_aligned + _align_features. The rationale is that train/val metadata often has extra columns compared to test metadata.

These changes would allow us to load the Kaggle dataset linked in the forum thread without any "interventions".

PS: this metadata format for ImageFolder was also proposed by @abhishekkrthakur initially.

mariosasko avatar Sep 14 '22 17:09 mariosasko

Can you give more details about the Kaggle format ? I'm down to discuss it in a separate issue if you don't mind.

> allowing filename as the name of the main metadata column (currently, only file_path is allowed)

filename refers to the name of the file, so there's no logic about relative paths or directories. If I recall correctly, this is what we're doing right now, so why not.

> not requiring that the features of all the given metadata files are equal. Instead, we can have a soft check by using _check_if_features_can_be_aligned + _align_features. The rationale is that train/val metadata often has extra columns compared to test metadata.

+1, and we can set the missing features to None
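
A minimal sketch of the idea discussed here, assuming each split's metadata is a plain list of dicts: check that columns shared between splits agree on their type, then give every split the union of all columns, with missing values set to None. The names `can_be_aligned` and `align_rows` are hypothetical and only illustrate the behaviour; they are not the library's `_check_if_features_can_be_aligned` or `_align_features`.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def can_be_aligned(schemas: List[Dict[str, str]]) -> bool:
    """Soft check: a column may be absent from a split, but where it
    is present it must have the same type everywhere."""
    seen: Dict[str, str] = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if seen.setdefault(name, dtype) != dtype:
                return False
    return True


def align_rows(splits: Dict[str, List[Row]]) -> Dict[str, List[Row]]:
    """Give every split the union of all columns, filling gaps with None."""
    columns: List[str] = []
    for rows in splits.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return {
        split: [{name: row.get(name) for name in columns} for row in rows]
        for split, rows in splits.items()
    }


# Illustrative data: the test metadata lacks the "label" column.
splits = {
    "train": [{"file_name": "img1.jpeg", "label": "cat"}],
    "test": [{"file_name": "img1.jpeg"}],
}
aligned = align_rows(splits)
```

With this approach a test split without labels still loads next to a labelled train split, and a genuine type conflict in a shared column is still rejected.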

lhoestq avatar Sep 15 '22 15:09 lhoestq

I'm not sure if this is worth opening a new issue :).

What I mean by the Kaggle format is the structure like this one (the name of a metadata file is equal to the directory it "references"):

- train
    - img1.jpeg
    - img2.jpeg
    - ...
- test
    - img1.jpeg
    - img2.jpeg
    - ...   
- train.csv
- test.csv

mariosasko avatar Sep 16 '22 16:09 mariosasko

Sounds nice !

lhoestq avatar Sep 16 '22 16:09 lhoestq

@mariosasko +1 to allowing different feature sets and metadata filenames that correspond to split names

Regarding the filename column: right now it's called file_name, which is not nice because it is in fact a relative file path, so I think it should be file_path (and I don't know why I didn't think about it before the release...)

polinaeterna avatar Sep 23 '22 13:09 polinaeterna

@lewtun do you mind if I close this pull request, since I've integrated your changes in https://github.com/huggingface/datasets/pull/4872 ? (it doesn't have a link to a Kaggle example though)

polinaeterna avatar Sep 23 '22 13:09 polinaeterna