
Consider spreading the data into multiple directories

Open cesarsouza opened this issue 7 years ago • 5 comments

Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale as the dataset grows. Depending on the file system, even listing the directory contents with ls can become slow beyond roughly 10,000 files.

Another reason is that the current layout may prevent the files from being queried through GitHub's developer API in the future. I am building an interface to the dataset that can automatically download, query and organize it into training and testing sets without first cloning the repository with git. However, the API limits the number of files that can be retrieved, and beyond that limit the only option would be to clone the repository and retrieve the files manually.
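To make the API concern concrete, here is a minimal sketch of listing the recordings through GitHub's repository contents endpoint. It assumes the `master` branch and the `name`, `type` and `download_url` fields of the directory listing; the helper names (`contents_url`, `fetch_listing`, `wav_files`) are hypothetical and not part of any existing interface. GitHub's documentation has described a cap of 1,000 entries per directory for this endpoint, so treat the exact limit as something to verify.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def contents_url(owner, repo, path, ref="master"):
    """Build the URL of the contents endpoint for one directory (branch name is an assumption)."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{path}?ref={ref}"


def fetch_listing(url):
    """Download and decode a directory listing. Performs a network request."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wav_files(listing):
    """Map file name -> download URL for every .wav file in a listing."""
    return {
        entry["name"]: entry["download_url"]
        for entry in listing
        if entry["type"] == "file" and entry["name"].endswith(".wav")
    }
```

Calling `wav_files(fetch_listing(contents_url("Jakobovski", "free-spoken-digit-dataset", "recordings")))` would return at most as many files as the endpoint allows for one directory, which is why smaller directories matter here.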

Regards, Cesar

cesarsouza, Oct 12 '17 19:10

How would you like to organize the recordings?

Jakobovski, Oct 13 '17 08:10

I would say that the simplest way would be to organize them hierarchically as recordings/<digit>/<speaker>/<digit>_<speaker>_<variation>.wav.
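A rough sketch of how the move could be scripted, assuming every file in the flat directory follows the `<digit>_<speaker>_<variation>.wav` naming above. The function names `hierarchical_path` and `reorganize` are hypothetical, and in a git checkout `git mv` would be the better tool for preserving history.

```python
import shutil
from pathlib import Path


def hierarchical_path(filename):
    """Return <digit>/<speaker>/<original file name> for one recording."""
    stem = Path(filename).stem
    # The digit is the first field and the variation the last one;
    # whatever lies in between is treated as the speaker name.
    digit, rest = stem.split("_", 1)
    speaker, _variation = rest.rsplit("_", 1)
    return Path(digit) / speaker / Path(filename).name


def reorganize(flat_dir, new_root):
    """Move every .wav file from flat_dir into the nested layout under new_root."""
    moved = 0
    for wav in sorted(Path(flat_dir).glob("*.wav")):
        target = Path(new_root) / hierarchical_path(wav.name)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(wav), str(target))
        moved += 1
    return moved
```

For example, `7_jackson_32.wav` would end up at `recordings/7/jackson/7_jackson_32.wav` when `new_root` is `recordings`.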

cesarsouza, Oct 13 '17 18:10

+1. Are the files also added using Git LFS?

Mistobaan, Aug 14 '19 19:08

+1. I suggest a more general structure, common in computer vision datasets such as ImageNet: recordings/<digit>/<speaker>_<variation>.wav, following the pattern <data_root>/<class_label>/<id>.<ext>.
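One practical benefit of the class-folder pattern is that a loader can read the label from the parent directory name alone. A minimal sketch, assuming the current files are named `<digit>_<speaker>_<variation>.wav`; `class_folder_path` and `list_samples` are hypothetical names used only for illustration.

```python
from pathlib import Path


def class_folder_path(filename):
    """Map <digit>_<speaker>_<variation>.wav to <digit>/<speaker>_<variation>.wav."""
    name = Path(filename)
    digit, sample_id = name.stem.split("_", 1)
    return Path(digit) / (sample_id + name.suffix)


def list_samples(data_root):
    """Return (path, class_label) pairs, taking the label from the folder name."""
    return [
        (wav, wav.parent.name)
        for wav in sorted(Path(data_root).glob("*/*.wav"))
    ]
```

With this layout `7_jackson_32.wav` becomes `7/jackson_32.wav`, and `list_samples("recordings")` yields the label `"7"` for it without parsing the file name.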

dansuh17, Oct 02 '19 07:10

@dansuh17 Feel free to contribute and I will accept the MR

Jakobovski, Oct 02 '19 07:10