free-spoken-digit-dataset
Consider spreading the data into multiple directories
Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale once the dataset becomes larger. Depending on the file system, even listing the directory contents with `ls` can become burdensome after around 10,000 files.
Another reason is that the current layout may prevent the files from being queried through GitHub's developer API in the future. I am building an interface to the dataset that can automatically download, query, and organize it into training and testing sets without first cloning the repository with git. However, the API limits the number of files that can be retrieved from a single directory, and beyond that limit the only option would be to clone the repository and retrieve the files manually.
Regards, Cesar
How would you like to organize the recordings?
I would say that the simplest way would be to organize them hierarchically as `recordings/<digit>/<speaker>/<digit>_<speaker>_<variation>.wav`.
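To make the proposal concrete, here is a minimal sketch of how a flat file name could be mapped to the hierarchical path. It assumes the existing names follow `<digit>_<speaker>_<variation>.wav`; the function name `nested_path` and the example file name are only illustrative and not part of the repository.

```python
from pathlib import PurePosixPath


def nested_path(flat_name: str, root: str = "recordings") -> PurePosixPath:
    """Map '<digit>_<speaker>_<variation>.wav' to
    '<root>/<digit>/<speaker>/<digit>_<speaker>_<variation>.wav'.

    Hypothetical helper: assumes the digit comes before the first underscore
    and the variation after the last one.
    """
    name = PurePosixPath(flat_name).name   # drop any leading directory
    stem = PurePosixPath(name).stem        # drop the '.wav' extension
    digit, rest = stem.split("_", 1)       # digit is the first field
    speaker, _variation = rest.rsplit("_", 1)  # variation is the last field
    return PurePosixPath(root) / digit / speaker / name


# Illustrative file name only
print(nested_path("7_jackson_32.wav"))
```

The file name itself is left unchanged, so any code that parses the digit or speaker from the name would keep working after the move.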
+1, are the files also added using git lfs?
+1, I suggest a more general structure commonly used in many computer vision datasets (like ImageNet): `recordings/<digit>/<speaker>_<variation>.wav`, following the structure `<data_root>/<class_label>/<id>.<ext>`.
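As a rough sketch of what this layout would buy, the class label can then be read straight from the parent directory instead of being parsed out of the file name. The helper names `class_folder_path` and `label_of` are hypothetical, and the code assumes the current names follow `<digit>_<speaker>_<variation>.wav`.

```python
from pathlib import PurePosixPath


def class_folder_path(flat_name: str, root: str = "recordings") -> PurePosixPath:
    """Map '<digit>_<speaker>_<variation>.wav' to
    '<root>/<digit>/<speaker>_<variation>.wav' (hypothetical helper)."""
    name = PurePosixPath(flat_name).name
    digit, rest = name.split("_", 1)  # rest is '<speaker>_<variation>.wav'
    return PurePosixPath(root) / digit / rest


def label_of(path: str) -> str:
    """Return the class label, i.e. the name of the containing directory."""
    return PurePosixPath(path).parent.name


# Illustrative file name only
new_path = class_folder_path("7_jackson_32.wav")
print(new_path, label_of(str(new_path)))
```

One trade-off compared with the per-speaker layout is that the digit no longer appears in the file name, so a file moved out of its directory loses its label.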
@dansuh17 Feel free to contribute and I will accept the MR