[FR] CLI, Split Dataset Into Train/Test/Val on Export
Proposal Summary
The ability to split an existing dataset into train/test/val sets on export using the CLI
- supporting both modes: deterministically, based on some hash of the file or of the file name, and randomly
Motivation
Downloaded data is often not split into train/test/val sets. This feature would make it easy to do so with the CLI.
What areas of FiftyOne does this feature affect?
- [ ] App: FiftyOne application
- [x] Core: Core `fiftyone` Python library
- [ ] Server: FiftyOne server
Details
In the summary, two splitting approaches are mentioned that should be implemented:
- Split randomly based on a PRNG. This is controlled with a `--seed` parameter, but if the underlying dataset changes, then the output dataset also changes in unexpected ways.
- Split based on the hash of the files or of the file names relative to the dataset root. This is advantageous because if data is added to the underlying dataset, then on export the same files are always placed in the same train/test/val split.
  - An alternative name for this approach could be sharding
  - To implement this, `n` bins/shards can be created. The size of a split is then the number of bins assigned to that split divided by the total number of bins. Images/datapoints can be assigned to a bin based on the numerical value of the hash modulo the number of bins (see the sketch after this list)
  - An example of this can be found in the YT8M download scripts
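To make the bin assignment concrete, here is a minimal sketch (plain Python, not FiftyOne API; the bin count, split boundaries, and example path are all illustrative assumptions):

```python
import hashlib

n_bins = 100  # bins 0-79 -> train, 80-89 -> test, 90-99 -> val (80/10/10)
name = "images/img_001.jpg"  # hypothetical path, relative to the dataset root

# stable hash of the relative path -> bin index in [0, n_bins)
bin_idx = int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16) % n_bins

# map the bin to a split
split = "train" if bin_idx < 80 else ("test" if bin_idx < 90 else "val")
```

Because the bin index depends only on the file itself, re-exporting after new data is added never moves existing files between splits.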
Willingness to contribute
The FiftyOne Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- [ ] Yes. I can contribute this feature independently.
- [ ] Yes. I would be willing to contribute this feature with guidance from the FiftyOne community.
- [x] No. I cannot contribute this feature at this time.
Hi @nmichlo 👋
Just to clarify, this is possible via Python today using `random_split()`.
This method is not exposed in the `fiftyone` CLI yet, however.
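For reference, the Python workflow looks roughly like this (a minimal sketch; the dataset name, fractions, and tag names are illustrative):

```python
import fiftyone as fo
import fiftyone.utils.random as four

dataset = fo.load_dataset("my-dataset")  # hypothetical dataset name

# tags each sample "train"/"val"/"test" according to the given fractions
four.random_split(dataset, {"train": 0.8, "val": 0.1, "test": 0.1}, seed=51)

train_view = dataset.match_tags("train")
```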
@brimoor wouldn't `random_split()` assign items to different splits if more data is added over time, even if a seed is specified?
Maybe a `deterministic_split()` could be added, where a hash is computed from some property of the sample and used to place the sample in the desired split? That way, if more data is added to a dataset over time, existing samples are always placed in the same split.
Even if there is no desire for this deterministic approach, it would still be great to expose the random approach via the CLI, with support for common formats like YOLO.
Some pseudocode:

```python
import hashlib

split_sizes = {
    'train': 80,
    'test': 10,
    'val': 10,
}
total = sum(split_sizes.values())
buckets = [[] for _ in range(total)]

# place each sample in a bucket based on the hash of a stable unique property
for sample in dataset:
    uid = str(sample.some_unique_property)
    digest = hashlib.md5(uid.encode('utf-8')).hexdigest()
    buckets[int(digest, 16) % total].append(sample)

# group consecutive runs of buckets into splits
splits, i = {}, 0
for k, size in split_sizes.items():
    splits[k] = [
        sample
        for bucket in buckets[i:i + size]
        for sample in bucket
    ]
    i += size
# return splits
```
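Once the splits are computed, each one could be exported in a common format such as YOLO. A rough sketch using FiftyOne's export API, assuming the `splits` dict from the pseudocode above (the export directory is illustrative):

```python
import fiftyone as fo

# tag each sample with its split name so it can be selected for export
for split, samples in splits.items():
    for sample in samples:
        sample.tags.append(split)
        sample.save()

# export each split into a YOLOv5-style directory layout
for split in splits:
    dataset.match_tags(split).export(
        export_dir="/path/to/yolo-export",
        dataset_type=fo.types.YOLOv5Dataset,
        split=split,
    )
```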
Any updates on this?