
[FR] CLI, Split Dataset Into Train/Test/Val on Export

Open nmichlo opened this issue 3 years ago • 2 comments

Proposal Summary

The ability to split an existing dataset into train/test/val on export using the CLI

  • ideally both deterministically (based on some hash of the file or of the file name) and randomly

Motivation

Downloaded data is often not split into train/test/val sets. This feature would make it easy to do so with the CLI.

What areas of FiftyOne does this feature affect?

  • [ ] App: FiftyOne application
  • [x] Core: Core fiftyone Python library
  • [ ] Server: FiftyOne server

Details

As mentioned in the summary, two splitting approaches should be supported:

  1. Split randomly using a PRNG -- this would be controlled with a --seed parameter, but if the underlying dataset changes, the exported splits also change in unexpected ways.

  2. Split based on the hash of the files or on the hash of the file names relative to the root of the dataset -- this is advantageous because if data is added to the underlying dataset, the same files are always placed in the same train/test/val split on export.

    • An alternative name for this approach could be sharding
    • To implement this, n bins/shards can be created. The size of a split is then the number of bins assigned to that split divided by the total number of bins. Each image/datapoint is assigned to a bin based on the numerical value of its hash modulo the number of bins (see the sketch below this list).
    • An example of this can be found in the YT8M download scripts
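
For illustration, a minimal sketch of the bin assignment (MD5, the 100-bin 80/10/10 layout, and hashing the path relative to the dataset root are just example choices):

import hashlib

NUM_BINS = 100  # e.g. 80/10/10 split: bins 0-79 train, 80-89 test, 90-99 val

def assign_split(relative_path: str) -> str:
    # stable hash of the file's path relative to the dataset root
    digest = hashlib.md5(relative_path.encode('utf-8')).hexdigest()
    bin_idx = int(digest, 16) % NUM_BINS
    if bin_idx < 80:
        return 'train'
    if bin_idx < 90:
        return 'test'
    return 'val'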

Willingness to contribute

The FiftyOne Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • [ ] Yes. I can contribute this feature independently.
  • [ ] Yes. I would be willing to contribute this feature with guidance from the FiftyOne community.
  • [x] No. I cannot contribute this feature at this time.

nmichlo avatar Aug 08 '22 21:08 nmichlo

Hi @nmichlo 👋

Just to clarify, this is possible via Python today using random_split().

This method is not exposed in the fiftyone CLI yet, however.
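
For example, something like this (the dataset name, split fractions, and seed are just placeholders):

import fiftyone as fo
import fiftyone.utils.random as four

dataset = fo.load_dataset("my-dataset")

# tags each sample with one of the provided split names; a seed makes it repeatable
four.random_split(dataset, {"train": 0.8, "test": 0.1, "val": 0.1}, seed=51)

train_view = dataset.match_tags("train")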

brimoor avatar Aug 11 '22 05:08 brimoor

@brimoor random_split would place existing items in different splits if more data is added over time, even when a seed is specified, right?

Maybe a deterministic_split could be added, where a hash is computed from some property of each sample and used to place the sample in the desired split? That way, if more data is added to a dataset over time, existing samples always end up in the same split.

Even if there is no desire for this deterministic approach, it would still be great to expose the random approach from the CLI, with support for common formats like YOLO.

Some pseudocode:

import hashlib

split_sizes = {
    'train': 80,
    'test': 10,
    'val': 10,
}

# one bucket per unit of split size, e.g. 100 buckets for an 80/10/10 split
num_buckets = sum(split_sizes.values())
buckets = [[] for _ in range(num_buckets)]

# deterministically place each sample in a bucket based on a hash of a stable property
for sample in dataset:
    uid = str(sample.some_unique_property)
    digest = hashlib.md5(uid.encode('utf-8')).hexdigest()
    idx = int(digest, 16) % num_buckets
    buckets[idx].append(sample)

# group consecutive buckets into splits of the requested sizes
splits, i = {}, 0
for k, size in split_sizes.items():
    splits[k] = [
        sample
        for bucket in buckets[i:i + size]
        for sample in bucket
    ]
    i += size

# return splits
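
To go further, the resulting splits could then be exported in a format like YOLO. A rough, untested sketch, assuming the split names have also been applied as sample tags and using placeholder paths/field names:

import fiftyone as fo

# assumes each sample was tagged with its split name, e.g. sample.tags.append(split)
for split in ('train', 'test', 'val'):
    dataset.match_tags(split).export(
        export_dir='/path/to/yolo-dataset',  # placeholder output directory
        dataset_type=fo.types.YOLOv5Dataset,
        split=split,
        label_field='ground_truth',  # placeholder label field
    )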

nmichlo avatar Aug 19 '22 12:08 nmichlo

Any updates on this?

rusmux avatar Jan 28 '23 10:01 rusmux