
Partition support

Open pierrot0 opened this issue 2 years ago • 0 comments

There is already a howto about splits (https://github.com/mlcommons/croissant/blob/main/docs/howto/specify-splits.md) and an example (https://github.com/mlcommons/croissant/blob/main/datasets/coco2014/metadata.json).

However, we also want to support other types of partitions, namely dated partitions and languages (e.g., Wikipedia).

Currently, the validator / loader has no support for partitions. We should make sure it is possible to retrieve a single partition (or a few) and download only the required files. We should also make sure it is possible to retrieve many partitions at once (for example, more than one language).
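
To make the requirement concrete, here is a minimal sketch of partition-aware file selection in plain Python. It does not use any existing Croissant API: the file listing, the `<dataset>/<date>/<language>/<file>` layout, the `PARTITION_RE` pattern, and the `select_files` helper are all hypothetical and only illustrate the behaviour we would want from the loader (pick one or several partition values, then fetch only the matching files).

```python
import re

# Hypothetical file listing; paths are made up for illustration only.
FILES = [
    "wikipedia/20230601/en/part-000.parquet",
    "wikipedia/20230601/fr/part-000.parquet",
    "wikipedia/20230701/en/part-000.parquet",
    "wikipedia/20230701/de/part-000.parquet",
]

# Assumed layout: <dataset>/<date>/<language>/<file>.
PARTITION_RE = re.compile(
    r"^[^/]+/(?P<date>\d{8})/(?P<language>[a-z]+)/[^/]+$"
)


def select_files(files, **wanted):
    """Return the files whose partition values match every requested key.

    Each keyword argument maps a partition name (here "date" or
    "language") to the set of accepted values. With no keyword
    arguments, every file that follows the layout is returned.
    """
    selected = []
    for path in files:
        match = PARTITION_RE.match(path)
        if match is None:
            continue  # Path does not follow the assumed layout.
        values = match.groupdict()
        if all(values.get(key) in allowed for key, allowed in wanted.items()):
            selected.append(path)
    return selected


# One language, all dates.
english_only = select_files(FILES, language={"en"})

# Several languages, restricted to a single dated partition.
june_en_fr = select_files(FILES, language={"en", "fr"}, date={"20230601"})
```

Only the paths returned by `select_files` would then need to be downloaded, which covers both the single-partition and the multi-partition cases described above.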

There is no existing howto page for partitions, but I think we need one.

pierrot0, Jun 28 '23 13:06