
[ENH] `BaseDatasetCollection` Class for Grouping Datasets

Open jgyasu opened this issue 6 months ago • 9 comments

Currently, when reproducing a benchmark run, the user needs to import the datasets individually, e.g.,

from sktime.datasets import load_unit_test, load_arrow_head

This method is by design not scalable and is inefficient for reproducing benchmarks. Take the example of the bake off redux: this benchmarking study uses 112 datasets, and importing and adding them individually to the benchmark would be painstaking.

Proposed Solution:

BaseDatasetCollection Class:

A BaseDatasetCollection class that defines a consistent interface for dataset collections, extending BaseObject; a sketch follows the list below. This class will:

  • Act as a container for a "group" of datasets.
  • Provide a .load() method returning a dict (or a list?) of dataset objects.
  • Provide a .get_names() method to list all dataset names present in the collection.
  • Include metadata or tags containing the dataset collection information, for example, the source of the collection (research paper or lab), scitypes, etc.
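
A minimal sketch of what this interface could look like, assuming the tags and method names proposed above (none of this exists in sktime yet):

from sktime.base import BaseObject

class BaseDatasetCollection(BaseObject):
    """Hypothetical base class for dataset collections."""

    _tags = {
        "source": None,    # e.g., the research paper or lab behind the collection
        "scitypes": None,  # e.g., ["Panel"]
    }

    def load(self):
        """Return a dict mapping dataset names to dataset objects."""
        raise NotImplementedError("concrete collections must implement load")

    def get_names(self):
        """Return the list of all dataset names present in the collection."""
        return list(self.load().keys())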

Implementing DatasetCollection Classes:

Contributors can extend the base class to create collections of datasets, for example, TSCBakeoffCollection. Whenever a substantial study is done, contributors can create these collection classes grouping the datasets from the study, just as they add estimators.

Example:

from sktime.datasets import load_arrow_head, load_unit_test

class SomeCollection(BaseDatasetCollection):
    def load(self):
        # call the loaders so the dict maps names to loaded dataset objects
        return {
            "unit_test": load_unit_test(),
            "arrow_head": load_arrow_head(),
        }
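
Usage could then look like this, building on the hypothetical sketch above:

collection = SomeCollection()
collection.get_names()  # ["unit_test", "arrow_head"]
datasets = collection.load()  # dict mapping names to loaded dataset objects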

jgyasu avatar Jun 13 '25 15:06 jgyasu

Very interesting idea! I agree that "collection of datasets" is a very important concept that we have been unable to cover so far.

Question: have you thought about alternative designs for this, and how this impacts usage (and ease of usage, understanding, maintenance, etc)?

Off the top of my head, I could also think about adding a "catch-all tag" which can be a list of anything. You could filter this by, say, TSC-bakeoff-2018-dataset. This design does not require adding more base classes, but is it really better? I am still thinking about pros and cons here.
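
For concreteness, a minimal sketch of what filtering on such a catch-all tag could look like, assuming dataset classes expose get_class_tag as sktime objects do; the helper and the "collections" tag are hypothetical:

def datasets_in_collection(dataset_classes, collection):
    """Return the dataset classes whose "collections" tag contains collection."""
    return [
        cls
        for cls in dataset_classes
        if collection in (cls.get_class_tag("collections", []) or [])
    ]

# e.g., datasets_in_collection(all_dataset_classes, "TSC-bakeoff-2018-dataset")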

A third design I can think of is a "polymorphic flyweight" https://refactoring.guru/design-patterns/flyweight - e.g., extending the current dataset class to be able to represent both a single dataset and many. However, it feels very fuzzy how this would work, e.g., via tags that would differ by dataset if there are many.
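
To make this third design slightly more concrete, a rough sketch of a polymorphic dataset class; the class and its names argument are purely hypothetical:

from sktime.datasets import load_UCR_UEA_dataset

class UCRDataset:
    """Hypothetical polymorphic dataset: a single name or a list of names."""

    def __init__(self, names):
        # a single str means one dataset, a list of str means a collection
        self.names = [names] if isinstance(names, str) else list(names)

    def load(self):
        loaded = {name: load_UCR_UEA_dataset(name=name) for name in self.names}
        # single dataset: return the object itself; collection: return the dict
        return next(iter(loaded.values())) if len(loaded) == 1 else loaded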

fkiraly avatar Jun 13 '25 16:06 fkiraly

I could also think about adding a "catch-all tag" which can be a list of anything. You could filter this by, say, TSC-bakeoff-2018-dataset.

Hmm, this is an interesting idea too. It would be less boilerplate than the proposed base class, but I suppose with this approach we can't really store the metadata of the collection. Other than that, the tag-based approach feels a bit ad hoc, because we would need to loop over and filter datasets, while the former approach feels cleaner?

This design does not require adding more base classes

Is adding more base classes a bad practice? If so, why?

how this impacts usage (and ease of usage, understanding, maintenance, etc)?

Well, we can write an extension template, which will make it easy to contribute new "collections". These collections can be documented.

I also think the tag-based approach will be more difficult to manage: whenever a new study is done or a new collection is needed, people will have to add tags to all the datasets making up that collection, while in the class-based approach they can all be added at once.

jgyasu avatar Jun 13 '25 16:06 jgyasu

@jgyasu I was also wondering if using tags in the dataset classes to categorize them, plus a method to retrieve the dataset classes based on tags, wouldn't do the job? In addition, what would be the advantage of having a new class as a container for a collection of datasets, vs a dictionary of datasets?

felipeangelimvieira avatar Jun 13 '25 16:06 felipeangelimvieira

Is adding more base classes a bad practice? If so, why?

Not bad practice per se, but a "contra" that weighs on one side of the scale. If we can avoid more base classes, all other things being equal, that is better. But the properties of these approaches are interconnected, so overall it could still be the best solution in a careful trade-off.

In addition, what would be the advantage of having a new class as a container for a collection of datasets, vs a dictionary of datasets?

It would maybe make maintaining, managing, and searching for collections easier, at least if there are more than, say, 3.

fkiraly avatar Jun 13 '25 19:06 fkiraly

I pasted this conversation into GPT and asked it to weigh the pros and cons:

Summary Table

Aspect               | Tag-Based Approach       | Class-Based Approach
---------------------|--------------------------|-----------------------------
Boilerplate          | Minimal                  | More, but manageable
Ease of maintenance  | Fragile for large sets   | Centralized and structured
Metadata handling    | Poor for collection-wide | Great for collection-wide
Scalability          | Tedious for large sets   | Designed for scale
Ease of usage        | Requires filtering       | Intuitive .load() interface
Discoverability      | Needs external listing   | Built into the class
Extensibility        | Indirect (edit tags)     | Easy (add new class)

Suggested Hybrid

You might even combine both:

  • Use the tag-based system for basic filtering, especially for simple or emergent groupings.
  • Use the class-based system for official, documented collections like benchmarking studies, published challenges, etc.

That way, you avoid bloating either system while benefiting from the strengths of both.
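
A minimal sketch of how the hybrid could look, combining the hypothetical BaseDatasetCollection from above with a catch-all "collections" tag (all names here are illustrative):

# official, documented collection: the class-based route
class TSCBakeoffCollection(BaseDatasetCollection):
    """Hypothetical official collection for a benchmarking study."""

    _tags = {
        # the same tag doubles as the filter key of the tag-based route
        "collections": ["TSC-bakeoff-2018-dataset"],
        "source": "TSC bake off benchmarking study",
    }

    def load(self):
        from sktime.datasets import load_arrow_head, load_unit_test

        # placeholder datasets; a real collection would list all datasets from the study
        return {"unit_test": load_unit_test(), "arrow_head": load_arrow_head()}

Emergent groupings would stay tag-only; once a grouping is official, it gets a collection class, and the shared tag keeps both routes discoverable.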

jgyasu avatar Jun 14 '25 06:06 jgyasu

I pasted this conversation into GPT and asked it to weigh the pros and cons:

I think it is mostly rephrasing the arguments that were already flagged... but the summary is not wrong.

It also does not look at the "single class" design, where a dataset class is polymorphic - single data table or a collection.

fkiraly avatar Jun 14 '25 08:06 fkiraly

"single class" design, where a dataset class is polymorphic - single data table or a collection.

Hmm, interesting. By this, do you mean we can just have a single dataset collection class with different methods under it representing a collection and returning the dict of datasets? But again, we won't be able to store the metadata of a dataset collection.

jgyasu avatar Jun 14 '25 09:06 jgyasu

I saw that sktime already has two files, tsc_dataset_names and tsf_dataset_names, which contain lists of dataset names. I was able to use them to simplify importing and adding datasets to a benchmark run. See below:

from functools import partial

from sktime.datasets import load_UCR_UEA_dataset

# TSC Bakeoff uses the datasets present in multivariate_equal_length
from sktime.datasets.tsc_dataset_names import multivariate_equal_length

datasets = list(multivariate_equal_length)

# wrap each name in a loader callable to pass to the benchmark
dataset_loaders = [partial(load_UCR_UEA_dataset, name=dataset) for dataset in datasets]

# benchmark, cv_splitter, and scorer are assumed to be set up earlier
for dataset_loader in dataset_loaders:
    benchmark.add_task(
        dataset_loader,
        cv_splitter,
        scorer,
    )

I feel like lists like these can do the job?

jgyasu avatar Jun 17 '25 12:06 jgyasu

Internally perhaps, but it requires a lot of "piecing together". Also, what if you only want univariate, or unequal length, datasets? No way to get that metadata easily.

E.g., the current UCR repository has additional multivariate examples, which this batch did not yet have.
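
For illustration, this is the kind of query that flat name lists cannot answer, but a collection carrying per-dataset metadata could; the get_metadata method and the metadata keys are hypothetical:

def names_matching(collection, **criteria):
    """Return names of datasets in a collection whose metadata match criteria.

    Assumes a hypothetical get_metadata method returning a dict of
    per-dataset metadata dicts; the metadata keys are illustrative.
    """
    return [
        name
        for name, meta in collection.get_metadata().items()
        if all(meta.get(key) == value for key, value in criteria.items())
    ]

# e.g., names_matching(TSCBakeoffCollection(), is_univariate=True)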

fkiraly avatar Jun 17 '25 12:06 fkiraly