[ENH] `BaseDatasetCollection` Class for Grouping Datasets
Currently when reproducing a benchmark run, the user needs to import the datasets individually, e.g.,

```python
from sktime.datasets import load_unit_test, load_arrow_head
```
This approach is by design not scalable and is inefficient for reproducing benchmarks. Take the example of the bake off redux: this benchmarking study uses 112 datasets, and importing them and adding them to the benchmark individually would be painstaking.
**Proposed Solution:**

**`BaseDatasetCollection` class:**

A `BaseDatasetCollection` class that defines a consistent interface for dataset collections, extending `BaseObject`. This class will:
- Act as a container for a "group" of datasets.
- Provide a `.load()` method returning a dict (or list?) of dataset objects.
- Provide a `.get_names()` method to list all dataset names present in the collection.
- Include metadata or tags containing the dataset collection information, for example, the source of the collection (research paper or lab), scitypes, etc.
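A minimal sketch of what such a base class could look like (illustrative only; `BaseObject` is sktime's existing base class, while the tag names and method bodies here are assumptions, not existing sktime API):

```python
from sktime.base import BaseObject


class BaseDatasetCollection(BaseObject):
    """Container for a named group of datasets (sketch, not current sktime API)."""

    # hypothetical collection-level tags, e.g., source study and scitypes covered
    _tags = {
        "source": None,
        "scitypes": None,
    }

    def load(self):
        """Return a dict mapping dataset names to dataset objects or loaders."""
        raise NotImplementedError("abstract method, implement in subclass")

    def get_names(self):
        """Return the list of dataset names in the collection."""
        return list(self.load().keys())
```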
**Implementing `DatasetCollection` classes:**

Contributors can extend the base class to create collections of datasets, for example, `TSCBakeoffCollection`. Whenever a substantial study is done, contributors can create a collection class grouping the datasets from that study, just as they add estimators.
**Example:**

```python
from sktime.datasets import load_unit_test, load_arrow_head


class SomeCollection(BaseDatasetCollection):
    def load(self):
        # map dataset names to their loader functions
        return {
            "unit_test": load_unit_test,
            "arrow_head": load_arrow_head,
        }
```
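With such a collection in place, setting up a benchmark run could be a short loop (a sketch; `benchmark`, `cv_splitter`, and `scorer` are assumed to be defined as in the usual sktime benchmarking setup):

```python
collection = SomeCollection()

# add every dataset in the collection as a benchmark task
for name, loader in collection.load().items():
    benchmark.add_task(loader, cv_splitter, scorer)
```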
Very interesting idea! I agree that "collection of datasets" is a very important concept that we have not been able to cover so far.
Question: have you thought about alternative designs for this, and how this impacts usage (and ease of usage, understanding, maintenance, etc)?
Off the top of my head, I could also think about adding a "catch-all tag" which can be a list of anything. You could filter this by, say, `TSC-bakeoff-2018-dataset`. This design does not require adding more base classes, but is it really better? I am still thinking about pros and cons here.
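E.g., something like this (a sketch; `all_dataset_objects` is a hypothetical retrieval function, and the `collections` tag does not currently exist):

```python
# hypothetical: collect all dataset objects whose catch-all "collections" tag
# (a list) contains the requested collection name
bakeoff = [
    ds
    for ds in all_dataset_objects()  # hypothetical registry lookup
    if "TSC-bakeoff-2018-dataset" in ds.get_tag("collections", [], raise_error=False)
]
```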
A third design I can think of is a "polymorphic flyweight" https://refactoring.guru/design-patterns/flyweight - e.g., extending the current dataset class to be able to represent both a single dataset and many. However, it feels very fuzzy how this would work, e.g., via the tags, which will differ by dataset if there are many.
> I could also think about adding a "catch-all tag" which can be a list of anything. You could filter this by, say, `TSC-bakeoff-2018-dataset`.
Hmm, this is an interesting idea too. It would be less boilerplate than the proposed base class, but I suppose with this approach we can't really store the metadata of the collection. Other than that, the tag-based approach feels a bit ad hoc, because we need to loop and filter, while the class-based approach feels cleaner?
> This design does not require adding more base classes
Is adding more base classes bad practice? If so, why?
> how this impacts usage (and ease of usage, understanding, maintenance, etc)?
Well, we can write an extension template, which will make it easy to contribute new "collections". These collections can be documented.
I also think the tag-based approach will be more difficult to manage: whenever a new study is done or a new collection is needed, people will have to add tags to all the datasets making up that collection, while in the class-based approach they can be added all at once.
@jgyasu I was also wondering if using tags in the dataset classes to categorize them, and a method to retrieve the dataset classes based on tags, wouldn't do the job? In addition, what would be the advantage of having a new class as a container for a collection of datasets, vs a dictionary of datasets?
> Is adding more base classes bad practice? If so, why?
Not bad practice per se, but a "contra" that weighs on one side of the scale. If we can avoid more base classes, all other things equal, that is better. But the properties of the approaches are interconnected, so overall it could still be the best solution in a careful trade-off.
> In addition, what would be the advantage of having a new class as a container for a collection of datasets, vs a dictionary of datasets?
It would maybe make maintaining, managing, and searching for collections easier? If there are more than, say, 3.
I pasted this conversation into GPT and asked it to weigh the pros and cons:
**Summary Table**
| Aspect | Tag-Based Approach | Class-Based Approach |
|---|---|---|
| Boilerplate | Minimal | More, but manageable |
| Ease of maintenance | Fragile for large sets | Centralized and structured |
| Metadata handling | Poor for collection-wide | Great for collection-wide |
| Scalability | Tedious for large sets | Designed for scale |
| Ease of usage | Requires filtering | Intuitive `.load()` interface |
| Discoverability | Needs external listing | Built into the class |
| Extensibility | Indirect (edit tags) | Easy (add new class) |
**Suggested Hybrid**
You might even combine both:
- Use the tag-based system for basic filtering, especially for simple or emergent groupings.
- Use the class-based system for official, documented collections like benchmarking studies, published challenges, etc.
That way, you avoid bloating either system while benefiting from the strengths of both.
> I pasted this conversation into GPT and asked it to weigh the pros and cons:
I think it is mostly rephrasing the arguments that were already flagged... but the summary is not wrong.
It also does not look at the "single class" design, where a dataset class is polymorphic - single data table or a collection.
"single class" design, where a dataset class is polymorphic - single data table or a collection.
Hmm, interesting. By this, do you mean we can just have a single dataset class, with different methods under it representing a collection and returning the dict of datasets? But again, we won't be able to store the metadata of a dataset collection.
I saw sktime already has two modules, `tsc_dataset_names` and `tsf_dataset_names`, which contain lists of dataset names. I was able to use them to simplify importing and adding datasets to a benchmark run. See below:
```python
from functools import partial

from sktime.datasets import load_UCR_UEA_dataset
from sktime.datasets.tsc_dataset_names import multivariate_equal_length

# the TSC Bakeoff uses the datasets present in multivariate_equal_length
datasets = list(multivariate_equal_length)

# wrap each dataset name in a callable loader, rather than loading the data
# eagerly, since add_task takes a dataset loader
dataset_loaders = [
    partial(load_UCR_UEA_dataset, name=dataset) for dataset in datasets
]

for dataset_loader in dataset_loaders:
    benchmark.add_task(
        dataset_loader,
        cv_splitter,
        scorer,
    )
```
I feel like lists like these can do the job?
Internally perhaps, but it requires a lot of "piecing together". Also, what if you only want univariate, or unequal-length datasets? There is no way to get that metadata easily.
E.g., the current UCR repository has additional multivariate examples, which this batch did not yet have.
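To illustrate the point: with per-dataset metadata attached to a collection, such a query could be a one-liner (a sketch; `get_metadata` and the metadata keys are hypothetical, and `TSCBakeoffCollection` is the example collection class named above):

```python
# hypothetical: filter a collection by per-dataset metadata
collection = TSCBakeoffCollection()

univariate_names = [
    name
    for name, meta in collection.get_metadata().items()  # hypothetical method
    if meta["univariate"]  # hypothetical metadata key
]
```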