Proposal to include `numberOfInstances` and `numberOfFeatures` qualities in the dataset description

Open PGijsbers opened this issue 3 years ago • 3 comments

The dataset description.xml contains some of the most useful meta-data of the dataset. I think the number of instances/rows and the number of features should be added here. These qualities tend to be of the most interest (e.g. they would be a natural inclusion in openml-python's dataset representation), but obtaining them currently requires an additional download, which incurs user wait time and strains the server. There's already a precedent for including fields that directly reference the data (e.g. default_target_attribute, ignore_attribute and row_id_attribute). At the same time, I realize we want to be careful about slowly creating one monolithic file. The specific use case that led me to consider this is that the automl benchmark downloads qualities only to obtain the dataset dimensions. What do you think?

PGijsbers, May 14 '21 10:05

If it reduces the number of API calls this would be useful (even if the API call is a tiny bit slower). If we do this it doesn't really matter if we add 2 fields or a few more. E.g. the number of classes may also be useful? What should the return value be? Something like a parent tag 'qualities' and below that name-value pairs as children?
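To make the question about the return value concrete, here is a rough sketch of what a parent `qualities` tag with name-value children might look like, and how a client could read it. Nothing here is an agreed format: the `oml:qualities` and `oml:quality` element names, the `name` attribute, and the toy dataset with its values are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by OpenML XML documents.
NS = {"oml": "http://openml.org/openml"}

# HYPOTHETICAL description fragment: the <oml:qualities> block does not exist
# in the current description.xml; the dataset and its values are made up.
SAMPLE = """<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:name>toy_dataset</oml:name>
  <oml:default_target_attribute>class</oml:default_target_attribute>
  <oml:qualities>
    <oml:quality name="NumberOfInstances">100</oml:quality>
    <oml:quality name="NumberOfFeatures">5</oml:quality>
    <oml:quality name="NumberOfClasses">2</oml:quality>
  </oml:qualities>
</oml:data_set_description>"""


def read_qualities(xml_text):
    """Return a {name: value} dict from the assumed <oml:qualities> block."""
    root = ET.fromstring(xml_text)
    qualities = {}
    for quality in root.findall("oml:qualities/oml:quality", NS):
        qualities[quality.get("name")] = float(quality.text)
    return qualities


if __name__ == "__main__":
    dims = read_qualities(SAMPLE)
    print(dims["NumberOfInstances"], dims["NumberOfFeatures"])
```

With a layout like this, adding a few more qualities later would not change the structure, and a description without the block would simply yield an empty dict.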

@sahithyaravi1493 could you check what the speed impact is? @janvanrijn any comments on this?

joaquinvanschoren, May 18 '21 16:05

Why don't you use the dataset list function?

There you can get all tasks/datasets attached to a study, and the listing contains some important qualities (if available).

janvanrijn, May 18 '21 16:05

In the case of the automl benchmark, we actually approach the dataset through its task (we know the task id). So using the list_datasets or getting the qualities directly both require an extra query. That said, it looks like I had actually misunderstood some code in the benchmark and I think we can work around this limitation now. It does require an update to openml-python as it is still downloading data too eagerly.

Thanks everyone for the insight/discussion. I think it would still be interesting to know the effect it has on query time, but I see no reason to actually go forward with this proposal at this point.

PGijsbers, May 19 '21 12:05