openml-python suggestion for dataset upload: check for name conflicts

Open joaquinvanschoren opened this issue 2 years ago • 1 comments

Description

When submitting a new dataset with the same name as an existing dataset, the OpenML REST API automatically stores it as a new version of that dataset. It would be nice to confirm this with the user first. The process would be:

1. The Python API checks whether a dataset with that name exists.
2. If so, it asks the user whether she wants to store the dataset as a new version or rename the dataset.

This is also how other platforms handle it (e.g., Google Drive).

Is anything like this possible? Maybe fail by default and add an option 'allow_versioning'? Or something less intrusive?

Mar 28 '22 22:03 joaquinvanschoren

Having a check makes sense. On the one hand, I think server-side support would be the better option, i.e., the server itself accepts an allow_versioning parameter in the REST call, and openml-python passes it along. That keeps the behavior consistent across the different OpenML connectors. It would also allow other kinds of checks (e.g., based on meta-features or description similarity) without each connector having to implement that logic. On the other hand, the current upload flow might require sending the whole dataset before the check runs, which seems like a terrible user experience (potentially having to upload it twice). Maybe we should revisit a server-side check in V2?

Is anything like this possible?

Yes, you can get datasets by name in openml-python (or, more generally, the list_datasets function lets you check for this). Adding the check should take no more than a few lines of code.
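
For illustration, a rough sketch of what such a client-side check could look like. This is not existing openml-python code: `existing_versions`, `check_upload_name`, and `DatasetNameConflict` are hypothetical names, and `allow_versioning` is only the option floated above. The sketch assumes the dataset listing is available as records with `name` and `version` fields and that names are compared exactly; how the listing is fetched is left as a commented assumption.

```python
from typing import Iterable, List, Mapping


class DatasetNameConflict(Exception):
    """Hypothetical error: the upload would silently become a new version."""


def existing_versions(name: str, listing: Iterable[Mapping]) -> List[int]:
    """Return the sorted version numbers of listed datasets called `name`."""
    # Assumption: exact, case-sensitive name match; the server may differ.
    return sorted(int(row["version"]) for row in listing if row["name"] == name)


def check_upload_name(
    name: str, listing: Iterable[Mapping], allow_versioning: bool = False
) -> List[int]:
    """Fail by default when `name` is already taken.

    Returns the existing versions so the caller can tell the user what an
    upload with `allow_versioning=True` would be added on top of.
    """
    versions = existing_versions(name, listing)
    if versions and not allow_versioning:
        raise DatasetNameConflict(
            f"A dataset named {name!r} already exists (versions {versions}). "
            "Rename the dataset or pass allow_versioning=True."
        )
    return versions


# Assumed way to build `listing` with openml-python (needs network access,
# and the exact arguments may differ between releases):
#   import openml
#   listing = openml.datasets.list_datasets(
#       output_format="dataframe"
#   ).to_dict("records")
```

The check runs before any data is sent, so it avoids the upload-twice problem, but it only covers name conflicts and each connector would need its own copy.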

Mar 29 '22 08:03 PGijsbers
