
Improve dataset upload experience (with ptype and clevercsv?)

Open PGijsbers opened this issue 5 years ago • 4 comments

This feature has also been requested by @prabhant and others that want to use it to build the dataset upload web interface.

The dataset upload experience can be cumbersome, especially for inexperienced users working with csv data. We could remove some friction by improving the two most tedious steps.

Loading the dataset could be improved. The default pandas reader often chokes on a wrong separator token or on whether the csv includes a column header. CleverCSV is a drop-in replacement for Python's csv module that recognizes more dialects automatically. It can even read directly into a pandas dataframe: clevercsv.wrapper.read_dataframe("path/to/file.csv"). This should remove the need for the user to be familiar with the many parameters of pandas.read_csv.
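For illustration, loading a csv this way could look roughly like the following (a minimal sketch; the exact import path of read_dataframe may vary between clevercsv versions):

```python
# Minimal sketch: dialect-aware csv loading with clevercsv instead of pandas.read_csv.
import clevercsv

# clevercsv sniffs the dialect (separator, quote character, escape character)
# before parsing, so the user does not have to pass sep=... manually.
dataframe = clevercsv.read_dataframe("path/to/file.csv")
print(dataframe.dtypes)
```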

Another point of friction is the type annotation. Coming from an unannotated (csv) file, pandas' data type inference is far from perfect. The ptype module can improve on this type inference:

```python
from ptype.Ptype import Ptype

schema = Ptype().schema_fit(dataframe)
dataframe = schema.transform(dataframe)
```

Using these packages we can hopefully start uploading datasets with less friction. The proposed usage in the openml-python package would be:

  • expose an infer_types function that uses ptype to correct the data types of an already loaded dataframe.
  • integrate both in a read_csv_to_dataframe function that loads a csv into a dataframe with inferred types, ready for a create_dataset call (see the sketch below).
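A rough sketch of how those two helpers could fit together (nothing here exists in openml-python yet; the names, signatures, and the clevercsv/ptype calls are just the proposal above, not a final design):

```python
# Hypothetical sketch only: infer_types and read_csv_to_dataframe are the
# proposed helpers, not existing openml-python functions.
import clevercsv
from ptype.Ptype import Ptype


def infer_types(dataframe):
    """Use ptype to correct the data types of an already loaded dataframe."""
    schema = Ptype().schema_fit(dataframe)
    return schema.transform(dataframe)


def read_csv_to_dataframe(path):
    """Load a csv with clevercsv and return a dataframe with inferred types."""
    dataframe = clevercsv.read_dataframe(path)  # dialect is detected automatically
    return infer_types(dataframe)


# The resulting dataframe would then be ready for a create_dataset call.
```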

It would add ptype and clevercsv (and their respective requirements) as new dependencies to the project.

I think we should contact the ptype devs to see if they are willing to lower/loosen their requirements.

PGijsbers avatar Oct 21 '20 10:10 PGijsbers

These are great suggestions. I think a lot can be automated or augmented for the user, and I do 100% see the value for ML applications. However, I'm wondering whether OpenML is an ML application or rather a 'library of datasets' where the metadata should be 100% correct. That's a discussion we should have at some point. As a first step, I would suggest a notebook showing how to clean up a dataset for upload, where a user can interactively check whether the types were inferred correctly.

I think we should contact the ptype devs to see if they are willing to lower/loosen their requirements.

Yes, that would be great in any case. It would also help us learn how actively the project will be maintained in the future.

mfeurer avatar Oct 21 '20 11:10 mfeurer

I see this as a way to more easily upload datasets in a correct way, hence this fits OpenML perfectly? I also think it will reduce maintenance on our side: less fixing of bad metadata.

joaquinvanschoren avatar Oct 21 '20 11:10 joaquinvanschoren

Maybe 'infer_types' should not be exposed, but just used as a helper function of 'read_csv_to_dataframe'.

joaquinvanschoren avatar Oct 21 '20 11:10 joaquinvanschoren

I'm wondering whether OpenML is an ML application or rather a 'library of datasets' where the metadata should be 100% correct. That's a discussion we should have at some point.

Let's have this discussion then :) next week's workshop seems like a good occasion.

PGijsbers avatar Oct 21 '20 12:10 PGijsbers