openml-python Proposal: Use pandas str type for str datasets

Since pandas 1.0 there is a dedicated string data type (StringDtype) that can be used for text columns instead of the generic object dtype: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

I suggest we use it for datasets containing string features, as it is more descriptive and is the suggested way of representing strings in pandas.

This would for example make the Titanic dataset dtypes much more descriptive. Right now they are:

pclass          uint8
survived     category
name           object
sex          category
age           float64
sibsp           uint8
parch           uint8
ticket         object
fare          float64
cabin          object
embarked     category
boat           object
body          float64
home.dest      object

and they would be

pclass          uint8
survived     category
name           string
sex          category
age           float64
sibsp           uint8
parch           uint8
ticket         string
fare          float64
cabin          string
embarked     category
boat           string
body          float64
home.dest      string
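For illustration, here is a minimal sketch of what the conversion could look like on the pandas side. It is not how openml-python loads data; the helper name object_columns_to_string and the toy values are made up for this example, and it assumes every object column really holds text (astype("string") would stringify anything else).

```python
import pandas as pd

# Toy frame standing in for a few Titanic-like columns.
# The values are placeholders, not real dataset rows.
df = pd.DataFrame(
    {
        "pclass": pd.Series([1, 3], dtype="uint8"),
        "name": pd.Series(["passenger a", "passenger b"], dtype="object"),
        "sex": pd.Categorical(["female", "male"]),
        "cabin": pd.Series(["B5", None], dtype="object"),
    }
)


def object_columns_to_string(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every object column uses the pandas string dtype.

    Hypothetical helper: assumes object columns contain only text or
    missing values. Other dtypes are left untouched.
    """
    object_cols = frame.select_dtypes(include="object").columns
    return frame.astype({col: "string" for col in object_cols})


converted = object_columns_to_string(df)
print(converted.dtypes)
```

Missing values in the converted columns become pd.NA, which is one behavioural difference from object columns worth keeping in mind.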

Aug 20 '21 08:08 mfeurer

I just had a short look into this and it turns out this is harder than originally anticipated, for the following reasons:

1. We currently only distinguish between categorical and numerical features in the internal data loading -> need to extend this
2. We currently cache this boolean array mentioned in 1. -> need to update what's cached and potentially add something like a cache format version number
3. There's loading from feather, arff and parquet -> just a bunch of work for feather and arff, not sure about parquet. Is parquet support developed further?
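To make the first two points concrete, here is a rough sketch of one way a categorical/numerical boolean could be replaced by a three-way feature kind, stored together with a version number so stale caches can be detected. Everything here (FeatureKind, CACHE_FORMAT_VERSION, the JSON layout, the function names) is hypothetical and only illustrates the idea; it does not reflect the actual openml-python internals or cache format.

```python
import json
from enum import Enum

import pandas as pd

# Hypothetical version number, bumped whenever the cached layout changes.
CACHE_FORMAT_VERSION = 2


class FeatureKind(Enum):
    NUMERIC = "numeric"
    CATEGORICAL = "categorical"
    STRING = "string"


def feature_kinds(frame: pd.DataFrame) -> list:
    """Classify each column as numeric, categorical or string."""
    kinds = []
    for col in frame.columns:
        dtype = frame[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            kinds.append(FeatureKind.CATEGORICAL)
        elif isinstance(dtype, pd.StringDtype) or dtype == object:
            kinds.append(FeatureKind.STRING)
        elif pd.api.types.is_numeric_dtype(dtype):
            kinds.append(FeatureKind.NUMERIC)
        else:
            raise TypeError(f"Unsupported dtype {dtype} for column {col!r}")
    return kinds


def dump_cache_metadata(kinds: list) -> str:
    """Serialize the feature kinds together with the format version."""
    payload = {
        "version": CACHE_FORMAT_VERSION,
        "feature_kinds": [kind.value for kind in kinds],
    }
    return json.dumps(payload)


def load_cache_metadata(text: str):
    """Return the cached kinds, or None if the cache has another version."""
    payload = json.loads(text)
    if payload.get("version") != CACHE_FORMAT_VERSION:
        return None  # caller should rebuild the cache
    return [FeatureKind(value) for value in payload["feature_kinds"]]


frame = pd.DataFrame(
    {
        "age": [22.0, 38.5],
        "sex": pd.Categorical(["female", "male"]),
        "name": pd.Series(["passenger a", "passenger b"], dtype="string"),
    }
)
kinds = feature_kinds(frame)
cached = dump_cache_metadata(kinds)
```

With a version field like this, a cache written in the old boolean layout would simply be treated as stale and regenerated instead of being misread.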

Dec 06 '21 16:12 mfeurer

Thanks for looking into this. Re 3: I think parquet should become the preferred format as openml-python matures its parquet usage.

Dec 06 '21 17:12 PGijsbers