Proposal: Use the pandas string dtype for datasets with string features
Since pandas 1.0 there is a dedicated string data type (StringDtype) that can be used for text columns instead of the generic object dtype: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
I suggest we use it for datasets containing string features, as it is more descriptive and is the suggested way of representing strings in pandas.
This would for example make the Titanic dataset dtypes much more descriptive. Right now they are:
pclass uint8
survived category
name object
sex category
age float64
sibsp uint8
parch uint8
ticket object
fare float64
cabin object
embarked category
boat object
body float64
home.dest object
With the string dtype they would be:
pclass uint8
survived category
name string
sex category
age float64
sibsp uint8
parch uint8
ticket string
fare float64
cabin string
embarked category
boat string
body float64
home.dest string
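For illustration, here is a minimal sketch of the conversion itself. It uses a tiny hand-made DataFrame rather than the real Titanic data, and the helper name to_string_dtype is made up for this example; it is not part of openml-python. It also assumes that every object column holds text, which would have to be checked per dataset.

```python
import pandas as pd


def to_string_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with all object columns cast to the pandas string dtype.

    Hypothetical helper for illustration; assumes object columns contain only text.
    """
    object_cols = df.select_dtypes(include="object").columns
    return df.astype({col: "string" for col in object_cols})


# Toy frame standing in for a few Titanic columns (values are illustrative only).
df = pd.DataFrame(
    {
        "pclass": pd.Series([1, 3], dtype="uint8"),
        "name": pd.Series(["Passenger A", "Passenger B"], dtype="object"),
        "sex": pd.Categorical(["female", "male"]),
        "cabin": pd.Series(["B5", None], dtype="object"),
    }
)

converted = to_string_dtype(df)
print(converted.dtypes)
```

Only the object columns change; the uint8 and category columns keep their dtypes, and missing values in the converted columns become pd.NA.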
I had a brief look into this, and it turns out to be harder than originally anticipated, for the following reasons:
1. The internal data loading currently only distinguishes between categorical and numerical features, so this needs to be extended.
2. We currently cache the boolean array that encodes the distinction from 1, so we need to update what is cached and potentially add something like a cache format version number.
3. Data is loaded from feather, arff and parquet. For feather and arff this is just a bunch of work; I am not sure about parquet. Is parquet support being developed further?
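To make the first two points more concrete, here is a rough sketch of what a three-way feature kind plus a versioned cache entry could look like. All names here (FeatureKind, CACHE_FORMAT_VERSION, infer_feature_kinds, build_cache_metadata, is_cache_compatible) are hypothetical and do not reflect the current openml-python internals. Treating every non-categorical, non-numeric column as a string is a simplification for the example.

```python
from enum import Enum

import pandas as pd

# Hypothetical version number; bump it whenever the cached layout changes.
CACHE_FORMAT_VERSION = 2


class FeatureKind(str, Enum):
    """Three-way replacement for a categorical/numerical boolean flag."""

    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"
    STRING = "string"


def infer_feature_kinds(df: pd.DataFrame) -> dict:
    """Map each column name to a FeatureKind based on its dtype."""
    kinds = {}
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            kinds[name] = FeatureKind.CATEGORICAL
        elif pd.api.types.is_numeric_dtype(dtype):
            kinds[name] = FeatureKind.NUMERICAL
        else:
            # Simplification: everything else is treated as a string feature.
            kinds[name] = FeatureKind.STRING
    return kinds


def build_cache_metadata(df: pd.DataFrame) -> dict:
    """Build a plain dict that could be stored next to the cached data."""
    kinds = infer_feature_kinds(df)
    return {
        "version": CACHE_FORMAT_VERSION,
        "feature_kinds": {name: kind.value for name, kind in kinds.items()},
    }


def is_cache_compatible(metadata: dict) -> bool:
    """Old caches without a matching version should be rebuilt, not reused."""
    return metadata.get("version") == CACHE_FORMAT_VERSION


example = pd.DataFrame(
    {
        "age": [29.0, 2.0],
        "sex": pd.Categorical(["female", "male"]),
        "name": pd.Series(["Passenger A", "Passenger B"], dtype="string"),
    }
)
metadata = build_cache_metadata(example)
```

A cache entry written before the change would have no version key (or an older one), so is_cache_compatible would return False and the loader could fall back to regenerating the cache.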
Thanks for looking into this. Re 3: I think parquet should become the preferred format as openml-python matures its parquet usage.