
Parquet, categoricals and data-types

Open sebffischer opened this issue 2 years ago • 0 comments

I have noticed some differences between the parquet and the arff files: for example, the column classes integer and double can differ between the two formats. Furthermore, the arrow reader uses non-standard metadata to encode the categoricals (see this issue: https://github.com/duckdb/duckdb/issues/3309#issuecomment-1087755900). The arrow library, however, is really unusable in R (multiple people have reported that); I am not sure how it fares in Julia or Java. Also, the "features" metadata currently does not provide enough information to convert the columns so that the parsed arff files and the parsed parquet files are really identical.
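
To illustrate the kind of check I have in mind, here is a minimal sketch in Python. It assumes each reader's result has already been summarised as a mapping from column name to a type label; the function name `diff_column_types`, the column names and the type labels are all hypothetical and not part of any OpenML API.

```python
from typing import Dict, List, Tuple


def diff_column_types(
    arff_types: Dict[str, str], parquet_types: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (column, arff_type, parquet_type) for every column whose type differs.

    Columns present on only one side are reported with the type "<missing>".
    """
    mismatches = []
    for name in sorted(set(arff_types) | set(parquet_types)):
        a = arff_types.get(name, "<missing>")
        p = parquet_types.get(name, "<missing>")
        if a != p:
            mismatches.append((name, a, p))
    return mismatches


# Hypothetical type summaries for the same dataset read from both formats.
arff_types = {"age": "double", "children": "double", "class": "factor"}
parquet_types = {"age": "double", "children": "integer", "class": "character"}

mismatches = diff_column_types(arff_types, parquet_types)
for name, a, p in mismatches:
    print(f"{name}: arff={a}, parquet={p}")
```

If the "features" metadata recorded the intended type of every column, a report like this could be used to decide which columns need converting after parsing.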

sebffischer • Apr 05 '22 11:04