openml-python
Performance difference with Pandas and Numpy encoding
Description
OpenML-Python handles data in two primary formats: Pandas DataFrames and Numpy arrays. A Pandas DataFrame retains the original data types of the columns, and the onus is on the user to handle the different types during modelling, for example by using scikit-learn's OneHotEncoder to handle categorical columns. For Numpy arrays, on the other hand, all columns are numerically encoded. This difference in encoding leads to minor variations in the final representation of the data before the model-building stage. As a result, the performance recorded on certain folds may differ marginally.
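The following is a minimal sketch of how the two representations can diverge before modelling. It uses a hypothetical toy table rather than an OpenML dataset, and it assumes the DataFrame path is one-hot encoded by the user (here with `pd.get_dummies` standing in for OneHotEncoder) while the array path replaces each category with an integer code. It is not the encoding code used inside OpenML-Python.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from any OpenML dataset.
df = pd.DataFrame({
    "colour": pd.Categorical(["red", "green", "blue", "green"]),
    "size": [1.0, 2.5, 0.5, 3.0],
})

# DataFrame-style path: the category dtype is preserved, so the user
# encodes it explicitly, here as one-hot indicator columns.
one_hot = pd.get_dummies(df, columns=["colour"], dtype=float)

# Array-style path (assumed behaviour): each category becomes an integer code.
numeric = df.copy()
numeric["colour"] = numeric["colour"].cat.codes
as_array = numeric.to_numpy(dtype=float)

# The model sees a different number of features in each case.
print(one_hot.shape)
print(as_array.shape)
```

With one-hot encoding the single categorical column expands into one column per category, whereas the integer-coded array keeps one column and imposes an order on the categories. A model that is sensitive to feature scale or ordering can therefore produce slightly different per-fold scores.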
Steps/Code to Reproduce
This notebook illustrates how the performance recorded in the evaluations of a run can differ on certain folds when the same flow is run on the same task with different dataset formats.
Expected Results
No difference in performance.