sklearn-pandas
Dataframe output: Column types depend on the value of default
So the dtype of the output columns is the most general dtype that can hold every column's values (typically object)
import pandas as pd
import sklearn_pandas as skp
from sklearn import preprocessing

check_df = pd.DataFrame({'A': [1.0, 2.0], 'B': [1, 2], 'C': ['A', 'B']})
mapper_check = skp.DataFrameMapper([('A', preprocessing.LabelBinarizer())], default=False, df_out=True)
mapper_check.fit_transform(check_df).dtypes
A int64
dtype: object
Now use default=None:
mapper_check = skp.DataFrameMapper([('A', preprocessing.LabelBinarizer())], default=None, df_out=True)
mapper_check.fit_transform(check_df).dtypes
A object
B object
C object
dtype: object
So as we can see, setting default=None changes the dtype of column A. This happens because the stacked arrays can only have a single dtype.
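The promotion to a single dtype can be reproduced directly in numpy (a minimal sketch, independent of sklearn-pandas internals): stacking columns of different dtypes yields one array with a common dtype.

```python
import numpy as np

# Stacking columns of different dtypes forces a single common dtype,
# which is what happens before the output DataFrame is rebuilt.
ints = np.array([[1], [0]])                    # e.g. LabelBinarizer output, int64
strs = np.array([["A"], ["B"]], dtype=object)  # a passed-through string column
stacked = np.hstack([ints, strs])
print(stacked.dtype)  # object: the integer dtype is lost in the stacked array
```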
So a fix would be to first check whether df_out is true and, if so, defer the construction of the stacked array.
Edit: the issue description is not completely correct. I just built a dtype transformer: it always chooses the dtype of the column that can contain the types of all the other columns.
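For reference, such a "most general dtype" lookup can be sketched with numpy's promotion rules; common_dtype below is a hypothetical helper for illustration, not part of sklearn-pandas:

```python
import numpy as np
import pandas as pd

def common_dtype(df):
    # Hypothetical helper: the most general dtype that can hold every
    # column (int64 + float64 -> float64, anything + object -> object).
    return np.result_type(*df.dtypes)

mixed = pd.DataFrame({"A": [1, 2], "B": [1.5, 2.5]})
print(common_dtype(mixed))  # float64
```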
Sorry for the big delay. Does this issue cause you trouble somewhere else? If you know how to fix this, can you submit a PR with the fix? Thanks!
No problem. In pipelines, I had the problem that an estimator did not work with a column of object dtype containing floats/ints. This can happen if you e.g. keep an object column in the first step of a pipeline while the rest should be floats/ints: then all columns end up with object dtype. This makes it necessary to append a 'to float/int' transformer at the end of the pipeline.
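The workaround mentioned above can be sketched with a FunctionTransformer that casts the output back to float at the end of a pipeline (the name to_float is just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Illustrative 'to float' step to append once all remaining columns
# hold numeric values that ended up stored with object dtype.
to_float = FunctionTransformer(lambda X: np.asarray(X, dtype=np.float64))

X = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]}).astype(object)
print(to_float.fit_transform(X).dtype)  # float64
```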
I am not sure what a good implementation would look like. Structured numpy arrays could be used, or one could simply try to convert everything to the most common dtype, but the latter does not seem optimal.
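The structured-array idea would let each field keep its own dtype inside a single array; a minimal numpy sketch (not how sklearn-pandas is actually implemented):

```python
import numpy as np
import pandas as pd

# Each field of a structured array has its own dtype, so per-column
# types survive even though everything lives in one array.
arr = np.array(
    [(1, 1.5, "A"), (0, 2.5, "B")],
    dtype=[("A", "i8"), ("B", "f8"), ("C", "O")],
)
print(arr["A"].dtype, arr["B"].dtype)  # int64 float64
print(pd.DataFrame(arr).dtypes.tolist())  # per-column dtypes are preserved
```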
I am working on a PR for this issue and will submit this week.
I am seeing the same problem: has this been solved? I see the pull request was merged, but I am nevertheless having the same issue as described in the original question.
what version of sklearn-pandas are you using?
I am using sklearn-pandas==1.8.0.
The following works for me, using sklearn-pandas==1.8.0 installed from PyPI. Can you please provide a code snippet to reproduce your issue?
import sklearn
import sklearn.preprocessing
import pandas as pd
import numpy as np
import sklearn_pandas as skp

if __name__ == "__main__":
    sklearn.show_versions()

    check_df = pd.DataFrame({"A": [1.0, 2.0], "B": [1, 2], "C": ["A", "B"]})
    mapper_check = skp.DataFrameMapper(
        [("A", sklearn.preprocessing.LabelBinarizer())], default=None, df_out=True
    )
    actual = mapper_check.fit_transform(check_df).dtypes
    expected = pd.Series({"A": np.int64, "B": object, "C": object})
    assert (actual == expected).all()
Output:
System:
python: 3.6.6 (default, Sep 20 2018, 23:47:57) [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)]
executable: /Users/timothysweetser/.pyenv/versions/test/bin/python
machine: Darwin-19.3.0-x86_64-i386-64bit
Python dependencies:
pip: 20.0.2
setuptools: 39.0.1
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 1.0.1
matplotlib: None
joblib: 0.14.1
Built with OpenMP: True
Ah, I see the problem now: column B should be integer type
Yes, apparently it keeps the "old" type only when numeric, changing everything else to object; there is another GitHub issue referencing exactly this behaviour (I don't have it with me now). However, another one that can be relevant is here: https://github.com/scikit-learn-contrib/sklearn-pandas/issues/171
This is still an issue.
sorry for the delay. Let me look into this.