sklearn-pandas
DataFrameMapper changes column types when default=None
When I use DataFrameMapper with default=None to transform a column, the dtypes of all other columns are changed to object. This does not happen when I have only float and/or int columns.
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
# with only numerical columns, the dtypes are preserved
da = pd.DataFrame({
'a':[1,3,np.nan],
'b': [1.2,2,3]})
print(da.dtypes)
aux_imp = DataFrameMapper([
(['a'], SimpleImputer(strategy='mean'))],
df_out=True, default=None)
da = aux_imp.fit_transform(da)
print(da.dtypes)
# if one column is of str type, the untransformed columns all become object
da = pd.DataFrame({
'a':[1,3,np.nan],
'b': [1.2,2,3],
'c':['c', 'c', 'a']
})
print(da.dtypes)
aux_imp = DataFrameMapper(
[(['a'], SimpleImputer(strategy='mean'))],
df_out=True, default=None)
da = aux_imp.fit_transform(da)
print(da.dtypes)
I believe this is because DataFrameMapper applies a single "empty transformer" to all columns that are not explicitly selected, so they are extracted together as one numpy array. If their types are mixed, the only dtype for that array that can hold strings, ints, floats, etc. is "object".
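The dtype promotion described here can be reproduced with pandas and numpy alone, without sklearn-pandas. This is only a sketch of the suspected mechanism, not the mapper's actual code; the column names are taken from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "b": [1.2, 2.0, 3.0],
    "c": ["c", "c", "a"],
})

# A purely numeric selection keeps a numeric dtype when extracted.
numeric_block = df[["b"]].to_numpy()
print(numeric_block.dtype)

# Extracting float and str columns together forces a common dtype,
# and the only dtype that can hold both is object.
mixed_block = df[["b", "c"]].to_numpy()
print(mixed_block.dtype)

# Rebuilding a DataFrame from that array does not recover the float dtype.
rebuilt = pd.DataFrame(mixed_block, columns=["b", "c"])
print(rebuilt.dtypes)
```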
I don't know if this can be worked around by "copying" the default columns one by one, keeping the dtype.
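One possible workaround, pending a fix in the library, is to cast the passed-through columns back to their original dtypes after the transform. The helper below is hypothetical (restore_dtypes is not part of sklearn-pandas), and the "transformed" frame is a stand-in built with pandas to mimic the object-typed output, so the sketch runs without the mapper.

```python
import numpy as np
import pandas as pd


def restore_dtypes(transformed, original, columns):
    """Cast the given columns of `transformed` back to their dtypes in `original`.

    Hypothetical helper: it assumes the listed columns were passed through
    unchanged and keep the same names in both frames.
    """
    dtypes = {col: original[col].dtype for col in columns}
    return transformed.astype(dtypes)


original = pd.DataFrame({
    "a": [1, 3, np.nan],
    "b": [1.2, 2, 3],
    "c": ["c", "c", "a"],
})

# Stand-in for the mapper output: "b" and "c" come back from one mixed array,
# and "a" holds the mean-imputed values.
transformed = pd.DataFrame(
    original[["b", "c"]].to_numpy(), columns=["b", "c"], index=original.index
)
transformed.insert(0, "a", [1.0, 3.0, 2.0])

restored = restore_dtypes(transformed, original, ["b", "c"])
print(restored.dtypes)
```

With the real mapper, the same call could be applied to the result of fit_transform, assuming df_out=True keeps the original names for the default columns.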
Hi, I'm new to open source contribution. Is it okay for me to work on this issue?
Hello, I would like to work on it. Can you please assign it to me?
Is this issue resolved?
I am facing the same issue. I have a DataFrame containing columns of float and str dtypes. Using default=None converts the dtype of all the columns to object, which is causing my Pipelines to fail.