
Dataframe output: Column types depend on the value of default

Open datajanko opened this issue 7 years ago • 11 comments

The dtype of the output columns is the most general type needed to hold the values of every column (typically object).

import pandas as pd
import sklearn_pandas as skp
from sklearn import preprocessing

check_df = pd.DataFrame({'A': [1.0, 2.0], 'B': [1, 2], 'C': ['A', 'B']})
mapper_check = skp.DataFrameMapper([('A', preprocessing.LabelBinarizer())], default=False, df_out=True)
mapper_check.fit_transform(check_df).dtypes
A    int64
dtype: object

Now use default=None:

mapper_check= skp.DataFrameMapper([('A', preprocessing.LabelBinarizer())], default=None, df_out=True)
mapper_check.fit_transform(check_df).dtypes
A    object
B    object
C    object
dtype: object

As we can see, setting default=None changes the dtype of column A. This is because the stacked array can only have a single dtype.
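To illustrate the single-dtype point, here is a minimal sketch using only NumPy and pandas. It does not reproduce the sklearn-pandas internals; the per-column arrays are made up for illustration, and the assumption is simply that the mapper stacks per-column results into one array before building the DataFrame.

```python
import numpy as np
import pandas as pd

# Illustrative per-column arrays, each with its own dtype.
a = np.array([[1.0], [2.0]])                # float64
b = np.array([[1], [2]], dtype=np.int64)    # int64
c = np.array([["A"], ["B"]], dtype=object)  # object (strings)

# A NumPy array has exactly one dtype, so stacking forces a common one.
stacked = np.hstack([a, b, c])
print(stacked.dtype)

# A DataFrame built from that array no longer knows that the first two
# columns were numeric.
stacked_df = pd.DataFrame(stacked, columns=["A", "B", "C"])
print(stacked_df.dtypes)
```

Once the arrays are stacked, the numeric columns are stored as generic Python objects, which matches the dtypes shown above.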

So a fix would be to check first whether df_out is true and, if so, defer the construction of the stacked array.
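One way such a deferral could look, as a rough sketch rather than the actual sklearn-pandas code: keep each column's result separate, wrap each one in its own DataFrame, and concatenate column-wise. The `parts` dictionary and its contents are hypothetical stand-ins for the per-column transformer outputs.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column outputs (names and values are illustrative only).
parts = {
    "A": np.array([[0], [1]], dtype=np.int64),
    "B": np.array([[1], [2]], dtype=np.int64),
    "C": np.array([["A"], ["B"]], dtype=object),
}

# Build one small DataFrame per column, then join them side by side.
# No intermediate single-dtype array is created, so dtypes are preserved.
frames = [pd.DataFrame(values, columns=[name]) for name, values in parts.items()]
result = pd.concat(frames, axis=1)
print(result.dtypes)
```

Because `pd.concat` with `axis=1` works column by column, the integer columns stay integer even though a string column sits next to them.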

Edit: The issue as stated is not completely correct. I just built a dtype transformer: the mapper always chooses the type that can contain the types of all the other columns.

datajanko avatar Feb 05 '18 13:02 datajanko

Sorry for the big delay. Does this issue cause you trouble somewhere else? If you know how to fix this, can you submit a PR with the fix? Thanks!

dukebody avatar Mar 25 '18 15:03 dukebody

No problem. In pipelines, I had the problem that an estimator did not work with a column of object type containing floats/ints. This can happen if, for example, you keep an object column in the first step of a pipeline while the rest should be floats/ints. Then all columns end up with object type. This makes it necessary to append a 'to float/int' transformer at the end of the pipeline.
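A minimal sketch of such a 'to float/int' step, assuming the data arrives as a pandas DataFrame. The helper name `restore_numeric_dtypes` is hypothetical; it relies on pandas' `DataFrame.infer_objects`, which re-infers better dtypes for object columns.

```python
import pandas as pd


def restore_numeric_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: re-infer dtypes of object columns holding numbers."""
    return df.infer_objects()


# Simulate the situation described above: every column has object dtype.
mixed = pd.DataFrame({"A": [1.0, 2.0], "B": [1, 2], "C": ["A", "B"]}).astype(object)
print(mixed.dtypes)

fixed = restore_numeric_dtypes(mixed)
print(fixed.dtypes)
```

If this needs to live inside a scikit-learn pipeline, wrapping the helper in `sklearn.preprocessing.FunctionTransformer` might be one option, though that is untested here.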

I am not sure what a good implementation would look like. Structured NumPy arrays could be used, or one could simply try to convert everything to the most common dtype. But the latter does not seem optimal.
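For reference, a small sketch of what a structured NumPy array looks like for the example data. The field names and dtypes are chosen here for illustration; this is not a proposal for how sklearn-pandas should implement it.

```python
import numpy as np

# One record per row; each field carries its own dtype.
records = np.array(
    [(1.0, 1, "A"), (2.0, 2, "B")],
    dtype=[("A", np.float64), ("B", np.int64), ("C", "U1")],
)
print(records.dtype)

# Fields are accessed by name and keep their individual dtypes.
print(records["A"].dtype)
print(records["B"].dtype)
```

A caveat with this approach is that most estimators expect a plain 2-D array, so a structured array would likely need to be converted again before fitting.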

datajanko avatar Mar 25 '18 18:03 datajanko

I am working on a PR for this issue and will submit this week.

hacktuarial avatar Apr 16 '18 18:04 hacktuarial

I am seeing the same problem: has this been solved? I see the merge request being accepted but nevertheless I am having the same issue as described in the original question.

gennaro-tedesco avatar Feb 13 '20 15:02 gennaro-tedesco

What version of sklearn-pandas are you using?

hacktuarial avatar Feb 13 '20 19:02 hacktuarial

I am using sklearn-pandas==1.8.0.

gennaro-tedesco avatar Feb 14 '20 09:02 gennaro-tedesco

The following works for me, using sklearn-pandas==1.8.0 installed from PyPI. Can you please provide a code snippet to reproduce your issue?

import sklearn
import sklearn.preprocessing  # make sure the submodule is loaded
import pandas as pd
import numpy as np
import sklearn_pandas as skp

if __name__ == "__main__":
    sklearn.show_versions()
    check_df = pd.DataFrame({"A": [1.0, 2.0], "B": [1, 2], "C": ["A", "B"]})
    mapper_check = skp.DataFrameMapper(
        [("A", sklearn.preprocessing.LabelBinarizer())], default=None, df_out=True
    )
    actual = mapper_check.fit_transform(check_df).dtypes
    expected = pd.Series({"A": np.int64, "B": object, "C": object})
    assert (actual == expected).all()

Output:

System:
    python: 3.6.6 (default, Sep 20 2018, 23:47:57)  [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)]
executable: /Users/timothysweetser/.pyenv/versions/test/bin/python
   machine: Darwin-19.3.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 39.0.1
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 1.0.1
matplotlib: None
    joblib: 0.14.1

Built with OpenMP: True

hacktuarial avatar Feb 14 '20 15:02 hacktuarial

Ah, I see the problem now: column B should be integer type

hacktuarial avatar Feb 14 '20 15:02 hacktuarial

> Ah, I see the problem now: column B should be integer type

Yes, apparently it keeps the "old" type only when numeric, changing everything else to object; there is another GitHub issue referencing exactly this behaviour (I don't have it with me now). However, another one that can be relevant is here: https://github.com/scikit-learn-contrib/sklearn-pandas/issues/171

gennaro-tedesco avatar Feb 14 '20 16:02 gennaro-tedesco

This is still an issue.

kirel avatar Oct 16 '20 11:10 kirel

Sorry for the delay. Let me look into this.

ragrawal avatar May 08 '21 08:05 ragrawal