
PyError with OneHotEncoder (Julia 0.6.0 on Windows10)

Open ValdarT opened this issue 7 years ago • 6 comments

I'm getting a PyError with this code.

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:B], OneHotEncoder())]);

fit_transform!(mapper, df)
ERROR: PyError (ccall(@pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, arg, C_NULL)) <type 'exceptions.ValueError'>
ValueError('could not convert string to float: M',)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1844, in fit
    self.fit_transform(X)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
    self.categorical_features, copy=True)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

It seems specific to OneHotEncoder. For example, LabelBinarizer works fine like this:

mapper = DataFrameMapper([(:B, LabelBinarizer())]);

I'm on Windows 10 using Julia 0.6.0. Package versions:

- Conda                         0.5.3
- DataArrays                    0.5.3
- DataFrames                    0.10.0
- PyCall                        1.14.0
- ScikitLearn                   0.3.0
- ScikitLearnBase               0.3.0

I let ScikitLearn.jl automatically handle the installation of Python dependencies. The installed versions are:

python                    2.7.13
numpy                     1.13.0
scikit-learn              0.18.2

ValdarT avatar Jul 06 '17 13:07 ValdarT

It's probably a bug, but have you checked whether the equivalent code works in Python?

You can use ScikitLearn.Preprocessing.DictEncoder() until this gets fixed. The semantics are a bit different, but for single-column input matrices it should be the same:

DictEncoder()

For every unique row in the input matrix, associate a 0/1 binary column in the output matrix. This is similar to OneHotEncoder, but considers the entire row as a single value for one-hot-encoding. It works with any hashable datatype.

It is common to use it inside a DataFrameMapper, with a particular subset of the columns.
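The semantics described above can be sketched in a few lines of Python. This is an illustrative sketch of the idea only, not the actual Julia implementation: each unique row is hashed and assigned its own 0/1 output column.

```python
def dict_encode(rows):
    # Assign each unique row its own column index, keyed on the whole row.
    columns = {}
    for row in rows:
        key = tuple(row)
        if key not in columns:
            columns[key] = len(columns)
    # Build the 0/1 output matrix: one column per unique row.
    out = [[0] * len(columns) for _ in rows]
    for i, row in enumerate(rows):
        out[i][columns[tuple(row)]] = 1
    return out

# dict_encode([["M"], ["F"], ["F"], ["M"]])
# -> [[1, 0], [0, 1], [0, 1], [1, 0]]
```

Because the key is just a hash of the row, this works with strings, integers, or any other hashable type.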

cstjean avatar Jul 06 '17 13:07 cstjean

Thank you for the detailed bug report!

cstjean avatar Jul 06 '17 13:07 cstjean

Sorry, my mistake. It turns out OneHotEncoder only accepts integer values. Rather unexpected and weird in my opinion, but it is clearly stated in the docs. At least I'm not the only one: https://github.com/pandas-dev/sklearn-pandas/issues/63. : )
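For contrast, scikit-learn's LabelBinarizer does handle string labels directly, which is why the LabelBinarizer version of the mapper worked. A minimal Python sketch (note that with exactly two classes it emits a single 0/1 column rather than one column per class):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# classes_ ends up sorted as ['F', 'M']; with only two classes,
# the output is a single 0/1 column (1 for 'M', 0 for 'F').
out = lb.fit_transform(["M", "F", "F", "M"])
```

With three or more classes, the same call would instead produce one indicator column per class.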

However, I still get an 'invalid Array dimensions' error with this code:

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:A], OneHotEncoder())]);

fit_transform!(mapper, df)
invalid Array dimensions

Stacktrace:
 [1] Array{Float64,N} where N(::Tuple{Int64}) at .\boot.jl:317
 [2] py2array(::Type{T} where T, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\conversions.jl:381
 [3] convert(::Type{Array{Float64,2}}, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\numpy.jl:480
 [4] transform(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearn\src\dataframes.jl:150
 [5] #fit_transform!#16(::Array{Any,1}, ::Function, ::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame, ::Void) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152
 [6] fit_transform!(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152

although this code in Python works fine:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': [1,2,3,4], 'B': ["M", "F", "F", "M"]})
mapper = DataFrameMapper([(['A'], OneHotEncoder())])

mapper.fit_transform(df)

ValdarT avatar Jul 07 '17 12:07 ValdarT

Fortunately, a change to make OneHotEncoder accept strings is in the works: https://github.com/scikit-learn/scikit-learn/issues/4920
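That change has since landed: scikit-learn 0.20 and later accept string categories in OneHotEncoder directly. A minimal Python sketch (requires a scikit-learn version that includes the linked feature):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()  # output is sparse by default
# Strings are accepted directly; categories are sorted, so
# the columns come out in the order ['F', 'M'].
out = enc.fit_transform([["M"], ["F"], ["F"], ["M"]]).toarray()
# "M" -> [0, 1], "F" -> [1, 0]
```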

ValdarT avatar Jul 07 '17 12:07 ValdarT

Figured it out; OneHotEncoder returns a sparse matrix by default, which PyCall doesn't know how to convert (https://github.com/JuliaPy/PyCall.jl/issues/204). It would have to be fixed there. Or at the very least, there should be a clearer error message on that end.

Fortunately, you can solve the problem with OneHotEncoder(sparse=false).
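The sparse-by-default behaviour is easy to see from the Python side. A minimal sketch: the default output is a SciPy sparse matrix (which PyCall cannot convert), and `.toarray()` gives the dense equivalent. Note that the keyword to disable sparse output has varied across scikit-learn versions (`sparse` in older releases, `sparse_output` since 1.2), so the snippet below just demonstrates the default:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1], [2], [3], [4]])
out = OneHotEncoder().fit_transform(X)  # SciPy sparse matrix by default
dense = out.toarray()                   # dense ndarray that PyCall can convert
# Four distinct values, one-hot encoded -> a 4x4 identity matrix.
```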

Turns out OneHotEncoder only accepts integer values

Use DictEncoder! It's pure Julia, so it'll be way faster than OneHotEncoder, and it will work with any hashable type (almost anything).

cstjean avatar Jul 07 '17 12:07 cstjean

Thank you!

Use DictEncoder!

Will do.

Hopefully we can soon replace all the preprocessing steps with pure Julia implementations. The work at JuliaML seems to be getting there step by step.

ValdarT avatar Jul 07 '17 13:07 ValdarT