ScikitLearn.jl
PyError with OneHotEncoder (Julia 0.6.0 on Windows 10)
I'm getting a PyError with this code:
using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
mapper = DataFrameMapper([([:B], OneHotEncoder())]);
fit_transform!(mapper, df)
ERROR: PyError (ccall(@pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, arg, C_NULL)) <type 'exceptions.ValueError'>
ValueError('could not convert string to float: M',)
File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1844, in fit
self.fit_transform(X)
File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
self.categorical_features, copy=True)
File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems specific to OneHotEncoder. For example, LabelBinarizer works fine like this:
@sk_import preprocessing: LabelBinarizer
mapper = DataFrameMapper([(:B, LabelBinarizer())]);
I'm on Windows 10 using Julia 0.6.0. Package versions:
- Conda 0.5.3
- DataArrays 0.5.3
- DataFrames 0.10.0
- PyCall 1.14.0
- ScikitLearn 0.3.0
- ScikitLearnBase 0.3.0
I let ScikitLearn.jl automatically handle the installation of Python dependencies. The installed versions are:
python 2.7.13
numpy 1.13.0
scikit-learn 0.18.2
It's probably a bug, but have you checked if the equivalent code works in Python?
You can use ScikitLearn.Preprocessing.DictEncoder() until this gets fixed. The semantics are a bit different, but for single-column input matrices it should be the same:

DictEncoder()
For every unique row in the input matrix, associate a 0/1 binary column in the output matrix. This is similar to OneHotEncoder, but considers the entire row as a single value for one-hot-encoding. It works with any hashable datatype.
It is common to use it inside a DataFrameMapper, with a particular subset of the columns.
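Something like this should work on your example (the [:B] form passes the column to the transformer as a one-column matrix, which is what DictEncoder expects; I haven't checked the exact output, so treat it as a sketch):

using DataFrames
using ScikitLearn

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# DictEncoder maps each unique row of the input matrix ("M", "F") to its
# own 0/1 output column, so the string labels are handled directly.
mapper = DataFrameMapper([([:B], ScikitLearn.Preprocessing.DictEncoder())]);
fit_transform!(mapper, df)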
Thank you for the detailed bug report!
Sorry, my mistake. Turns out OneHotEncoder only accepts integer values. Rather unexpected and weird in my opinion, but it's clearly stated in the docs. At least I'm not the only one: https://github.com/pandas-dev/sklearn-pandas/issues/63. : )
However, I still get an 'invalid Array dimensions' error with this code:
using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
mapper = DataFrameMapper([([:A], OneHotEncoder())]);
fit_transform!(mapper, df)
invalid Array dimensions
Stacktrace:
[1] Array{Float64,N} where N(::Tuple{Int64}) at .\boot.jl:317
[2] py2array(::Type{T} where T, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\conversions.jl:381
[3] convert(::Type{Array{Float64,2}}, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\numpy.jl:480
[4] transform(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearn\src\dataframes.jl:150
[5] #fit_transform!#16(::Array{Any,1}, ::Function, ::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame, ::Void) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152
[6] fit_transform!(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152
although the equivalent code works fine in Python:
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'A': [1,2,3,4], 'B': ["M", "F", "F", "M"]})
mapper = DataFrameMapper([(['A'], OneHotEncoder())])
mapper.fit_transform(df)
Fortunately, a change to make OneHotEncoder accept strings is in the works: https://github.com/scikit-learn/scikit-learn/issues/4920
Figured it out; OneHotEncoder returns a sparse matrix by default, which PyCall doesn't know how to convert (https://github.com/JuliaPy/PyCall.jl/issues/204). It would have to be fixed there, or at the very least there should be a clearer error message on that end. Fortunately, you can solve the problem with OneHotEncoder(sparse=false).
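That is, your failing snippet should go through with just that one keyword added (same df as before):

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# sparse=false makes the encoder return a dense array instead of a scipy
# sparse matrix, so PyCall can convert the result to a Julia Matrix.
mapper = DataFrameMapper([([:A], OneHotEncoder(sparse=false))]);
fit_transform!(mapper, df)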
> Turns out OneHotEncoder only accepts integer values

Use DictEncoder! It's pure Julia, so it'll be way faster than OneHotEncoder, and it will work with any hashable type (almost anything).
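For instance, it should handle the integer column from your second snippet just as well as the string one, with no sparse-conversion issues (continuing from the same df and imports as in the example above):

# DictEncoder hashes each row's value, so integers work the same as strings.
mapper = DataFrameMapper([([:A], ScikitLearn.Preprocessing.DictEncoder())]);
fit_transform!(mapper, df)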
Thank you!
> Use DictEncoder!
Will do.
Hopefully we can soon replace all the preprocessing steps with pure Julia implementations. The work at JuliaML seems to be getting there step by step.