Categorical input support for lolopy

Open sesevgen opened this issue 5 years ago • 5 comments

I might be mistaken, but lolopy does not seem to support categorical inputs. Input of categorical features fails in utils.py with an attempted cast of X to np.float64. @WardLT

If there's a set way of providing categoricals to lolopy, it'd be useful to document or provide an example.

sesevgen avatar Dec 04 '19 18:12 sesevgen

Could you provide a stack trace? We do have support for using lolo's random forest for classification with RandomForestClassifier.

WardLT avatar Dec 04 '19 19:12 WardLT

Just to clarify, I meant using a categorical as one of the input dimensions. For example: X = [['a', 1.0, 2.0], ['b', 1.5, 2.2], ...] and y = [5.5, 6.7, ...]

for rf=RandomForestRegressor(), where I'm trying rf.fit(X,y). Sorry if this was not intended usage.
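
For illustration, here is a minimal sketch of why a mixed input like the one above cannot be cast to np.float64, plus one possible workaround on the Python side. It uses only NumPy and does not call lolopy. The helper names try_float_cast and one_hot_first_column are hypothetical, and one-hot encoding is an assumption about what a caller could do today, not something lolopy provides.

```python
import numpy as np

# Mixed input from the example above: column 0 is categorical.
X = [["a", 1.0, 2.0], ["b", 1.5, 2.2]]


def try_float_cast(rows):
    """Return rows as a float64 array, or None if the cast fails."""
    try:
        return np.asarray(rows, dtype=np.float64)
    except ValueError:
        # NumPy cannot convert a string such as 'a' to a float.
        return None


def one_hot_first_column(rows):
    """Hypothetical workaround: replace column 0 with one-hot indicator columns."""
    categories = sorted({row[0] for row in rows})
    encoded = []
    for row in rows:
        indicators = [1.0 if row[0] == c else 0.0 for c in categories]
        encoded.append(indicators + [float(v) for v in row[1:]])
    return np.asarray(encoded, dtype=np.float64)


if __name__ == "__main__":
    print(try_float_cast(X))        # the cast fails, so this prints None
    print(one_hot_first_column(X))  # all-numeric array that can be cast safely
```

The one-hot version loses the "single categorical feature" structure, which is why native categorical support would still be preferable.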

sesevgen avatar Dec 04 '19 20:12 sesevgen

Oh, I misunderstood your question, sorry!

Correct, lolopy does not support categorical inputs. How do the underlying methods in lolo handle them?

WardLT avatar Dec 04 '19 20:12 WardLT

Ok, thanks for clarifying! I don't really know the Scala side. There is an encoder written by @maxhutch. Happy to try to (eventually) figure it out and submit a PR to add support to lolopy, though.

sesevgen avatar Dec 04 '19 20:12 sesevgen

@WardLT it handles them seamlessly by encoding them into Char (only up to 256 categories are supported) and then having a special splitter for them.

The trick is going to be sending a Vector[Any], where some of those Any are Double and some of them are objects. In lolo, they don't even have to be strings: https://github.com/CitrineInformatics/lolo/blob/develop/src/main/scala/io/citrine/lolo/trees/regression/RegressionTree.scala#L45
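
As a rough Python-side sketch of the idea (not lolo's actual encoder or splitter), the two steps described above could look like this: find which columns of a mixed row list are non-numeric, then map each distinct category to a small integer code with a cap on the number of categories. The function names, the default cap of 256 taken from the comment above, and the first-appearance ordering are all assumptions for illustration.

```python
def categorical_columns(rows):
    """Return indices of columns that contain at least one non-numeric entry."""
    n_cols = len(rows[0])
    return [
        j for j in range(n_cols)
        if any(not isinstance(row[j], (int, float)) for row in rows)
    ]


def encode_categories(values, max_categories=256):
    """Map each distinct value to an integer code, in order of first appearance.

    Raises ValueError if more than max_categories distinct values are seen.
    """
    codes = {}
    for value in values:
        if value not in codes:
            if len(codes) >= max_categories:
                raise ValueError(
                    f"more than {max_categories} distinct categories"
                )
            codes[value] = len(codes)
    return [codes[value] for value in values], codes


if __name__ == "__main__":
    X = [["a", 1.0, 2.0], ["b", 1.5, 2.2], ["a", 0.5, 1.1]]
    for j in categorical_columns(X):
        encoded, mapping = encode_categories([row[j] for row in X])
        print(j, encoded, mapping)
```

Because the codes are arbitrary labels, a tree splitter would need to treat them as unordered categories rather than as numbers, which matches the "special splitter" mentioned above.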

maxhutch avatar Dec 04 '19 21:12 maxhutch