category_encoders icon indicating copy to clipboard operation
category_encoders copied to clipboard

Intercept in Contrast Coding Schemes

Open PaulWestenthanner opened this issue 2 years ago • 7 comments

Expected Behavior

The constant (all values 1) intercept column should not be added when applying contrast coding schemes (i.e. backward difference, sum, polynomial and helmert coding)

I don't think this intercept column is needed. If you fit a supervised learning model it is probably gonna help to remove the intercept column. I think it is there because when fitting linear models with statsmodels you have to add the intercept.
However I don't like that the output of an encoder would then depend on whether the intercept column is already there or not, e.g. if I first apply encoder A on column A and then encoder B on column B the intercept column of B overwrite A's intercept column hence not adding a new column. Also if I have (for some reason) a column called intercept that is not constant it would get overwritten.

Any opinion? Am I missing something? Is the intercept necessary?

Actual Behavior

A constant column with all values 1 is added

Steps to Reproduce the Problem

Run transform on any fitted contrast coding encoder, e.g.

        train = ['A', 'B', 'C']
        encoder = encoders.BackwardDifferenceEncoder(handle_unknown='value', handle_missing='value')
        encoder.fit_transform(train)

PaulWestenthanner avatar Sep 26 '22 14:09 PaulWestenthanner

Can intercept be added as a class parameter? If so, then this is the way to go, imo. Then these classes could be tested with different values of intercept to catch errors and bugs.

glevv avatar Nov 06 '22 15:11 glevv

Yes we could I think, that should be rather straight forward as well. Would you set with_intercept=True as a default for backwards compatibility or not (which might be more correct)?

PaulWestenthanner avatar Nov 06 '22 19:11 PaulWestenthanner

Yes we could I think, that should be rather straight forward as well. Would you set with_intercept=True as a default for backwards compatibility or not (which might be more correct)?

Yep, I would set it to True to keep the default behavior intact

glevv avatar Nov 06 '22 20:11 glevv

Is this one still relevant?

glevv avatar Jan 27 '23 08:01 glevv

I think it is, I'm slightly in favor of setting with_intercept to False as I don't think many people will need the intercept.
However I'd like to keep this issue open for a little while and hope for more voices from the community, maybe from actual users of the contrast coding schemes. I feel a little uneasy to take this decision by myself without further discussion

PaulWestenthanner avatar Jan 27 '23 12:01 PaulWestenthanner

user's perspective: imo encoders should do what their name says they do - encode existing columns. If there's a need to add an intercept, it has to be done consciously by the user. Especially for someone using statsmodels, where the default way is to use sm.add_constant. If you're in favor of keeping this feature as a parameter I'd suggest to make it False by default, otherwise I'd remove it completely.

jacekkrawiec avatar Jun 09 '23 08:06 jacekkrawiec

Ok this seems to be a consensus, let's remove it - at least by default

PaulWestenthanner avatar Jun 11 '23 13:06 PaulWestenthanner