category_encoders
category_encoders copied to clipboard
Intercept in Contrast Coding Schemes
Expected Behavior
The constant (all values 1) intercept column should not be added when applying contrast coding schemes (i.e. backward difference, sum, polynomial and helmert coding)
I don't think this intercept column is needed. If you fit a supervised learning model it is probably gonna help to remove the intercept column. I think it is there because when fitting linear models with statsmodels you have to add the intercept.
However I don't like that the output of an encoder would then depend on whether the intercept column is already there or not, e.g. if I first apply encoder A on column A and then encoder B on column B the intercept column of B overwrite A's intercept column hence not adding a new column. Also if I have (for some reason) a column called intercept that is not constant it would get overwritten.
Any opinion? Am I missing something? Is the intercept necessary?
Actual Behavior
A constant column with all values 1 is added
Steps to Reproduce the Problem
Run transform on any fitted contrast coding encoder, e.g.
train = ['A', 'B', 'C']
encoder = encoders.BackwardDifferenceEncoder(handle_unknown='value', handle_missing='value')
encoder.fit_transform(train)
Can intercept be added as a class parameter? If so, then this is the way to go, imo. Then these classes could be tested with different values of intercept to catch errors and bugs.
Yes we could I think, that should be rather straight forward as well. Would you set with_intercept=True
as a default for backwards compatibility or not (which might be more correct)?
Yes we could I think, that should be rather straight forward as well. Would you set
with_intercept=True
as a default for backwards compatibility or not (which might be more correct)?
Yep, I would set it to True to keep the default behavior intact
Is this one still relevant?
I think it is, I'm slightly in favor of setting with_intercept
to False
as I don't think many people will need the intercept.
However I'd like to keep this issue open for a little while and hope for more voices from the community, maybe from actual users of the contrast coding schemes. I feel a little uneasy to take this decision by myself without further discussion
user's perspective: imo encoders should do what their name says they do - encode existing columns. If there's a need to add an intercept, it has to be done consciously by the user. Especially for someone using statsmodels
, where the default way is to use
sm.add_constant
. If you're in favor of keeping this feature as a parameter I'd suggest to make it False
by default, otherwise I'd remove it completely.
Ok this seems to be a consensus, let's remove it - at least by default