Optional removal of redundant columns
Patsy automatically removes redundant (linearly dependent) columns so that the final matrix is not overdetermined. Is there an option to turn off this removal? I would like to use patsy formulas for regularized linear regression, and for that I need all the columns, even if they are redundant.
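For concreteness, here is the behaviour being asked about (the data and variable names are just illustrative):

```python
import numpy as np
from patsy import dmatrix

data = {"a": ["a1", "a2", "a1", "a2"], "b": ["b1", "b1", "b2", "b2"]}

# With the intercept present, patsy drops one dummy per categorical
# factor so the columns stay linearly independent: 1 + 1 + 1 = 3
# columns instead of 1 + 2 + 2 = 5.
full_rank = dmatrix("a + b", data)
print(full_rank.design_info.column_names)
# ['Intercept', 'a[T.a2]', 'b[T.b2]']
```

For a ridge/lasso-style fit one might instead want all five indicator columns, which is exactly what the removal prevents.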
The reason patsy doesn't do this is just that I'm not sure what the right API approach is :-)
If you have a "regularized regression" model that takes a formula and always wants redundancy removal disabled, then you want something like an extra option to `dmatrix` to do this.
If you want to be able to turn this on and off based on your whims as a user, then you want something inside the formula itself to determine the behaviour. (An idea I've played around with in the past is to have formulas like `y ~ ...` act like they currently do, and `y = ...` would turn off both redundancy removal and automatic intercept handling, since some people complain about both of those. But maybe in general people want the option to treat redundancy removal and automatic intercept handling separately.)
The very simple suggestion in #60 would at least make this possible, albeit awkwardly. (You'd have to explicitly override patsy's default redundancy removal on a factor-by-factor basis: `y ~ C(a, RedundantDummy) + C(b, RedundantDummy) + ...`. The upside is that it's clearly a useful thing to do, will only be about 5 lines of code, and won't cause any tricky/controversial interactions with the rest of the patsy API, so whenever someone feels like adding it they can just send a PR and I'll review and merge it.)
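Since `RedundantDummy` doesn't exist yet, here is a rough sketch of what such a contrast might look like using patsy's existing custom-contrast protocol (a class with `code_with_intercept` / `code_without_intercept` methods returning a `ContrastMatrix`). The class name and details are hypothetical, matching the suggestion above:

```python
import numpy as np
from patsy import dmatrix, ContrastMatrix

class RedundantDummy:
    """Hypothetical contrast: always emit one indicator column per
    level, regardless of patsy's rank-reduction decisions."""

    def _full_dummies(self, levels):
        return ContrastMatrix(np.eye(len(levels)),
                              ["[%s]" % level for level in levels])

    # Patsy calls one of these depending on whether it wants the
    # factor to absorb the intercept; we code identically either way,
    # which is what defeats the redundancy removal.
    def code_with_intercept(self, levels):
        return self._full_dummies(levels)

    def code_without_intercept(self, levels):
        return self._full_dummies(levels)

data = {"a": ["a1", "a2", "a1"], "b": ["b1", "b1", "b2"]}
mat = dmatrix("C(a, RedundantDummy) + C(b, RedundantDummy)", data)
print(mat.design_info.column_names)
```

The resulting matrix has an intercept plus two columns per factor (five columns for three rows here), i.e. it is deliberately rank-deficient, which is fine for penalized fits.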
CC @josef-pkt
> If you have a "regularized regression" model that takes a formula and always wants to have redundancy removal eliminated, then you want something like an extra option to `dmatrix` to do this.

I'd like this.

> ... maybe in general people want the option to treat redundancy removal and automatic intercept handling separately.)

Yes.

> The very simple suggestion in #60 would at least make this possible, albeit awkward.

That sounds good too.
Patsy's great, thanks!
In statsmodels, most regularization and constraints will optionally apply to only some terms, e.g. in GAM we only penalize the splines. From that perspective it would be easier to control the factor codings directly in the formula itself. (However, it would be easy to work around with a small global penalty if the option affects all terms in a formula.)
About removing all constant effects: another option would be to introduce something like `- 2` instead of `- 1 - 1` (which doesn't work, because the duplicate `- 1` is removed), where `- 2` means: don't add either an explicit or an implicit constant.
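For context on why something like `- 2` would be needed, here is the current behaviour: removing the intercept with `- 1` doesn't remove the implicit constant, because patsy switches the categorical factor to full dummy coding, and those columns still sum to a constant:

```python
from patsy import dmatrix

data = {"a": ["a1", "a2", "a1"]}

# `- 1` drops the explicit intercept, but patsy then codes `a` with
# a full set of dummies, so a[a1] + a[a2] == 1 in every row: the
# constant is still implicitly in the column span.
no_int = dmatrix("a - 1", data)
print(no_int.design_info.column_names)
# ['a[a1]', 'a[a2]']
```

There is currently no formula syntax that removes both the explicit and the implicit constant, which is the gap the `- 2` proposal would fill.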
On the question of specifying this stuff in the formula vs. as an argument to `dmatrix`:

Really, both seem helpful. If there are good ways to specify it in the formula, I imagine most direct users of patsy would do that, but it's good for libraries using patsy to be able to set up a different default behavior for their problem/domain/context using an argument to `dmatrix`.
Another use case for this is in a "predict" function where you are taking a fitted model and using it to predict at a new set of points. Since no model is being fit, there is no need for the design matrix to be nonsingular. I have had a lot of trouble with this when doing predictions with formulas in statsmodels.
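For what it's worth, the prediction workflow can reuse the stored `DesignInfo` via patsy's `build_design_matrices`, which codes new points exactly like the training data even when the new design matrix would be singular on its own (the data here is made up):

```python
import numpy as np
from patsy import dmatrix, build_design_matrices

train = {"x": [1.0, 2.0, 3.0, 4.0]}
design = dmatrix("x + np.log(x)", train)

# Reuse the saved DesignInfo so the new points get identical coding;
# a single-row matrix like this is necessarily rank-deficient, which
# is irrelevant here since nothing is being fit.
new = {"x": [5.0]}
(pred,) = build_design_matrices([design.design_info], new)
print(np.asarray(pred))
```

This sidesteps re-deriving the coding from the new data, but of course it only helps when the original fit went through patsy in the first place, which connects to the question below.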
@kshedden: you mean, you have some model that you fit without using Patsy, with some sort of redundant coding scheme, and now you want to use Patsy for doing predictions, so you need to convince Patsy to match whatever thing was done originally?
Sorry for the noise, I was using a modified version of patsy and confused myself, all is fine.