Reconsider penalty scaling for SLOPE
In SLOPE version 0.3.0 and above, the penalty in the SLOPE objective is scaled depending on the type of scaling that is used in the call to SLOPE(). The behavior is:
- for `scaling = "l1"`, no scaling is applied
- for `scaling = "l2"`, the penalty is scaled with `sqrt(n)`
- for `scaling = "sd"`, the penalty is scaled with `n`
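To make the behavior concrete, here is a small sketch (in Python, for illustration only; the function name and structure are hypothetical and not part of the SLOPE package) of the scaling factor implied by the three cases above:

```python
import math

def penalty_scale(scaling: str, n: int) -> float:
    """Hypothetical helper mirroring the penalty scaling described
    for SLOPE >= 0.3.0; n is the number of observations."""
    if scaling == "l1":
        return 1.0            # no scaling
    elif scaling == "l2":
        return math.sqrt(n)   # penalty scaled with sqrt(n)
    elif scaling == "sd":
        return float(n)       # penalty scaled with n
    else:
        raise ValueError(f"unknown scaling: {scaling}")

# The scaled penalty term would then be
#   penalty_scale(scaling, n) * sum_i(lambda_i * |beta|_(i))
print(penalty_scale("l1", 100))  # 1.0
print(penalty_scale("l2", 100))  # 10.0
print(penalty_scale("sd", 100))  # 100.0
```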
There are advantages and disadvantages to this kind of scaling, and I think a discussion is warranted regarding what the correct behavior should be.
Pros
- Regularization strength is independent from the number of observations, which means that the same level of regularization is applied over, for instance, differently sized resamples in cross-validation or when fitting a trained model on a test data set.
- Scaling the penalty is standard practice in many implementations of l1-regularized models, such as glmnet, ncvreg, and biglasso.
- Having regularization strength independent from the number of observations means that the model can still control for misspecification as n becomes large.
Cons
- The fact that the penalty scaling differs depending on type of standardization can be confusing.
- Overfitting becomes less and less of an issue as n becomes larger, so it makes sense to decrease the regularization strength as n grows.
- The model definition now differs somewhat from the definitions used in almost all publications, which also means that the interpretation of the `alpha` parameter as the variance in the orthogonal X case is lost.
Possible solutions
Whichever way we go with this, I think we should keep the other option available as a toggle, i.e. add an argument along the lines of `penalty_scaling` to turn penalty scaling on or off, or even to provide more fine-grained control over it. That way either behavior would be achievable, which means this discussion is really about what the default should be.
Thoughts? Ideas?
References
Hastie et al. (2015) mention that scaling with n is "useful for cross-validation" and makes lambda values comparable across different sample sizes, but otherwise do not seem to discuss it.
- Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The lasso and generalizations (1st ed.). Chapman and Hall/CRC.
scikit-learn has a brief article covering these things here: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html
As a default I would use the same as glmnet? I agree that it should definitely be an option. Could you put in some references to what people are doing in different places?
> As a default I would use the same as glmnet? I agree that it should definitely be an option. Could you put in some references to what people are doing in different places?
I updated the post with a couple of references, but I'm having a hard time finding references on this.
Could you start an overleaf of this also? We should write down the equations so one can have a clearer discussion about them. Further, the naming should be based on the scaling, not the loss function, in my opinion. I.e. 'l1' should be 'none'; then if we have an 'l1' loss implemented, we should say that the default there is none?
> Could you start an overleaf of this also? We should write down the equations so one can have a clearer discussion about them.
Yes, absolutely.
> Further, the naming should be based on the scaling, not the loss function, in my opinion. I.e. 'l1' should be 'none'; then if we have an 'l1' loss implemented, we should say that the default there is none?
not exactly sure what you mean here
> not exactly sure what you mean here
> `scaling = "l1"`, no scaling is applied

The scaling should not be named after a loss function, so rather `scaling = "none"`.