Gram-based CD/BCD/FISTA solvers for (group)Lasso when `n_samples >> n_features`
The goal of this PR is to write a CD (and BCD) solver for the case where n_samples >> n_features.
Such configurations are solved much faster by precomputing the Gram matrix `X.T @ X` and updating the gradient (rather than the residuals) at every CD cycle.
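For readers new to the trick, here is a minimal sketch of Gram-based CD for the plain Lasso; function and variable names are illustrative, not the actual code of this PR:

```python
import numpy as np


def soft_threshold(x, tau):
    """Soft-thresholding, the proximal operator of tau * |.|."""
    return np.sign(x) * max(abs(x) - tau, 0.0)


def gram_cd_lasso(X, y, alpha, max_iter=100, tol=1e-8):
    """Coordinate descent for the Lasso with a precomputed Gram matrix.

    Solves min_w ||y - X w||^2 / (2 * n_samples) + alpha * ||w||_1.
    All O(n_samples) work happens once up front; each coordinate
    update then only touches one column of the Gram matrix.
    """
    n_samples, n_features = X.shape
    XtX = X.T @ X / n_samples        # Gram matrix, computed once
    Xty = X.T @ y / n_samples
    lipschitz = np.diag(XtX)         # per-coordinate Lipschitz constants

    w = np.zeros(n_features)
    grad = -Xty.copy()               # gradient of the datafit at w = 0

    for _ in range(max_iter):
        max_delta = 0.0
        for j in range(n_features):
            if lipschitz[j] == 0.0:
                continue
            old_w_j = w[j]
            w[j] = soft_threshold(old_w_j - grad[j] / lipschitz[j],
                                  alpha / lipschitz[j])
            delta = w[j] - old_w_j
            if delta != 0.0:
                # update the gradient in O(n_features) instead of
                # updating residuals in O(n_samples)
                grad += delta * XtX[:, j]
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return w
```

After the one-off Gram computation, each CD epoch is independent of n_samples, which is where the speedup below comes from.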
A quick experiment with 1e6 samples and 300 features:
| Solver | Lasso  | Group Lasso |
| ------ | ------ | ----------- |
| Celer  | 5.43 s | 43.41 s     |
| Gram   | 1.89 s | 4.23 s      |
@PABannier thanks a lot! I tried it and it really shines on the data from @sehoff.
Caveat: the stopping criteria are not the same. Monitoring the primal decrease is much looser than checking the duality gap, so I adjusted the tolerance manually to obtain similar results.
Thanks a lot! I tried it out (on slightly different data than I provided) and can confirm sizable speedups. I use warm starts for celer, so the figures are, if anything, biased in celer's favor.
Note: in the current setting of the Gram solver, `res = gram_group_lasso(X, y, a, groups=grps, tol=1e-10, max_iter=10_000, check_freq=10)`, with very small alphas (alpha_max / 1000) the solver does not converge within max_iter.
Do you have a suggestion for a reasonable tolerance? With tol=1e-9 it reaches convergence after ~3500 iterations.
@sehoff hard to tell without the data, but consider increasing max_iter; the solver is probably not very far from convergence. You can run it in verbose mode to see whether you stop far from convergence or not.
It's easy to add warm starts to the Gram solvers; it should be done shortly. Beware that the stopping criteria in your comparison are not the same.
Quick update:
- [x] Updated stopping criterion to duality gap (see the sketch after this list)
- [x] Gram CD
- [x] Gram BCD
- [x] Gram CD - FISTA
- [x] Gram BCD - FISTA
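For context, monitoring the duality gap gives an optimality certificate rather than just a stagnation test. A minimal sketch of the Lasso duality gap follows; names are illustrative, not the PR's code:

```python
import numpy as np


def lasso_duality_gap(X, y, w, alpha):
    """Duality gap for min_w ||y - X w||^2 / (2 n) + alpha * ||w||_1.

    A small gap certifies that w is epsilon-optimal, a much stronger
    guarantee than observing that the primal stops decreasing.
    """
    n_samples = X.shape[0]
    residual = y - X @ w
    primal = residual @ residual / (2 * n_samples) + alpha * np.abs(w).sum()

    # rescale the residual so the dual point is feasible,
    # i.e. ||X.T theta||_inf <= alpha
    theta = residual / n_samples
    dual_norm = np.max(np.abs(X.T @ theta))
    if dual_norm > alpha:
        theta *= alpha / dual_norm
    dual = (y @ y / (2 * n_samples)
            - n_samples / 2 * np.sum((theta - y / n_samples) ** 2))
    return primal - dual
```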
@sehoff support for weights (infinite ones too) is here. Let us know how it works!
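As a side note, an infinite weight simply pins the corresponding coefficient to zero, excluding that feature from the model. A minimal sketch of the weighted per-coordinate prox (illustrative names, assuming a penalty of the form alpha * sum_j weight_j * |w_j|):

```python
import numpy as np


def weighted_soft_threshold(x, tau, weight):
    """Prox of tau * weight * |.| for one coordinate.

    An infinite weight forces the coefficient to zero, i.e. the
    corresponding feature is excluded from the model.
    """
    if np.isinf(weight):
        return 0.0
    return np.sign(x) * max(abs(x) - tau * weight, 0.0)
```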
Increasing max_iter works for me, thank you!
@sehoff For very small alphas, I'd recommend using FISTA instead. Have a look at the `gram_fista_group_lasso` function!
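For reference, here is a minimal sketch of FISTA on the Gram formulation of the plain Lasso; it illustrates the idea but is not the `gram_fista_group_lasso` implementation:

```python
import numpy as np


def gram_fista_lasso(X, y, alpha, max_iter=1000):
    """FISTA on the Gram formulation of the Lasso (illustrative sketch).

    Accelerated proximal gradient has an O(1/k^2) rate, which helps
    in the low-regularization regime where plain CD needs many epochs.
    """
    n_samples, n_features = X.shape
    XtX = X.T @ X / n_samples
    Xty = X.T @ y / n_samples
    # step size 1 / L, with L the largest eigenvalue of the Gram matrix
    step = 1.0 / np.linalg.eigvalsh(XtX)[-1]

    w = np.zeros(n_features)
    z = w.copy()   # extrapolated point
    t = 1.0
    for _ in range(max_iter):
        w_old = w.copy()
        forward = z - step * (XtX @ z - Xty)   # gradient step
        w = np.sign(forward) * np.maximum(np.abs(forward) - alpha * step, 0.0)
        t_old, t = t, (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        z = w + (t_old - 1) / t * (w - w_old)  # Nesterov extrapolation
    return w
```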
Here are the results of my comparisons, using the data I provided in the Dropbox and tol=1e-8 for all solvers. The results shown refer to Case 3.1, i.e., with one weight set to infinity. As a side remark, this highlights that all three solvers can handle infinite weights!
Bottom line: especially for (very) small alphas, the Gram-based solvers significantly outperform the one in celer. So depending on the grid one eventually searches, each solver has its advantages.
Comparison 1:
Comparison 2: Note: I leave out celer here, because for alpha_max / 1000 it takes quite long to converge.

@sehoff indeed, Celer is particularly efficient in settings where n_features >> n_samples. It implements a working-set strategy, which is particularly useful in the high-regularization regime (where there are few active features). In your first figure, at low levels of regularization and with a design matrix that has far more samples than features, Celer's working-set strategy is less useful.
Besides, on your data the Gram solver has a cheaper update: X.T @ X has shape (n_features, n_features) and is exactly the quantity needed at every gradient update, so each coordinate update costs O(n_features) rather than the O(n_samples) a residual update would.
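To make the working-set intuition concrete, here is a hedged sketch of a generic working-set outer loop for the Lasso; `subproblem_solver` is a hypothetical inner solver (e.g. the Gram CD sketch above), and none of this is Celer's actual implementation:

```python
import numpy as np


def working_set_solver(X, y, alpha, subproblem_solver,
                       ws_size=10, max_outer=20, tol=1e-8):
    """Generic working-set outer loop (illustration only).

    Repeatedly solve the problem restricted to a small set of
    likely-active features, growing the set until the KKT conditions
    hold for all features. This pays off when few features are
    active, i.e. with strong regularization or n_features >> n_samples.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(max_outer):
        grad = X.T @ (X @ w - y) / n_samples
        # KKT violation per feature: zero at the optimum
        violation = np.abs(grad) - alpha
        violation[w != 0] = np.abs(np.abs(grad[w != 0]) - alpha)
        if np.max(violation) <= tol:
            break
        # working set = worst violators, plus currently active features
        ws = np.union1d(np.argsort(violation)[-ws_size:], np.flatnonzero(w))
        w_ws = subproblem_solver(X[:, ws], y, alpha)
        w = np.zeros(n_features)
        w[ws] = w_ws
        ws_size = min(2 * ws_size, n_features)  # grow the set
    return w
```

With strong regularization the working set stays tiny and most features are never touched; with weak regularization nearly everything ends up in the set, which is why the strategy loses its edge there.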
@sehoff Thanks for the plots. They are very insightful for us. By the way, if you ever find yourself in need of an automated way of benchmarking optimization routines, have a look at https://github.com/benchopt/benchopt
@mathurinm do we want to keep this PR open? The FISTA and Gram solvers have been merged.
This one supports groups while the merged PR doesn't, so it does not hurt to leave it open IMO
Note: this PR was opened from a branch of the skglm repo; it should instead come from a branch of your fork @PABannier.