skglm icon indicating copy to clipboard operation
skglm copied to clipboard

Gram-based CD/BCD/FISTA solvers for (group)Lasso when `n_samples >> n_features`

Open PABannier opened this issue 3 years ago • 13 comments

The goal of this PR is to write a CD (BCD) solver when n_samples >> n_features. Such configurations are solved much faster by pre-computing a Gram matrix XtX and updating the gradient (rather than the residuals) at every CD cycle.

A quick experiment with 1e6 samples and 300 features:

########
Lasso
########

Celer: 5.43s
Gram: 1.89s

###########
Group Lasso
###########

Celer: 43.41s
Gram: 4.23s

PABannier avatar Apr 20 '22 22:04 PABannier

@PABannier thanks a lot ! I tried it and it really shines on the data by @sehoff

Caveat: the stopping criteria are not the same. Monitoring the primal decrease is much looser than checking the duality gap. I adjusted the tolerance manually to obtain similar results

mathurinm avatar Apr 21 '22 07:04 mathurinm

Thanks a lot! I tried it out (on slightly different data than I provided), and can confirm sizable speedups. I use warm-starts for celer, so I guess the figures are even biased towards celer.

Note: in the current setting of the gram-solver: res = gram_group_lasso(X, y, a, groups=grps, tol=1e-10, max_iter=10_000, check_freq=10) with very small alphas (alpha_max/1000), the solver does not convergence after max_iter.

Do you have any suggestion on a reasonable decrease in the tolerance? Because for tol=1e-9 it reaches convergence after ~3500 iterations. speed_comparison_celer_gram .

sehoff avatar Apr 22 '22 12:04 sehoff

@sehoff hard to tell without the data but consider increasing max_iter, probably the solver is not very far from convergence.You can run it in verbose mode to see if you stop far from convergence or not.

It's easy to add warm start to the gram solvers, it should be done shortly. Beware in your comparison that the stopping criteria are not the same.

mathurinm avatar Apr 22 '22 15:04 mathurinm

Quick update:

  • [x] Updated stopping criterion to duality gap
  • [x] Gram CD
  • [x] Gram BCD
  • [x] Gram CD - FISTA
  • [x] Gram BCD - FISTA

PABannier avatar Apr 24 '22 22:04 PABannier

@sehoff support for weights (inifnite ones too) is here. Let us know how it works !

mathurinm avatar Apr 25 '22 08:04 mathurinm

@sehoff hard to tell without the data but consider increasing max_iter, probably the solver is not very far from convergence.You can run it in verbose mode to see if you stop far from convergence or not.

It's easy to add warm start to the gram solvers, it should be done shortly. Beware in your comparison that the stopping criteria are not the same.

Increasing max_iter works for me, thank you !

sehoff avatar Apr 26 '22 18:04 sehoff

@sehoff For very small alphas, I'd recommend using FISTA instead. Have a look at the gram_fista_group_lasso function!

PABannier avatar Apr 26 '22 18:04 PABannier

@sehoff support for weights (inifnite ones too) is here. Let us know how it works ! @sehoff For very small alphas, I'd recommend using FISTA instead. Have a look at the gram_fista_group_lasso function!

Here are the results of my comparisons, where I used the data I provided in the dropbox, and a tol=1e-8 in all solvers. Furthermore, the results I show refer to Case 3.1., i.e., with one weight set to infinity. As a side remark, this highlights that all three solvers can handle infinite weights! Bottomline: especially for (very) small alphas the Gram-based solvers outperform the one in celer significantly. So depending on the grid one eventually searches, each solver has its advantages. Comparison 1: speed_comparison_celer_gram Comparison 2: Note: I leave out celer here, because for alpha_max/1000 it takes quite long for convergence. speed_comparison_bcd_fista

sehoff avatar Apr 26 '22 18:04 sehoff

@sehoff indeed, Celer is particularly efficient in settings where n_features >> n_samples. It implements a working set strategy, that is particularly useful in high-regularization regime (where there are few active features). In your first figure, for low level of regularization, since your design matrix has way more samples than features, Celer working set strategy is less useful.

Besides, for your data, the Gram solver has a cheaper update since X.T @ X is of size (n_features, n_features) (X.T @ X being a useful quantity at every gradient update).

PABannier avatar Apr 26 '22 19:04 PABannier

@sehoff Thanks for the plots. They are very insightful for us. By the way, if you ever find yourself in need for an automated way of benchmarking optimization routines, look at https://github.com/benchopt/benchopt

PABannier avatar Apr 26 '22 19:04 PABannier

@mathurinm do we want to keep this PR open? FISTA and Gram-solver have been merged

PABannier avatar Oct 22 '22 15:10 PABannier

This one supports groups while the merged PR doesn't, so it does not hurt to leave it open IMO

mathurinm avatar Oct 22 '22 15:10 mathurinm

note: this is a PR from the skglm repo, instead it should be done from a branch of your fork @PABannier

mathurinm avatar Apr 11 '24 08:04 mathurinm