
ENH improve speed of fitting Cox model by relying on the fast skglm solver

Open mathurinm opened this issue 1 year ago • 6 comments

I am one of the developers of skglm, a Python package that extends scikit-learn for Generalized Linear Models by providing more functionality and faster solvers. We have recently worked on a solver for the Cox estimator, for which lifelines provides a reference implementation.

Preliminary results indicate speedups of up to 500x when using skglm.

Timing comparison between skglm and lifelines

Here is a notebook to illustrate the performance and to showcase the scikit-learn-like API of skglm. Also, here are the results of a complete benchmark with Benchopt and the link to the benchmark repo to reproduce it.

In addition, some skglm features might be useful to the users of lifelines:

  • support for design matrices with more columns than rows (may cause issues in lifelines)
  • support for sparse design matrices (currently not supported in lifelines)
  • immediate extension to other penalties such as weighted L1, non-convex regularizers, group Lasso, etc.

Based on this, we'd like to discuss the potential integration of the skglm solver into lifelines for fitting the Cox estimator.

A noteworthy point is that skglm relies heavily on numba JIT compilation, which may introduce a slight overhead during the initial model fit. However, this inconvenience is offset by the advantages gained, namely handling datasets with thousands of features and samples within a reasonable time.

We'd be happy to have your feedback on this.

Also pinging @Badr-Moufad @PABannier @QB3

mathurinm avatar Jun 07 '23 09:06 mathurinm

Wow, that's very impressive! One thing I think you should try is to bin the times into buckets (tied times are common in survival datasets, as we often round to months, days, hours, etc.). The Cox model works by sorting times, but when there are ties, it has to use a technique to handle them. There are a few techniques to handle ties: random, Efron, Breslow, and exact (the most accurate, but slowest). Lifelines uses Efron's method, as its accuracy-to-speed tradeoff is good.
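
To make the tie-handling point concrete, here is a minimal NumPy sketch of the Breslow and Efron log partial likelihoods. It is an illustration only, not the lifelines or skglm implementation: the function name `cox_log_partial_likelihood`, the synthetic data, and the 30-day "month" bins are all assumptions made for the example, and the loop is deliberately simple rather than fast.

```python
import numpy as np


def cox_log_partial_likelihood(u, times, events, ties="efron"):
    """Log partial likelihood for linear predictors u = X @ w (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    times = np.asarray(times)
    events = np.asarray(events, dtype=bool)
    exp_u = np.exp(u)
    total = 0.0
    # Loop over the distinct times at which at least one event occurred.
    for t in np.unique(times[events]):
        deaths = (times == t) & events          # events tied at time t
        d = int(deaths.sum())
        risk_sum = exp_u[times >= t].sum()      # everyone still at risk at t
        death_sum = exp_u[deaths].sum()
        total += u[deaths].sum()
        if ties == "breslow":
            # Every tied event uses the full risk set.
            total -= d * np.log(risk_sum)
        elif ties == "efron":
            # The l-th tied event removes a fraction l/d of the tied group.
            total -= np.log(risk_sum - np.arange(d) / d * death_sum).sum()
        else:
            raise ValueError("ties must be 'breslow' or 'efron'")
    return total


# Synthetic data (assumption): continuous durations in days, then binned to months.
rng = np.random.default_rng(0)
u = rng.normal(size=200)
days = rng.exponential(scale=365.0, size=200)
events = rng.random(200) < 0.7
months = np.floor(days / 30.0)  # binning creates ties

for label, t in [("days (no ties)", days), ("months (ties)", months)]:
    breslow = cox_log_partial_likelihood(u, t, events, ties="breslow")
    efron = cox_log_partial_likelihood(u, t, events, ties="efron")
    print(f"{label}: breslow={breslow:.4f} efron={efron:.4f}")
```

Without ties the two methods coincide; once the times are binned they differ, which is why a benchmark on binned times exercises the tie-handling code path.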

CamDavidsonPilon avatar Jun 07 '23 12:06 CamDavidsonPilon

Indeed, we are working on adding support for Efron handling of ties here: https://github.com/scikit-learn-contrib/skglm/pull/159. It should be merged shortly.

mathurinm avatar Jun 07 '23 12:06 mathurinm

Very exciting work, team!

CamDavidsonPilon avatar Jun 07 '23 12:06 CamDavidsonPilon

@BadrMoufad has just added support for the Efron handling of ties here: https://github.com/scikit-learn-contrib/skglm/pull/159. Benchmark results are the same.

mathurinm avatar Jun 08 '23 16:06 mathurinm

I'm impressed. I'm going to have to try this library locally.

Is the following (mostly) correct?

One significant speedup comes from using an approximation to the Hessian. This approximation is valid, and it can be shown that the solver still converges to the same solution (perhaps with more iterations, but the overall cost savings remain).
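
As a rough illustration of that claim, here is a toy sketch on a strongly convex quadratic (an assumption chosen for simplicity; it is not the Cox objective and not skglm's solver). Replacing the exact Hessian `A` with a diagonal upper bound `D` gives cheaper steps that still reach the same minimizer, only in more iterations than a single exact Newton step.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(40, 5))
A = M.T @ M / 40 + 0.5 * np.eye(5)   # exact Hessian of f(w) = 0.5 w'Aw - b'w
b = rng.normal(size=5)

# Exact Newton: a single linear solve reaches the minimizer of a quadratic.
w_star = np.linalg.solve(A, b)

# Diagonal upper bound: D - A is positive semidefinite by diagonal dominance,
# so each step minimizes a separable majorizer of f.
D = np.abs(A).sum(axis=1)

w = np.zeros(5)
n_iter = 0
while np.linalg.norm(A @ w - b) > 1e-10 and n_iter < 10_000:
    w -= (A @ w - b) / D             # elementwise division, no linear solve
    n_iter += 1

print("iterations with diagonal bound:", n_iter)
print("max abs difference to Newton solution:", np.abs(w - w_star).max())
```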

CamDavidsonPilon avatar Jun 08 '23 17:06 CamDavidsonPilon

Thank you again for your interest! Here are the key improvement factors:

  • Leveraging the sparse nature of the solution with the state-of-the-art working set strategy detailed in our NeurIPS 2022 paper (Algo 1 and 2)
  • Use of a proximal Newton solver with a diagonal upper bound on the Hessian, resulting in linear computational and memory cost (skglm tutorial equation 6)
  • Efficient implementation of the Cox datafit, which achieves a linear cost for evaluating its value, gradient, and Hessian (skglm Cox implementation)
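
The last two points can be sketched in plain NumPy. This is a hedged illustration under stated assumptions, not the skglm code: the function name `cox_datafit` is hypothetical, ties are handled with the Breslow convention, and the quantities are taken with respect to the linear predictor `u = X @ w`. After one sort, the value, the gradient, and a diagonal upper bound on the Hessian all come from cumulative sums, so no n-by-n matrix is ever formed.

```python
import numpy as np


def cox_datafit(u, times, events):
    """Breslow Cox loss F(u), its gradient, and a diagonal Hessian bound (sketch)."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    order = np.argsort(times, kind="stable")        # ascending times
    u_s = u[order]
    t_s = np.asarray(times)[order]
    s_s = np.asarray(events)[order].astype(float)

    shift = u_s.max()                               # for numerical stability
    exp_u = np.exp(u_s - shift)

    # risk_sum[i] = sum of exp(u_j) over j with t_j >= t_i:
    # reverse cumulative sum, read at the first index of each tied group.
    rev_cumsum = np.cumsum(exp_u[::-1])[::-1]
    first = np.searchsorted(t_s, t_s, side="left")
    risk_sum = rev_cumsum[first]

    value = np.sum(s_s * (np.log(risk_sum) + shift - u_s)) / n

    # c[k] = sum of s_i / risk_sum[i] over i with t_i <= t_k:
    # forward cumulative sum, read at the last index of each tied group.
    last = np.searchsorted(t_s, t_s, side="right") - 1
    c = np.cumsum(s_s / risk_sum)[last]

    grad_sorted = (exp_u * c - s_s) / n
    # The exact Hessian is diag(exp_u * c) / n minus a positive semidefinite
    # matrix, so this diagonal is an upper bound on it.
    hess_diag_sorted = exp_u * c / n

    grad = np.empty(n)
    hess_diag = np.empty(n)
    grad[order] = grad_sorted                       # undo the sort
    hess_diag[order] = hess_diag_sorted
    return value, grad, hess_diag


# Small usage example on synthetic data with tied times (assumption).
rng = np.random.default_rng(0)
u_demo = rng.normal(size=30)
times_demo = rng.integers(1, 8, size=30).astype(float)
events_demo = rng.integers(0, 2, size=30)
value, grad, hess_diag = cox_datafit(u_demo, times_demo, events_demo)
print(value, np.abs(grad).max(), hess_diag.min())
```

In a proximal Newton scheme, the gradient with respect to `w` would then be `X.T @ grad`, which is also where a sparse `X` can be exploited.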

We are happy to discuss options for integrating skglm into lifelines.

Badr-MOUFAD avatar Jun 21 '23 15:06 Badr-MOUFAD