ENH improve speed of fitting Cox model by relying on the fast skglm solver
I am one of the developers of skglm, a Python package that improves scikit-learn for Generalized Linear Models by providing more functionality and faster solvers. We have recently worked on a solver for the Cox estimator, for which lifelines provides a reference implementation. Preliminary results indicate speedups of up to 500x when using skglm.

Here is a notebook to illustrate the performance and to showcase the scikit-learn-like API of skglm. Also, here are the results of a complete benchmark with Benchopt, and the link to the benchmark repo to reproduce it.
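For a quick feel of that API before opening the notebook, here is a minimal sketch of the composable datafit / penalty / solver design. The class and attribute names and the assumed two-column (time, event) layout of `y` are assumptions on this sketch's part; the linked notebook remains the reference.

```python
# Hedged sketch only: class names and the (time, event) layout of y are assumptions;
# see the linked notebook for the authoritative usage.
import numpy as np
from skglm import GeneralizedLinearEstimator
from skglm.datafits import Cox
from skglm.penalties import L1
from skglm.solvers import ProxNewton

rng = np.random.default_rng(0)
n_samples, n_features = 500, 2_000                    # wide design matrix
X = rng.standard_normal((n_samples, n_features))
times = rng.exponential(scale=1.0, size=n_samples)    # observed durations
events = rng.integers(0, 2, size=n_samples)           # 1 = event, 0 = censored
y = np.column_stack([times, events]).astype(float)    # assumed layout

model = GeneralizedLinearEstimator(
    datafit=Cox(),              # Cox partial-likelihood datafit
    penalty=L1(alpha=0.01),     # sparsity-inducing penalty
    solver=ProxNewton(),        # proximal Newton solver
)
model.fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```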
In addition, some skglm features might be useful to the users of lifelines:

- support for design matrices with more columns than rows (which may cause issues in lifelines)
- support for sparse design matrices (currently not supported in lifelines; see the sketch below)
- immediate extension to other penalizers such as weighted L1, non-convex regularizers, the group Lasso penalty, etc.
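As a companion to the sketch above, the sparse case might look as follows, under the same assumed class names; whether CSC is the preferred sparse format is also an assumption.

```python
# Hedged sketch of the sparse-design-matrix use case (same assumed API as above).
import numpy as np
from scipy import sparse
from skglm import GeneralizedLinearEstimator
from skglm.datafits import Cox
from skglm.penalties import L1
from skglm.solvers import ProxNewton

rng = np.random.default_rng(0)
n_samples, n_features = 500, 10_000
X = sparse.random(n_samples, n_features, density=0.01, format="csc", random_state=0)
y = np.column_stack([
    rng.exponential(size=n_samples),        # durations
    rng.integers(0, 2, size=n_samples),     # event indicators
]).astype(float)

model = GeneralizedLinearEstimator(datafit=Cox(), penalty=L1(alpha=0.01),
                                   solver=ProxNewton())
model.fit(X, y)                             # X stays sparse end-to-end
```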
Based on this, we'd like to discuss the potential integration of the skglm solver into lifelines for fitting the Cox estimator.
A noteworthy point is that skglm relies heavily on numba JIT compilation, which may introduce a slight overhead during the initial model fit. However, this inconvenience is outweighed by the advantages gained, namely handling datasets with thousands of features and samples within a reasonable time.
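To illustrate this point independently of skglm itself, the snippet below shows the typical numba warm-up pattern: the first call to a JIT-compiled function includes compilation time, while subsequent calls run at compiled speed. The first `.fit()` of a numba-backed estimator behaves similarly.

```python
# Minimal illustration of the numba warm-up effect described above (not skglm code).
import time
import numpy as np
from numba import njit

@njit
def weighted_sum(x, w):
    # Simple JIT-compiled loop standing in for any numba-backed numerical kernel.
    total = 0.0
    for i in range(x.shape[0]):
        total += w[i] * x[i]
    return total

x = np.random.randn(1_000_000)
w = np.random.randn(1_000_000)

t0 = time.perf_counter(); weighted_sum(x, w); t1 = time.perf_counter()
t2 = time.perf_counter(); weighted_sum(x, w); t3 = time.perf_counter()
print(f"first call (includes compilation): {t1 - t0:.3f}s, second call: {t3 - t2:.4f}s")
```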
We'd be happy to have your feedback on this.
Also pinging @Badr-Moufad @PABannier @QB3
Wow, that's very impressive! One thing I think you should try is to bin the times into buckets (as tied times are common in survival datasets, since we are often rounding to months, days, hours, etc.). The Cox model works by sorting times, but when there are ties, it has to use a technique to handle them. There are a few techniques to handle ties: random, Efron, Breslow, and exact (the most accurate, but slowest). Lifelines uses Efron's method, as its accuracy-to-speed tradeoff is good.
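As a concrete illustration of the binning suggestion (with made-up durations), rounding continuous durations to a coarser unit immediately produces many tied times, which is exactly the regime where the choice of tie-handling method matters:

```python
# Binning durations (here, days rounded down to whole months) creates tied event
# times; values and units are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
durations = rng.exponential(scale=300.0, size=1_000)   # durations in days

binned = np.floor(durations / 30.0)                    # bin into whole months
n_ties = len(binned) - len(np.unique(binned))
print(f"unique times before binning: {len(np.unique(durations))}, "
      f"after binning to months: {len(np.unique(binned))} ({n_ties} tied rows)")
```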
Indeed, we are working on adding support for Efron's handling of ties here: https://github.com/scikit-learn-contrib/skglm/pull/159; it should be merged shortly.
Very exciting work, team!
@BadrMoufad has just added support for the Efron handling of ties here: https://github.com/scikit-learn-contrib/skglm/pull/159. Benchmark results are the same.
I'm impressed. I'm going to have to try this library locally.
Is the following (mostly) correct?
One significant speed-up comes from using an approximation to the Hessian. This approximation is valid to use, and it can be shown that the solver will still converge to the same solution (albeit with perhaps more iterations, but the cost savings are still there).
Thank you again for your interest! Here are the key improvement factors:

- leveraging the sparse nature of the solution with a state-of-the-art working set strategy, detailed in our NeurIPS 2022 paper (Algorithms 1 and 2)
- use of a Proximal Newton solver with a diagonal upper bound on the Hessian, resulting in linear computational and memory cost (skglm tutorial, equation 6); a sketch of this idea is given after the list
- efficient implementation of the Cox datafit, which achieves a linear cost for evaluating its value, gradient, and Hessian (skglm Cox implementation)
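To make the second point (and the Hessian question above) more concrete, here is a hedged sketch of why replacing the Hessian with a diagonal upper bound preserves convergence. The notation is generic rather than copied from the skglm tutorial, and it assumes the diagonal bound holds globally; the actual solver adds a working-set strategy on top of this.

```latex
% Penalized Cox objective: datafit F composed with the design matrix X, plus penalty g.
\[
  \min_{w}\ \Phi(w) := F(Xw) + g(w)
\]
% A proximal Newton step at w^k minimizes a quadratic model of F(Xw) plus g.
% Replacing the Hessian \nabla^2 F(Xw^k) with a diagonal D such that
% D \succeq \nabla^2 F(z) for all z (a global upper bound) gives the surrogate step
\[
  w^{k+1} \in \arg\min_{w}\ F(Xw^k) + \nabla F(Xw^k)^\top X (w - w^k)
  + \tfrac{1}{2}\, (w - w^k)^\top X^\top D\, X (w - w^k) + g(w).
\]
% Because D globally upper-bounds the Hessian, the quadratic term upper-bounds F(Xw),
% so the surrogate majorizes \Phi and coincides with it at w = w^k: each step decreases
% \Phi, and the iterates converge to the same minimizer, possibly in more iterations.
% Since D is diagonal, storing and applying it costs O(n) instead of O(n^2).
```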
We are happy to discuss options for integrating skglm into lifelines.