scikit-survival
scikit-survival copied to clipboard
Add predict helpers such as: predict_expectation, predict_percentile (similar to lifelines)
Hi! I really like your package implementation and its compatibility with sklearn, but one thing that is refraining us from using it in our solution is the lack of the above predict function implementations (particularly predict_expectation). In our case, we need to know the expected number of days a given client is expected to remain a client. Ranking observations by the obtained "arbitrary" score will not work.
Given that the above function is based on the estimated survival functions it should be possible to implement (as long as I read the lifelines code correctly):
def predict_expectation(self, X: DataFrame, conditional_after: Optional[ndarray] = None) -> pd.Series:
r"""
Compute the expected lifetime, :math:`E[T]`, using covariates X. This algorithm to compute the expectation is
to use the fact that :math:`E[T] = \int_0^\inf P(T > t) dt = \int_0^\inf S(t) dt`. To compute the integral, we use the trapezoidal rule to approximate the integral.
Caution
--------
If the survival function doesn't converge to 0, then the expectation is really infinity and the returned
values are meaningless/too large. In that case, using ``predict_median`` or ``predict_percentile`` would be better.
Parameters
----------
X: numpy array or DataFrame
a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns
can be in any order. If a numpy array, columns must be in the
same order as the training data.
conditional_after: iterable, optional
Must be equal is size to X.shape[0] (denoted `n` above). An iterable (array, list, series) of possibly non-zero values that represent how long the
subject has already lived for. Ex: if :math:`T` is the unknown event time, then this represents :math:`s` in
:math:`T | T > s`. This is useful for knowing the *remaining* hazard/survival of censored subjects.
The new timeline is the remaining duration of the subject, i.e. normalized back to starting at 0.
Notes
-----
If X is a DataFrame, the order of the columns do not matter. But
if X is an array, then the column ordering is assumed to be the
same as the training dataset.
See Also
--------
predict_median
predict_percentile
"""
subjects = utils._get_index(X)
v = self.predict_survival_function(X, conditional_after=conditional_after)[subjects]
return pd.Series(trapz(v.values.T, v.index), index=subjects)
Reference to the lifelines implementation:
- https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html#lifelines.fitters.coxph_fitter.SemiParametricPHFitter.predict_expectation
- https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/fitters/coxph_fitter.py
It is correct that the expected survival time can be estimated with the integral under the survival function. However, this only leads to a reasonable estimate if the last observed time point corresponds to an event such that the survival function is zero thereafter. If that's not the case, the expected survival time is undefined without making any additional assumptions.
Typical assumptions are (i) assuming the remaining instances experience an event directly after the last observed time point (S(t) becomes zero), (ii) assuming the remaining instances never experience an event (S(t) remains constant), (iii) assuming a parametric form, e.g. exponential curve, at the tail. Finally, estimating the restricted mean survival time, i.e., an expectation over a pre-defined interval rather than all of time, is from a practical perspective the most reasonable.
I don't know how lifelines handles this, but I think, when implementing this, it is important to make it explicit that estimating the mean survival time can be tricky and avoid giving an answer for which the underlying assumptions are not clear.
Hi @sebp and thanks for coming back on this! Yes, there's a note in the lifelines
documentation referring to what you have mentioned:
In general, I would consider this a really desirable feature in one of the future releases. I find your package much better integrated with the sklearn ecosystem, but the lack of these prediction helpers prevents us from using it. An additional point in favour of implementing this is that when I check other package implementations, such as the quite famous survival
package in R, it also provides an option to predict the expected survival time.
I'd be happy to contribute here with a PR 😉
If you can contribute with a PR, that would be even better.
Any updates on this issue for implementation? @sebp @konradsemsch
@PragyanSubedi I did not have time to look into it, I'm afraid.
@PragyanSubedi - I've started looking into that. I should have a first PR draft in a week.
@sebp - I added a basic first PR: https://github.com/sebp/scikit-survival/pull/196. Please take a look at it and let me know if you agree. Once we have that, I'd move further with implementing tests and adding this helper to other models as well.
Hi, @sebp : I wonder why not considering a predict_hazard_function in addition to predict_cumulative_hazard_function ? Or is there an obvious and simple way to get it, that I'm missing... Thank you very much for all the great work!
By definition, it would be the derivative of the cumulative hazard function.
In general, directly estimating the hazard function is tricky.
I'm curious, what would you like to use the hazard function for?
Hi, @konradsemsch and @sebp are you still working with this predict_expectation function? Would be really interested about it.
@tomikrogerus I'm currently not working on it. PR #196 is inactive too.