
AssertionError: Times should be within the range of event times to avoid exterpolation

Open rvandewater opened this issue 7 months ago • 2 comments

Hi,

Thank you for creating this package.

I get an error when fitting a survival regression model on my own dataset (traceback below). I am following the Survival Regression with Auton-Survival notebook with the Cox proportional hazards model (code below the traceback). The dataset is preprocessed from eICU, and the maximum time value is 168 in the train, test, and validation splits.

What I tried: replacing the 168 in the validation set with 167 gives the same error. In the original example the maximum time in validation also appears to equal the maximum in training, yet no error is raised there.

Thank you for your help.

  nonnumeric_cols = [col for (col, dtype) in df.dtypes.iteritems() if dtype.name == "category" or dtype.kind not in "biuf"]

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[44], line 22
     20     # Obtain survival probabilities for validation set and compute the Integrated Brier Score 
     21     predictions_val = model.predict_survival(x_val, times)
---> 22     metric_val = survival_regression_metric('ibs', y_val, predictions_val, times, y_tr)
     23     models.append([metric_val, model])
     25 # Select the best model based on the mean metric value computed for the validation set

File ~/projects/auton-survival/auton_survival/metrics.py:215, in survival_regression_metric(metric, outcomes, predictions, times, outcomes_train, n_bootstrap, random_seed)
    211     outcomes_train = outcomes
    212     warnings.warn("You are are evaluating model performance on the \
    213 same data used to estimate the censoring distribution.")
--> 215   assert max(times) < outcomes_train.time.max(), "Times should \
    216 be within the range of event times to avoid exterpolation."
    217   assert max(times) <= outcomes.time.max(), "Times \
    218 must be within the range of event times."
    220   survival_train = util.Surv.from_dataframe('event', 'time', outcomes_train)

AssertionError: Times should be within the range of event times to avoid exterpolation.
from auton_survival.estimators import SurvivalModel
from auton_survival.metrics import survival_regression_metric
from sklearn.model_selection import ParameterGrid

# Define parameters for tuning the model
param_grid = {'l2' : [1e-3, 1e-4]}
params = ParameterGrid(param_grid)

# Define the times for model evaluation
times = np.quantile(y_tr['time'][y_tr['event']==1], np.linspace(0.1, 1, 10)).tolist()

# Perform hyperparameter tuning 
models = []
for param in params:
    model = SurvivalModel('cph', random_seed=2, l2=param['l2'])
    
    # The fit method is called to train the model
    model.fit(x_tr, y_tr)

    # Obtain survival probabilities for validation set and compute the Integrated Brier Score 
    predictions_val = model.predict_survival(x_val, times)
    metric_val = survival_regression_metric('ibs', y_val, predictions_val, times, y_tr)
    models.append([metric_val, model])
    
# Select the best model based on the mean metric value computed for the validation set
metric_vals = [i[0] for i in models]
first_min_idx = metric_vals.index(min(metric_vals))
model = models[first_min_idx][1]

rvandewater avatar Nov 22 '23 19:11 rvandewater

Hi @rvandewater, thanks for contributing to auton-survival 🙂

Given a DeepCoxPH model trained on a survival dataset X_train, Y_train ~ features, (events, times), the minimum and maximum admissible times for computing survival_regression_metric are, as you noted:

min_time = min(Y_train.times.values) + 1
max_time = max(Y_train.times.values) - 1
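
To make the bound concrete, here is a minimal sketch of why the evaluation grid from the issue trips the first assertion in the traceback. The arrays are hypothetical toy data, not the eICU set, and the second half is only one plausible explanation for why the original notebook does not fail (an assumption, not checked against the notebook's data).

```python
import numpy as np

# Hypothetical toy training outcomes: an event is observed at the latest time (168).
time_tr = np.array([12, 40, 75, 90, 120, 150, 168, 168])
event_tr = np.array([1, 1, 0, 1, 1, 0, 1, 0])

# Same evaluation grid as in the issue: deciles of the training event times.
times = np.quantile(time_tr[event_tr == 1], np.linspace(0.1, 1, 10)).tolist()

# The 1.0 quantile is the largest event time. When that equals the largest
# training time, the strict check `max(times) < outcomes_train.time.max()`
# from the traceback cannot hold.
last_time = max(times)
passes_check = last_time < time_tr.max()

# Assumed contrast case: the latest training time is a censored observation
# that lies beyond the last event, so the 1.0 quantile stays below the maximum.
time_censored_last = np.array([12, 40, 75, 90, 120, 150, 160, 168])
event_censored_last = np.array([1, 1, 0, 1, 1, 0, 1, 0])
times_ok = np.quantile(
    time_censored_last[event_censored_last == 1], np.linspace(0.1, 1, 10)
)
passes_ok = times_ok.max() < time_censored_last.max()

print(passes_check, passes_ok)
```

In the first case the check fails because an event sits exactly at the maximum time; in the second it passes because the maximum belongs to a censored row.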

To avoid this problem you have three options:

  1. Apply an upper cut-off of max_time to your times
  2. Drop the last decile(s)
  3. Circumvent the problem and compute a static metric (shouldn't differ much from your average of time-dependent metrics) e.g. with sksurv.metrics.concordance_index_censored:
from sksurv import metrics
from auton_survival import DeepCoxPH
import torch

model = DeepCoxPH()

# ... train model ...

# Use model.torch_model[0] to access the `torch.nn.Module` that computes risk scores for DeepCox
# A better (and retro-compatible) API to access the PyTorch module will be available in the next updates 
with torch.inference_mode():
  model.torch_model[0].eval()
  
  X_test, Y_test = get_test_data()  

  risk_scores = model.torch_model[0](X_test)  

  # concordance_index_censored returns a tuple; its first element is the c-index
  c_index, *_ = metrics.concordance_index_censored(
      Y_test.events.values.astype(bool),
      Y_test.times.values,
      risk_scores.squeeze().cpu().numpy(),
  )
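
For options 1 and 2, a rough sketch of how the evaluation grid could be adjusted before calling survival_regression_metric. The toy arrays are hypothetical, and `max_time = max - 1` follows the convention in this comment, which assumes integer-valued times.

```python
import numpy as np

# Hypothetical toy training outcomes with an event at the latest time.
time_tr = np.array([12, 40, 75, 90, 120, 150, 168, 168])
event_tr = np.array([1, 1, 0, 1, 1, 0, 1, 0])
event_times = time_tr[event_tr == 1]

# Option 2: drop the last decile, i.e. use quantiles 0.1 .. 0.9 only.
times_dropped = np.quantile(event_times, np.linspace(0.1, 0.9, 9))

# Option 1: keep all ten deciles but cap them at max_time.
max_time = time_tr.max() - 1  # assumes integer-valued times
deciles = np.quantile(event_times, np.linspace(0.1, 1, 10))
# np.unique removes duplicates that capping may create and keeps the grid sorted.
times_capped = np.unique(np.minimum(deciles, max_time))

# Both grids now sit strictly below the largest training time.
print(times_dropped.max() < time_tr.max(), times_capped.max() < time_tr.max())
```

Either grid can then be passed as `times` to both `predict_survival` and `survival_regression_metric`, so that predictions and metric are evaluated at the same points.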

I'm not sure if this answers your question; let me know if you need anything else.

NB: you can enable syntax highlighting by writing "```python" instead of " ```" at the start of a code block.

matteo4diani avatar Nov 24 '23 11:11 matteo4diani

Hi @matteo4diani, thanks for your answer. I believe the manual cut-off you suggested is not even needed; instead, I replaced this line:

times = np.quantile(y_tr['time'][y_tr['event']==1], np.linspace(0.1, 1, 10)).tolist()

With this line:

times = np.quantile(y_val['time'][y_val['event']==1], np.linspace(0.1, 1, 10)).tolist()

The code validates the evaluation times against the training data. I am not sure this is intended, since according to https://autonlab.org/auton-survival/metrics.html the times should probably be based on the validation or test set rather than the training set:

times : np.array The time points at which to compute metric value(s)
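
A small sketch of when validation-based quantiles pass the two assertions shown in the traceback. `times_are_admissible` is a hypothetical helper written for illustration, not part of auton-survival, and the arrays are toy data. The validation grid passes here only because the last validation event (140) lies below the training maximum (168); if a validation event sat at 168, the first assertion would still fail.

```python
import numpy as np

def times_are_admissible(times, time_train, time_eval):
    """Hypothetical helper mirroring the two assertions in the traceback."""
    latest = max(times)
    return bool(latest < np.max(time_train) and latest <= np.max(time_eval))

# Hypothetical toy outcomes; both splits share the same maximum time (168).
time_tr = np.array([12, 40, 75, 90, 120, 150, 168, 168])
event_tr = np.array([1, 1, 0, 1, 1, 0, 1, 0])
time_val = np.array([20, 55, 80, 110, 140, 168])
event_val = np.array([1, 1, 0, 1, 1, 0])

quantiles = np.linspace(0.1, 1, 10)
times_from_train = np.quantile(time_tr[event_tr == 1], quantiles).tolist()
times_from_val = np.quantile(time_val[event_val == 1], quantiles).tolist()

ok_train = times_are_admissible(times_from_train, time_tr, time_val)
ok_val = times_are_admissible(times_from_val, time_tr, time_val)
print(ok_train, ok_val)
```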

rvandewater avatar Nov 29 '23 19:11 rvandewater