PatientLevelPrediction Prediction probabilities larger than 1 with Cox model

Describe the bug I am using the develop branch. When I run predictPlp using an existing model on new data with a pre-specified timepoint (I made sure it was smaller than the largest time at risk) the resulting values can be higher than 1. Am I doing something wrong?

Additional context In file CyclopsModels.R line 261 is commented out in favor of line 262. Could it be the other way around?

Aug 02 '22 16:08 rekkasa

hmm - that is odd. Do you know how large the values are you getting are (i.e., could they be rounding errors causing it to be a little bit over or is it a bug causing it to be way over)?

Aug 15 '22 16:08 jreps

In my case the probabilities could be quite off (even more than 2). I can create an example tomorrow, but for now my thoughts on this issue are:

predictPlp calls predictionFunction attribute of the plpModel which for a Cox model is predictCyclops (line 76, in Predict.R)
predictCyclops creates a data frame named prediction (line 245, in CyclopsModels.R) calling function predictCyclopsType
Function predictCyclopsType creates the linear predictor (lines 298-308, in CyclopsModels.R) and returns the exponential value in the case of model type "survival".
Therefore, data frame prediction of line 245 (CyclopsModels.R) is $exp(lp)$ which should be an unbounded positive number.
The final version of the value column of the prediction data frame is defined in line 262 (CyclopsModels.R) as $(1-S)\times exp(lp)$ which should also be an unbounded positive number. The line above (line 261) I think gives the probability of the event before the selected timepoint

Isn't the probability before the selected timepoint what we should be expecting from predictPlp?

Aug 15 '22 17:08 rekkasa

Some code to reproduce:

library(PatientLevelPrediction)
connectionDetails <- Eunomia::getEunomiaConnectionDetails()

Eunomia::createCohorts(
  connectionDetails = connectionDetails, 
  cdmDatabaseSchema = 'main', 
  cohortDatabaseSchema = 'main', 
  cohortTable = 'cohort'
)

databaseDetails <- PatientLevelPrediction::createDatabaseDetails(
  connectionDetails = connectionDetails, 
  cdmDatabaseSchema = 'main', 
  cdmDatabaseName = 'Eunomia', 
  cohortDatabaseSchema = 'main', 
  cohortTable = 'cohort', 
  targetId = 4,
  outcomeDatabaseSchema = 'main', 
  outcomeTable = 'cohort', 
  outcomeId = 3,
  cdmVersion = 5
)

covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = T, 
  useDemographicsAgeGroup = T, 
  useConditionGroupEraLongTerm = T, 
  useDrugGroupEraLongTerm = T, 
  endDays = -1, 
  longTermStartDays = -365
)

restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
  studyStartDate = '20000101',
  studyEndDate = '20200101',
  firstExposureOnly = T, 
  washoutPeriod = 30
)

# issue with studyStartDate/studyEndDate
restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
  firstExposureOnly = T, 
  washoutPeriod = 30
)

plpData <- PatientLevelPrediction::getPlpData(
  databaseDetails = databaseDetails, 
  covariateSettings = covariateSettings, 
  restrictPlpDataSettings = restrictPlpDataSettings
)

populationSettings <- PatientLevelPrediction::createStudyPopulationSettings(
  binary = T, 
  includeAllOutcomes = F, # best practice showed this causes bias
  firstExposureOnly = F, 
  washoutPeriod = 180, 
  removeSubjectsWithPriorOutcome = F, 
  priorOutcomeLookback = 99999, 
  requireTimeAtRisk = F, 
  riskWindowStart = 1, 
  startAnchor = 'cohort start', 
  riskWindowEnd = 365, 
  endAnchor = 'cohort start' 
)

splitSettings <- PatientLevelPrediction::createDefaultSplitSetting(
  testFraction = 0.25, 
  trainFraction = 0.75, 
  splitSeed = 123, 
  nfold = 3, 
  type = 'stratified'
)

sampleSettings <- PatientLevelPrediction::createSampleSettings(
  type = 'none'
)

featureEngineeringSettings <- PatientLevelPrediction::createFeatureEngineeringSettings(
  type = 'none'
)

preprocessSettings <- PatientLevelPrediction::createPreprocessSettings(
  minFraction = 0, 
  normalize = T, 
  removeRedundancy = T
)

logSettings <- PatientLevelPrediction::createLogSettings(verbosity = 'INFO')

executeSettings <- PatientLevelPrediction::createExecuteSettings(
  runSplitData = T, 
  runSampleData = F, 
  runfeatureEngineering = F, 
  runPreprocessData = T, 
  runModelDevelopment = T, 
  runCovariateSummary = T
)

modelSettings <- PatientLevelPrediction::setCoxModel()

runPlp <- PatientLevelPrediction::runPlp( 
  plpData = plpData, 
  outcomeId = 3, 
  analysisId = 'plp_demo', 
  analysisName = 'cox_model_testing', 
  populationSettings = populationSettings, 
  splitSettings = splitSettings, 
  sampleSettings = sampleSettings, 
  featureEngineeringSettings = featureEngineeringSettings, 
  preprocessSettings = preprocessSettings, 
  modelSettings = modelSettings, 
  logSettings = logSettings, 
  executeSettings = executeSettings, 
  saveDirectory = "~/.tmp"
)

# ------------------------------------------------------------------------------
# I need to set a beta a bit bigger so I can create larger 
# values for linear predictor
# ------------------------------------------------------------------------------
runPlp$model$model$coefficients["8003"] <- 4

prediction <- predictPlp(
  plpModel = runPlp$model,
  plpData = plpData,
  population = plpData$cohorts,
  timepoint = 365
)

max(prediction$value)

This gives usually gives me a max prediction value of around 20

Aug 16 '22 13:08 rekkasa

PatientLevelPrediction PatientLevelPrediction copied to clipboard

Prediction probabilities larger than 1 with Cox model

PatientLevelPrediction
PatientLevelPrediction copied to clipboard