PatientLevelPrediction icon indicating copy to clipboard operation
PatientLevelPrediction copied to clipboard

Prediction probabilities larger than 1 with Cox model

Open rekkasa opened this issue 2 years ago • 3 comments

Describe the bug I am using the develop branch. When I run predictPlp using an existing model on new data with a pre-specified timepoint (I made sure it was smaller than the largest time at risk) the resulting values can be higher than 1. Am I doing something wrong?

Additional context In file CyclopsModels.R line 261 is commented out in favor of line 262. Could it be the other way around?

rekkasa avatar Aug 02 '22 16:08 rekkasa

hmm - that is odd. Do you know how large the values are you getting are (i.e., could they be rounding errors causing it to be a little bit over or is it a bug causing it to be way over)?

jreps avatar Aug 15 '22 16:08 jreps

In my case the probabilities could be quite off (even more than 2). I can create an example tomorrow, but for now my thoughts on this issue are:

  1. predictPlp calls predictionFunction attribute of the plpModel which for a Cox model is predictCyclops (line 76, in Predict.R)
  2. predictCyclops creates a data frame named prediction (line 245, in CyclopsModels.R) calling function predictCyclopsType
  3. Function predictCyclopsType creates the linear predictor (lines 298-308, in CyclopsModels.R) and returns the exponential value in the case of model type "survival".
  4. Therefore, data frame prediction of line 245 (CyclopsModels.R) is $exp(lp)$ which should be an unbounded positive number.
  5. The final version of the value column of the prediction data frame is defined in line 262 (CyclopsModels.R) as $(1-S)\times exp(lp)$ which should also be an unbounded positive number. The line above (line 261) I think gives the probability of the event before the selected timepoint

Isn't the probability before the selected timepoint what we should be expecting from predictPlp?

rekkasa avatar Aug 15 '22 17:08 rekkasa

Some code to reproduce:

library(PatientLevelPrediction)
connectionDetails <- Eunomia::getEunomiaConnectionDetails()

Eunomia::createCohorts(
  connectionDetails = connectionDetails, 
  cdmDatabaseSchema = 'main', 
  cohortDatabaseSchema = 'main', 
  cohortTable = 'cohort'
)

databaseDetails <- PatientLevelPrediction::createDatabaseDetails(
  connectionDetails = connectionDetails, 
  cdmDatabaseSchema = 'main', 
  cdmDatabaseName = 'Eunomia', 
  cohortDatabaseSchema = 'main', 
  cohortTable = 'cohort', 
  targetId = 4,
  outcomeDatabaseSchema = 'main', 
  outcomeTable = 'cohort', 
  outcomeId = 3,
  cdmVersion = 5
)

covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = T, 
  useDemographicsAgeGroup = T, 
  useConditionGroupEraLongTerm = T, 
  useDrugGroupEraLongTerm = T, 
  endDays = -1, 
  longTermStartDays = -365
)

restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
  studyStartDate = '20000101',
  studyEndDate = '20200101',
  firstExposureOnly = T, 
  washoutPeriod = 30
)

# issue with studyStartDate/studyEndDate
restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
  firstExposureOnly = T, 
  washoutPeriod = 30
)

plpData <- PatientLevelPrediction::getPlpData(
  databaseDetails = databaseDetails, 
  covariateSettings = covariateSettings, 
  restrictPlpDataSettings = restrictPlpDataSettings
)

populationSettings <- PatientLevelPrediction::createStudyPopulationSettings(
  binary = T, 
  includeAllOutcomes = F, # best practice showed this causes bias
  firstExposureOnly = F, 
  washoutPeriod = 180, 
  removeSubjectsWithPriorOutcome = F, 
  priorOutcomeLookback = 99999, 
  requireTimeAtRisk = F, 
  riskWindowStart = 1, 
  startAnchor = 'cohort start', 
  riskWindowEnd = 365, 
  endAnchor = 'cohort start' 
)

splitSettings <- PatientLevelPrediction::createDefaultSplitSetting(
  testFraction = 0.25, 
  trainFraction = 0.75, 
  splitSeed = 123, 
  nfold = 3, 
  type = 'stratified'
)

sampleSettings <- PatientLevelPrediction::createSampleSettings(
  type = 'none'
)

featureEngineeringSettings <- PatientLevelPrediction::createFeatureEngineeringSettings(
  type = 'none'
)

preprocessSettings <- PatientLevelPrediction::createPreprocessSettings(
  minFraction = 0, 
  normalize = T, 
  removeRedundancy = T
)

logSettings <- PatientLevelPrediction::createLogSettings(verbosity = 'INFO')

executeSettings <- PatientLevelPrediction::createExecuteSettings(
  runSplitData = T, 
  runSampleData = F, 
  runfeatureEngineering = F, 
  runPreprocessData = T, 
  runModelDevelopment = T, 
  runCovariateSummary = T
)

modelSettings <- PatientLevelPrediction::setCoxModel()

runPlp <- PatientLevelPrediction::runPlp( 
  plpData = plpData, 
  outcomeId = 3, 
  analysisId = 'plp_demo', 
  analysisName = 'cox_model_testing', 
  populationSettings = populationSettings, 
  splitSettings = splitSettings, 
  sampleSettings = sampleSettings, 
  featureEngineeringSettings = featureEngineeringSettings, 
  preprocessSettings = preprocessSettings, 
  modelSettings = modelSettings, 
  logSettings = logSettings, 
  executeSettings = executeSettings, 
  saveDirectory = "~/.tmp"
)

# ------------------------------------------------------------------------------
# I need to set a beta a bit bigger so I can create larger 
# values for linear predictor
# ------------------------------------------------------------------------------
runPlp$model$model$coefficients["8003"] <- 4

prediction <- predictPlp(
  plpModel = runPlp$model,
  plpData = plpData,
  population = plpData$cohorts,
  timepoint = 365
)

max(prediction$value)

This gives usually gives me a max prediction value of around 20

rekkasa avatar Aug 16 '22 13:08 rekkasa