PatientLevelPrediction
PatientLevelPrediction copied to clipboard
Prediction probabilities larger than 1 with Cox model
Describe the bug
I am using the develop
branch. When I run predictPlp
using an existing model on new data with a pre-specified timepoint (I made sure it was smaller than the largest time at risk) the resulting values can be higher than 1. Am I doing something wrong?
Additional context
In file CyclopsModels.R
line 261 is commented out in favor of line 262. Could it be the other way around?
hmm - that is odd. Do you know how large the values are you getting are (i.e., could they be rounding errors causing it to be a little bit over or is it a bug causing it to be way over)?
In my case the probabilities could be quite off (even more than 2). I can create an example tomorrow, but for now my thoughts on this issue are:
-
predictPlp
callspredictionFunction
attribute of theplpModel
which for a Cox model ispredictCyclops
(line 76, in Predict.R) -
predictCyclops
creates a data frame namedprediction
(line 245, in CyclopsModels.R) calling functionpredictCyclopsType
- Function
predictCyclopsType
creates the linear predictor (lines 298-308, in CyclopsModels.R) and returns the exponential value in the case of model type"survival"
. - Therefore, data frame
prediction
of line 245 (CyclopsModels.R) is $exp(lp)$ which should be an unbounded positive number. - The final version of the
value
column of theprediction
data frame is defined in line 262 (CyclopsModels.R) as $(1-S)\times exp(lp)$ which should also be an unbounded positive number. The line above (line 261) I think gives the probability of the event before the selected timepoint
Isn't the probability before the selected timepoint what we should be expecting from predictPlp
?
Some code to reproduce:
library(PatientLevelPrediction)
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
Eunomia::createCohorts(
connectionDetails = connectionDetails,
cdmDatabaseSchema = 'main',
cohortDatabaseSchema = 'main',
cohortTable = 'cohort'
)
databaseDetails <- PatientLevelPrediction::createDatabaseDetails(
connectionDetails = connectionDetails,
cdmDatabaseSchema = 'main',
cdmDatabaseName = 'Eunomia',
cohortDatabaseSchema = 'main',
cohortTable = 'cohort',
targetId = 4,
outcomeDatabaseSchema = 'main',
outcomeTable = 'cohort',
outcomeId = 3,
cdmVersion = 5
)
covariateSettings <- FeatureExtraction::createCovariateSettings(
useDemographicsGender = T,
useDemographicsAgeGroup = T,
useConditionGroupEraLongTerm = T,
useDrugGroupEraLongTerm = T,
endDays = -1,
longTermStartDays = -365
)
restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
studyStartDate = '20000101',
studyEndDate = '20200101',
firstExposureOnly = T,
washoutPeriod = 30
)
# issue with studyStartDate/studyEndDate
restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
firstExposureOnly = T,
washoutPeriod = 30
)
plpData <- PatientLevelPrediction::getPlpData(
databaseDetails = databaseDetails,
covariateSettings = covariateSettings,
restrictPlpDataSettings = restrictPlpDataSettings
)
populationSettings <- PatientLevelPrediction::createStudyPopulationSettings(
binary = T,
includeAllOutcomes = F, # best practice showed this causes bias
firstExposureOnly = F,
washoutPeriod = 180,
removeSubjectsWithPriorOutcome = F,
priorOutcomeLookback = 99999,
requireTimeAtRisk = F,
riskWindowStart = 1,
startAnchor = 'cohort start',
riskWindowEnd = 365,
endAnchor = 'cohort start'
)
splitSettings <- PatientLevelPrediction::createDefaultSplitSetting(
testFraction = 0.25,
trainFraction = 0.75,
splitSeed = 123,
nfold = 3,
type = 'stratified'
)
sampleSettings <- PatientLevelPrediction::createSampleSettings(
type = 'none'
)
featureEngineeringSettings <- PatientLevelPrediction::createFeatureEngineeringSettings(
type = 'none'
)
preprocessSettings <- PatientLevelPrediction::createPreprocessSettings(
minFraction = 0,
normalize = T,
removeRedundancy = T
)
logSettings <- PatientLevelPrediction::createLogSettings(verbosity = 'INFO')
executeSettings <- PatientLevelPrediction::createExecuteSettings(
runSplitData = T,
runSampleData = F,
runfeatureEngineering = F,
runPreprocessData = T,
runModelDevelopment = T,
runCovariateSummary = T
)
modelSettings <- PatientLevelPrediction::setCoxModel()
runPlp <- PatientLevelPrediction::runPlp(
plpData = plpData,
outcomeId = 3,
analysisId = 'plp_demo',
analysisName = 'cox_model_testing',
populationSettings = populationSettings,
splitSettings = splitSettings,
sampleSettings = sampleSettings,
featureEngineeringSettings = featureEngineeringSettings,
preprocessSettings = preprocessSettings,
modelSettings = modelSettings,
logSettings = logSettings,
executeSettings = executeSettings,
saveDirectory = "~/.tmp"
)
# ------------------------------------------------------------------------------
# I need to set a beta a bit bigger so I can create larger
# values for linear predictor
# ------------------------------------------------------------------------------
runPlp$model$model$coefficients["8003"] <- 4
prediction <- predictPlp(
plpModel = runPlp$model,
plpData = plpData,
population = plpData$cohorts,
timepoint = 365
)
max(prediction$value)
This gives usually gives me a max prediction value of around 20