CohortMethod
MetaData class proposal
Hi @schuemie,
As discussed, here is a proposal for the MetaData class for CohortMethod. I've implemented a first version of the class in R6.
The code is available in the R/R6-MetaData.R file, on the R6-proposal branch.
In this post I will go through the implementation bit by bit.
Class definition
Public
Fields
The following fields are specified:
targetId = 0,
comparatorId = 0,
studyStartDate = "",
studyEndDate = "",
attrition = NULL,
outcomeIds = NULL,
populationSize = 0,
deletedRedundantCovariateIds = NULL,
deletedInfrequentCovariateIds = NULL,
deletedRedundantCovariateIdsForOutcomeModel = NULL,
deletedInfrequentCovariateIdsForOutcomeModel = NULL,
psModelCoef = NULL,
psModelPriorVariance = NULL,
psError = "",
psHighCorrelation = NULL,
estimator = "att",
targetId, comparatorId, studyStartDate, and studyEndDate are required when initializing an instance of the MetaData class.
It is not yet entirely clear to me which fields can be private, although I would prefer to store fields privately. Right now I'm thinking of the fields required for initialization.
Sources of fields: targetId, comparatorId, studyStartDate, studyEndDate, attrition, outcomeIds, populationSize, deletedRedundantCovariateIds, deletedInfrequentCovariateIds, deletedRedundantCovariateIdsForOutcomeModel, deletedInfrequentCovariateIdsForOutcomeModel, psModelCoef, psModelPriorVariance, psError, psHighCorrelation, estimator
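One way to store fields privately while still letting callers read them is an R6 active binding that rejects assignment. The following is a minimal sketch under stated assumptions: the class name MetaDataSketch, the dot-prefixed private names, and the error messages are hypothetical and are not taken from the R6-proposal branch. Only two fields are shown.

```r
library(R6)

# Hypothetical sketch: MetaDataSketch and its active bindings are assumptions,
# not part of the R6-proposal branch.
MetaDataSketch <- R6Class(
  classname = "MetaDataSketch",
  public = list(
    initialize = function(targetId, comparatorId) {
      private$.targetId <- targetId
      private$.comparatorId <- comparatorId
      invisible(self)
    }
  ),
  active = list(
    # Active bindings are read like fields; rejecting a supplied value
    # makes them read-only.
    targetId = function(value) {
      if (!missing(value)) stop("targetId is read-only", call. = FALSE)
      private$.targetId
    },
    comparatorId = function(value) {
      if (!missing(value)) stop("comparatorId is read-only", call. = FALSE)
      private$.comparatorId
    }
  ),
  private = list(
    .targetId = 0,
    .comparatorId = 0
  )
)

metaData <- MetaDataSketch$new(targetId = 1, comparatorId = 2)
metaData$targetId

# Assignment fails because the binding refuses new values.
assignment <- try(metaData$targetId <- 3, silent = TRUE)
```

With this layout the values can only be set through the constructor, while metaData$targetId keeps working for readers of the object.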
Methods
initialize
initialize = function(targetId, comparatorId, studyStartDate, studyEndDate) {
self$targetId <- targetId
self$comparatorId <- comparatorId
self$studyStartDate <- studyStartDate
self$studyEndDate <- studyEndDate
private$formatStudyDates()
self$validate()
return(invisible(self))
},
Initializer method used when MetaData$new() is called. The validate() method is called when a new object is initialized.
validate
validate = function() {
errorMessages <- checkmate::makeAssertCollection()
checkmate::assertInt(self$targetId, add = errorMessages)
checkmate::assertInt(self$comparatorId, add = errorMessages)
checkmate::assertCharacter(self$studyStartDate, len = 1, add = errorMessages)
checkmate::assertCharacter(self$studyEndDate, len = 1, add = errorMessages)
# Pass add = errorMessages to every check so all failures are reported together.
checkmate::assertDataFrame(self$attrition, null.ok = TRUE, add = errorMessages)
# ID fields can hold several IDs, so check them as integer-like vectors.
checkmate::assertIntegerish(self$outcomeIds, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
checkmate::assertInt(self$populationSize, lower = 0, add = errorMessages)
checkmate::assertIntegerish(self$deletedRedundantCovariateIds, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
checkmate::assertIntegerish(self$deletedInfrequentCovariateIds, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
checkmate::assertIntegerish(self$deletedRedundantCovariateIdsForOutcomeModel, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
checkmate::assertIntegerish(self$deletedInfrequentCovariateIdsForOutcomeModel, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
checkmate::assertNumeric(self$psModelCoef, null.ok = TRUE, add = errorMessages)
checkmate::assertNumeric(self$psModelPriorVariance, null.ok = TRUE, add = errorMessages)
checkmate::assertCharacter(self$psError, add = errorMessages)
checkmate::assertDataFrame(self$psHighCorrelation, null.ok = TRUE, add = errorMessages)
checkmate::assertChoice(self$estimator, c("ate", "att", "ato"), add = errorMessages)
checkmate::reportAssertions(collection = errorMessages)
return(invisible(self))
},
Validation method that validates each field (taken from: DataLoadingSaving.R, psFunctions.R).
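To illustrate how the checkmate assert collection behaves, here is a small standalone sketch. The variable names and the invalid values are made up for the example. Checks that receive add = errorMessages are collected, whereas a check without add stops at the first failure.

```r
library(checkmate)

# Illustrative invalid inputs (assumptions for this example only).
targetId <- "a"
estimator <- "xyz"

# Collect every failed check instead of stopping at the first one.
errorMessages <- makeAssertCollection()
assertInt(targetId, add = errorMessages)
assertChoice(estimator, c("ate", "att", "ato"), add = errorMessages)

# Both failures are stored in the collection.
collected <- errorMessages$getMessages()

# reportAssertions() raises a single error listing all collected messages.
result <- try(reportAssertions(collection = errorMessages), silent = TRUE)
```

The benefit for validate() is that a user who passes several invalid fields sees all problems in one error rather than fixing them one at a time.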
getMetaData
getMetaData = function() {
return(list(
targetId = self$targetId,
comparatorId = self$comparatorId,
studyStartDate = self$studyStartDate,
studyEndDate = self$studyEndDate,
attrition = self$attrition,
outcomeIds = self$outcomeIds,
populationSize = self$populationSize,
deletedRedundantCovariateIds = self$deletedRedundantCovariateIds,
deletedInfrequentCovariateIds = self$deletedInfrequentCovariateIds,
deletedRedundantCovariateIdsForOutcomeModel = self$deletedRedundantCovariateIdsForOutcomeModel,
deletedInfrequentCovariateIdsForOutcomeModel = self$deletedInfrequentCovariateIdsForOutcomeModel,
psModelCoef = self$psModelCoef,
psModelPriorVariance = self$psModelPriorVariance,
psError = self$psError,
psHighCorrelation = self$psHighCorrelation,
estimator = self$estimator
))
},
Method to get all specified fields returned in a list. Individual public fields can be obtained like so:
metaData <- MetaData$new(targetId = 1, comparatorId = 2, studyStartDate = "", studyEndDate = "")
# Get psError
metaData$psError
print (Overload)
print = function(x, ...) {
writeLines(paste("Class:", paste0(class(self), collapse = " ")))
writeLines(paste("Target ID: ", self$targetId))
writeLines(paste("Comparator ID: ", self$comparatorId))
writeLines(paste("Study Start Date: ", self$studyStartDate))
writeLines(paste("Study End Date: ", self$studyEndDate))
writeLines(paste("Attrition: ", dim(self$attrition)))
writeLines(paste("Number of Outcome IDs: ", length(self$outcomeIds)))
writeLines(paste("Population size: ", self$populationSize))
writeLines(paste("Number of redundant covariate IDs deleted: ", self$deletedRedundantCovariateIds))
writeLines(paste("Number of infrequent covariate IDs deleted: ", self$deletedInfrequentCovariateIds))
writeLines(paste("Number of redundant outcome model covariate IDs deleted: ", self$deletedRedundantCovariateIdsForOutcomeModel))
writeLines(paste("Number of infrequent outcome model covariate IDs deleted: ", self$deletedInfrequentCovariateIdsForOutcomeModel))
writeLines(paste("Propensity Score Model Coefficient: ", self$psModelCoef))
writeLines(paste("Propensity Score Model Variance: ", self$psModelPriorVariance))
writeLines(paste("Propensity Score Error", self$psError))
writeLines(paste("High Correlation Propensity Scores: ", dim(self$psHighCorrelation)))
writeLines(paste("Estimator: ", self$estimator))
return(invisible(self))
}
Overload the print generic to nicely print the current fields.
print(metaData)
Class: MetaData R6
Target ID: 1
Comparator ID: 2
Study Start Date:
Study End Date:
Attrition:
Number of Outcome IDs: 0
Population size: 0
Number of redundant covariate IDs deleted:
Number of infrequent covariate IDs deleted:
Number of redundant outcome model covariate IDs deleted:
Number of infrequent outcome model covariate IDs deleted:
Propensity Score Model Coefficient:
Propensity Score Model Variance:
Propensity Score Error
High Correlation Propensity Scores:
Estimator: att
Private
Methods
formatStudyDates
formatStudyDates = function() {
if (is.null(self$studyStartDate)) {
self$studyStartDate <- ""
}
if (is.null(self$studyEndDate)) {
self$studyEndDate <- ""
}
if (self$studyStartDate != "" &&
regexpr("^[12][0-9]{3}[01][0-9][0-3][0-9]$", self$studyStartDate) == -1) {
stop("Study start date must have format YYYYMMDD")
}
if (self$studyEndDate != "" &&
regexpr("^[12][0-9]{3}[01][0-9][0-3][0-9]$", self$studyEndDate) == -1) {
stop("Study end date must have format YYYYMMDD")
}
return(invisible(self))
}
Method to format the study end and start dates (from: DataLoadingSaving.R).
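The regular expression above checks the shape of the string only, so a value such as "20231339" (month 13, day 39) passes it. If stricter checking is wanted, one option is to also parse the string as a calendar date. This is a sketch, not what DataLoadingSaving.R does; the helper name isValidStudyDate is hypothetical.

```r
# Hypothetical helper: combines the shape check used above with a calendar check.
isValidStudyDate <- function(date) {
  # An empty string means "no restriction", as in formatStudyDates().
  if (identical(date, "")) {
    return(TRUE)
  }
  hasShape <- grepl("^[12][0-9]{3}[01][0-9][0-3][0-9]$", date)
  # as.Date() returns NA when the digits do not form a real calendar date.
  hasShape && !is.na(as.Date(date, format = "%Y%m%d"))
}

isValidStudyDate("20230131")
isValidStudyDate("20231339")
isValidStudyDate("2023-01-31")
```

Keeping the regular expression in front of as.Date() matters, because as.Date() on its own tolerates trailing characters and shorter day fields.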
In the following section, an outcomeModel class is specified using metaData.
outcomeModel <- metaData
outcomeModel$outcomeModelTreatmentVarId <- treatmentVarId
outcomeModel$outcomeModelCoefficients <- coefficients
outcomeModel$logLikelihoodProfile <- logLikelihoodProfile
outcomeModel$outcomeModelPriorVariance <- priorVariance
outcomeModel$outcomeModelLogLikelihood <- logLikelihood
outcomeModel$outcomeModelType <- modelType
outcomeModel$outcomeModelStratified <- stratified
outcomeModel$outcomeModelUseCovariates <- useCovariates
outcomeModel$inversePtWeighting <- inversePtWeighting
outcomeModel$outcomeModelTreatmentEstimate <- treatmentEstimate
outcomeModel$outcomeModelmainEffectEstimates <- mainEffectEstimates
if (length(interactionCovariateIds) != 0) {
outcomeModel$outcomeModelInteractionEstimates <- interactionEstimates
}
outcomeModel$outcomeModelStatus <- status
outcomeModel$populationCounts <- getCounts(population, "Population count")
outcomeModel$outcomeCounts <- getOutcomeCounts(population, modelType)
outcomeModel$timeAtRisk <- getTimeAtRisk(population, modelType)
if (!is.null(subgroupCounts)) {
outcomeModel$subgroupCounts <- subgroupCounts
}
class(outcomeModel) <- "OutcomeModel"
My suggestion would be making another class called OutcomeModel, which inherits from MetaData, extending the functionality.
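A rough sketch of what that inheritance could look like in R6 follows. The MetaData class here is a trimmed stand-in with two fields, and the child shows only two of the outcome model fields; the constructor arguments and the example values are assumptions for illustration.

```r
library(R6)

# Trimmed stand-in for the proposed MetaData class (most fields omitted).
MetaData <- R6Class(
  classname = "MetaData",
  public = list(
    targetId = 0,
    comparatorId = 0,
    initialize = function(targetId, comparatorId) {
      self$targetId <- targetId
      self$comparatorId <- comparatorId
      invisible(self)
    },
    getMetaData = function() {
      list(targetId = self$targetId, comparatorId = self$comparatorId)
    }
  )
)

# Hypothetical child class; only two outcome model fields are shown.
OutcomeModel <- R6Class(
  classname = "OutcomeModel",
  inherit = MetaData,
  public = list(
    outcomeModelType = "",
    outcomeModelStatus = "",
    initialize = function(targetId, comparatorId, modelType, status) {
      # Reuse the parent constructor for the shared fields.
      super$initialize(targetId, comparatorId)
      self$outcomeModelType <- modelType
      self$outcomeModelStatus <- status
      invisible(self)
    },
    getMetaData = function() {
      # Extend the parent's list rather than repeating its fields.
      c(super$getMetaData(),
        list(outcomeModelType = self$outcomeModelType,
             outcomeModelStatus = self$outcomeModelStatus))
    }
  )
)

outcomeModel <- OutcomeModel$new(1, 2, modelType = "cox", status = "OK")
names(outcomeModel$getMetaData())
class(outcomeModel)
```

The child object keeps "MetaData" in its class vector, so code that checks inherits(x, "MetaData") continues to work.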
Thanks!
The meta data keeps growing through the pipeline (e.g. psModelCoef doesn't get added until createPs is called). I wonder if we should use a more 'compositional' approach? So we have data-loading meta data and PS-model meta-data, which together combine into an overall meta data object. What do you think?
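To make the compositional idea concrete, here is one possible shape for it in R6. All class names (DataLoadingMetaData, PsModelMetaData, CombinedMetaData), the toList() method, and the trimmed field lists are hypothetical and only meant as a sketch.

```r
library(R6)

# Hypothetical component classes; names and trimmed field lists are assumptions.
DataLoadingMetaData <- R6Class(
  classname = "DataLoadingMetaData",
  public = list(
    targetId = NULL,
    comparatorId = NULL,
    initialize = function(targetId, comparatorId) {
      self$targetId <- targetId
      self$comparatorId <- comparatorId
      invisible(self)
    },
    toList = function() {
      list(targetId = self$targetId, comparatorId = self$comparatorId)
    }
  )
)

PsModelMetaData <- R6Class(
  classname = "PsModelMetaData",
  public = list(
    estimator = NULL,
    initialize = function(estimator = "att") {
      self$estimator <- estimator
      invisible(self)
    },
    toList = function() {
      list(estimator = self$estimator)
    }
  )
)

# The overall object holds the parts; the PS part is absent until it exists.
CombinedMetaData <- R6Class(
  classname = "CombinedMetaData",
  public = list(
    dataLoading = NULL,
    psModel = NULL,
    initialize = function(dataLoading, psModel = NULL) {
      self$dataLoading <- dataLoading
      self$psModel <- psModel
      invisible(self)
    },
    getMetaData = function() {
      psFields <- if (is.null(self$psModel)) list() else self$psModel$toList()
      c(self$dataLoading$toList(), psFields)
    }
  )
)

combined <- CombinedMetaData$new(DataLoadingMetaData$new(targetId = 1, comparatorId = 2))
namesBeforePs <- names(combined$getMetaData())

# Later in the pipeline the PS-model part is attached.
combined$psModel <- PsModelMetaData$new(estimator = "att")
namesAfterPs <- names(combined$getMetaData())
```

Each part validates only its own fields, and the combined object grows as the pipeline progresses.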
Should we have a separate class for attrition?
I think splitting out the metadata into different classes is a good approach. For the sake of inheritance, I think it would be best for the "data-loading meta data" class to inherit from the "PS-model meta data" class, as in my first example the only private function pertains to the "data-loading meta data" class. The print method is then inherited. The fields can be packaged up in either a named list or a data.frame, keeping fields ambiguous for the print method, and any other methods we might think of in future.
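As a sketch of what "keeping fields ambiguous for the print method" could mean, the print method below loops over whatever a named list contains instead of naming each field. The class is a simplified stand-in for MetaDataPS, and the output layout is an assumption.

```r
library(R6)

# Simplified stand-in for MetaDataPS: all fields live in one named list.
MetaDataPS <- R6Class(
  classname = "MetaDataPS",
  public = list(
    fields = NULL,
    initialize = function(fields) {
      stopifnot(is.list(fields), !is.null(names(fields)))
      self$fields <- fields
      invisible(self)
    },
    print = function(...) {
      writeLines(paste("Class:", paste0(class(self), collapse = " ")))
      # The method does not need to know which fields exist.
      for (name in names(self$fields)) {
        value <- self$fields[[name]]
        writeLines(paste0(name, ": ", paste(format(value), collapse = ", ")))
      }
      invisible(self)
    }
  )
)

metaData <- MetaDataPS$new(list(populationSize = 0, estimator = "att"))
output <- capture.output(print(metaData))
output
```

A child class that stores different entries in fields would inherit this print method unchanged.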
MetaDataPS |
---|
Attributes |
+ fields (data.frame / list) |
Methods |
+ initialize |
+ getMetaData |
- validate |
The fields data.frame or list would contain: outcomeIds, populationSize, deletedRedundantCovariateIds, deletedInfrequentCovariateIds, deletedRedundantCovariateIdsForOutcomeModel, deletedInfrequentCovariateIdsForOutcomeModel, psModelCoef, psModelPriorVariance, psError, psHighCorrelation, estimator.
MetaDataLoading (MetaDataPS) |
---|
Attributes |
Methods |
+ initialize |
- validate |
- formatStudyDates |
* Cursive names are overloaded methods
The fields data.frame or list would contain: targetId, comparatorId, studyStartDate, studyEndDate.
outcomeModel would then also be able to inherit from MetaDataPS:
OutcomeModel (MetaDataPS) |
---|
Attributes |
Methods |
+ initialize |
+ coef |
+ confint |
- validate |
The fields data.frame or list would contain: outcomeModelTreatmentVarId, outcomeModelCoefficients, logLikelihoodProfile, outcomeModelPriorVariance, outcomeModelLogLikelihood, outcomeModelType, outcomeModelStratified, outcomeModelUseCovariates, inversePtWeighting, outcomeModelTreatmentEstimate, outcomeModelmainEffectEstimates, outcomeModelInteractionEstimates, outcomeModelStatus, populationCounts, outcomeCounts, timeAtRisk, subgroupCounts.
coef would overload stats::coef and confint would overload stats::confint.
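One way this could work is to give the R6 class its own coef() and confint() methods and add thin S3 methods that forward to them. The sketch below is an assumption-heavy illustration: the class name OutcomeModelSketch, the fields logRr and seLogRr, and the normal-approximation interval are all hypothetical and do not reflect how CohortMethod computes its estimates.

```r
library(R6)

# Hypothetical class; the stored values and the interval formula are assumptions.
OutcomeModelSketch <- R6Class(
  classname = "OutcomeModelSketch",
  public = list(
    initialize = function(logRr, seLogRr) {
      private$logRr <- logRr
      private$seLogRr <- seLogRr
      invisible(self)
    },
    coef = function() {
      private$logRr
    },
    confint = function(level = 0.95) {
      # Normal-approximation interval, used here only for illustration.
      z <- stats::qnorm(1 - (1 - level) / 2)
      c(lower = private$logRr - z * private$seLogRr,
        upper = private$logRr + z * private$seLogRr)
    }
  ),
  private = list(logRr = NULL, seLogRr = NULL)
)

# S3 methods so that stats::coef() and stats::confint() dispatch to the R6 methods.
coef.OutcomeModelSketch <- function(object, ...) {
  object$coef()
}
confint.OutcomeModelSketch <- function(object, parm, level = 0.95, ...) {
  object$confint(level = level)
}

model <- OutcomeModelSketch$new(logRr = 0.5, seLogRr = 0.1)
estimate <- coef(model)
interval <- confint(model)
```

In a package the two S3 methods would also need to be registered in the NAMESPACE file.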
Regarding attrition, I think we could handle it in a similar manner:
Attrition (MetaDataPS) |
---|
Attributes |
Methods |
+ initialize |
- validate |
The fields data.frame or list would contain: description, targetPersons, comparatorPersons, targetExposures, comparatorExposures, rowCount, treatment, personCount.
A couple of choices that need to be made:
- Do we allow fields to be freely editable, as a public attribute?
- Generalize Class and method naming, as this extends beyond Metadata.
- Are there any listed fields attributes that would be useful to have in multiple classes?
Should MetaDataPS inherit from MetaDataLoading, instead of the other way around? This would follow the order in which they are created: getDbCohortMethodData() would create MetaDataLoading, and createPs() would extend that with the attributes (and any methods) to represent the PS meta-data. The outcome model meta data would extend the MetaDataPS. And whatever is the top class would need to have the attrition attribute.
In response to your questions:
- If we're going to be strict (which I propose we do), meta-data should be set when the object is created (via the constructor), and then not be allowed to be modified. (so only getters, not setters). However, we do want the attrition table to grow over time, which can be viewed as modification although we shouldn't touch earlier entries in the table.
- In general I try to avoid making up new things, so I probably would name the metadata classes after the functions that generate them. That would lead to long names though, like 'GetDbCohortMethodDataMetaData' :-( Note that currently the class name 'OutcomeModel' conflicts with this.
- I don't think so, but I guess that depends on what inherits from what.
Should MetaDataPS inherit from MetaDataLoading, instead of the other way around? This would follow the order in which they are created: getDbCohortMethodData() would create MetaDataLoading, and createPs() would extend that with the attributes (and any methods) to represent the PS meta-data. The outcome model meta data would extend the MetaDataPS. And whatever is the top class would need to have the attrition attribute.
So my reasoning as to why the classes are set up like that is that MetaDataPS is the simplest of the bunch, and we'd extend the child classes where needed with additional methods, i.e. formatStudyDates() for MetaDataLoading, and coef() and confint() for OutcomeModel.
If we implemented it as you propose, the formatStudyDates method from MetaDataLoading would also be inherited by all child classes, but I don't think we would need this method in any of them.
It boils down to an organizational choice to keep methods in places where they're needed.
In response to your questions:
- If we're going to be strict (which I propose we do), meta-data should be set when the object is created (via the constructor), and then not be allowed to be modified. (so only getters, not setters). However, we do want the attrition table to grow over time, which can be viewed as modification although we shouldn't touch earlier entries in the table.
I agree, we can add an appendAttrition(data.frame) method to the attrition class, that would just add new rows.
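A minimal sketch of such an append-only attrition class is shown below. The class name AttritionSketch, the getAttrition() getter, the column check, and the example counts are assumptions made for illustration; only three of the attrition columns are used.

```r
library(R6)

# Hypothetical append-only attrition class.
AttritionSketch <- R6Class(
  classname = "AttritionSketch",
  public = list(
    initialize = function(attrition) {
      stopifnot(is.data.frame(attrition))
      private$table <- attrition
      invisible(self)
    },
    appendAttrition = function(newRows) {
      # New rows must have the same columns; earlier rows are never changed.
      stopifnot(is.data.frame(newRows),
                identical(names(newRows), names(private$table)))
      private$table <- rbind(private$table, newRows)
      invisible(self)
    },
    getAttrition = function() {
      private$table
    }
  ),
  private = list(table = NULL)
)

# Illustrative counts only.
attrition <- AttritionSketch$new(data.frame(
  description = "Original cohorts",
  targetPersons = 1000,
  comparatorPersons = 2000,
  stringsAsFactors = FALSE
))
attrition$appendAttrition(data.frame(
  description = "Restricted to common period",
  targetPersons = 900,
  comparatorPersons = 1800,
  stringsAsFactors = FALSE
))
attritionTable <- attrition$getAttrition()
```

Because the table is private and there is no setter, the only way to change it from outside is to add rows at the end.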
- In general I try to avoid making up new things, so I probably would name the metadata classes after the functions that generate them. That would lead to long names though, like 'GetDbCohortMethodDataMetaData' :-(
Generally I agree, though I don't know if we should just bite the bullet on this. The class only really contains a targetId, comparatorId, studyStartDate, and studyEndDate. So I'll make one more suggestion: StudyMetaData. Otherwise it can just be GetDbCohortMethodDataMetaData as a working name.
Note that currently the class name 'OutcomeModel' conflicts with this.
I thought we'd eventually replace the S3 implementation with the R6 one. We can name it OutcomeModelR6 for now, as the idea is the same but the implementation is different.
- I don't think so, but I guess that depends on what inherits from what.
But the meta-data coming out of the createPs() function will also contain all the meta-data that came out of getDbCohortMethodData(), right? The trail keeps growing as more data is applied. So the createPs metadata should have a formatStudyDates() function because it will contain the study dates used to create the CohortMethodData that was the input to createPs.
Why would the class for the meta-data for the outcome model be called 'OutcomeModel'? I thought we'd distinguish between the data itself and its meta-data, but perhaps you're thinking of those being represented by a single class?