Enhance StatsAPI integration
One problem that I have run into is that both SEM and StatsAPI define params(), so if the user is using both in the same session (StatsAPI e.g. via Distributions), there would be a collision. One solution is just to import StatsAPI and use their params() method. (The context is very similar, however params() in StatsAPI is supposed to return the actual values of the parameters, whereas SEM one returns their names. There is also StatsAPI.coefnames() method.)
The other methods of StatsAPI could be reused as well, e.g. StatsAPI.dof() (instead of SEM.df()), pvalue(), "fit()*, etc. Overall, that would make SEM.jl better integrated into the Julia's ecosystem of statistical models.
@alyst I want to get a bit more involved/helpfull in SEM.jl and this seems like a issue I could tackle. Any PR of me would probably require a careful eye. Should I take a stab at it?
@aaronpeikert Sure, that would be great! I can help you if you need my review.
OK this seems a bit complicated and some substantive decisions need to be made:
- [ ]
StatisticalModel-->abstract type AbstractSem <: StatsAPI.StatisticalModel end - [x]
RegressionModel--> dont use. all methods except for fitted dont make any sense (assume a single dependent variable) - [x]
params: StatsAPI defines this as all parameters, SEM.jl as all identifiers --> conform to StatsAPI - [ ]
coef: StatsAPI defines this as all parameters that are required for prediction (e.g., it excludes residual variance for a linear model); distinction to params makes no sense in SEM --> define as synonym - [ ]
coeftable--> synonym to parametertable, requires confint - [ ]
confint--> implement CI calculation, maybe only availible when Distribtions is loaded via package extention - [ ]
coefnames: StatsAPI's name for SEM.jl's params --> conform to StatsAPI - [x]
paramsnames: does not exist in StatsAPI but should exist if the define coef === params --> synonym to coefnames - [ ]
isfitted: Was the model fitted to the data? --> implement - [ ]
fitted--> check isfitted and return coef(model) otherwise argerror - [ ]
predict--> don't implement (see below) - [ ]
fitwhat was again the reason for sem_fit instead of fit --> renamesem_fit - [ ]
pvalue--> come up with implementation that does not depend onDistributions.jl - [ ]
loglikelihood(model::StatisticalModel, observation)the contribution of a single sample (here called observation). Not sure this is easy. - [ ]
nullloglikelihoodwhat definition of null model do we use? Is is trivial to autogenerate? - [ ]
score: gradient of the log-likelihood with respect to the coefficients --> implement - [ ]
nobs--> Throw argument error, we use nsample - [ ]
dof--> rename df - [ ]
mss--> don't implement because it refers to something different for SEMs than for regression models - [ ]
informationmatrix--> return if hessian is availible - [ ]
vcovReturn the variance-covariance matrix for the coefficients of the model. --> return if hessian is availible - [ ]
BIC,AIC,AICc--> only the last needs to be newly implemented? - [ ]
islinear--> dispatch on imply should suffice - [ ] ALL other functions defined for
StatisticalModelshould throw a MethodError
Should I go ahead with this plan?
It is a nice plan! Some comments:
- StatsAPI.jl support could be implemented in phases. We can classify the items according to these phases
- The phase 1 would be to resolve current name clashes, like
params, because if I'm usingparamsfrom SEM.jl in my script/package, and then I add StatsAPI.jl (or one of the packages that re-export it), the script/package cannot be compiled. - The phase 2 would be to rename existing SEM.jl functions to match StatsAPI.jl or add synonyms to the StatsAPI.jl functions (e.g.
SEM.dfdoes not clash, but it could beStatsAPI.dof) - The phase 3 would be to implement as much of StatsAPI.jl functions as possible (where it makes sense).
- The phase 1 would be to resolve current name clashes, like
- Some specific items:
AbstractSem <: StatisticalModel-- would be nice if some SEM.jl types are derived from StatsAPI, but there is alsoSemandSemFit, so that may require more considerations. It looks like StatsAPI assumes that fitting modifies the existing statistical model (maybe that is inspired by R). That's not ideal from the flow of information PoV and complicates both the internal and the user script logic. I like how SEM.jl distinguishes between model definitionSemand fitted modelSemFit, so SEM.jl may diverge from StatsAPI in that aspect.SEM.params->SEM.paramnamesorSEM.coefnames: renaming toparamnamesmakes sense to me. The internals of SEM.jl (e.g. ParamsMatrix) do not rely on parameter IDs, rather their indices.coefasparamssynonym: that's "phase 2", so we can decide on it laterconfint-- would be nice to have, esp. since the manuscript tutorial has the user code to calculate it, but it would be more convenient to have it in the packageisfitted:Sem(as the model definition) can returnfalseor throw an exception that one has to runfit(Sem);isfitted(SemFit)would be always truefit: I like the idea thatfit(Sem)returnsSemFitpvalue: why it should be independent from Distributions.jl?score: phase 3loglikelihood/nullloglikelihood: that's phase 3nobs: could be synonym tonsamples(the reason SEM.jl usesnsamplesis to disambiguate from observed variables). SEM.jl tutorials may usensamplesonly, but the code may allownobsfor those familiar withStatsAPIAICc: that's phase 3predict: I thinkpredict(SemFit, data)could be used to predict latent variable values for each sample. In fact, the support forpredictis in my staging branch.imply/impliedis SEM-specific, I think it should not be synonymous topredict.- throw
MethodErrorfor other functions: that would be necessary if SEM.jl types are derived fromStatisticalModel, but this decision could be postponed until phase 3.
- It will also be nice to get the feedback from the StatsAPI.jl maintainers. While they may not have considered use-cases like SEM.jl originally, they may have some ideas how StatsAPI.jl could be tweaked to support it.
I would implement coeftable as
coeftable(model::SemFit, specification::SemSpecification; level::Real=0.95, ci_function = se_hessian, kwargs...)
end
In SEM's, every variable is both predictor and outcome, so for new samples, it is not possible to predict something - I would therefore not implement this.
Following Alexey's feedback, I would restructure the open points as:
Resolve Clashes / params renaming
- [x] #239
- [x]
paramsnames: does not exist in StatsAPI but should exist if the define coef === params --> synonym to coefnames - [x]
coefnames: StatsAPI's name for SEM.jl's params --> conform to StatsAPI - [x]
coef: StatsAPI defines this as all parameters that are required for prediction (e.g., it excludes residual variance for a linear model); distinction to params makes no sense in SEM --> define as synonym
Conform to StatsAPI naming
- [x]
nobs--> define as synonym tonsamble - [x]
dof--> rename df - [x]
fitwhat was again the reason for sem_fit instead of fit --> renamesem_fit - [x]
coeftable--> synonym to parametertable, requires confint -> throws error and refers toParameterTable
Further integration with StatsAPI
- [ ] How should we handle the distiction between
Sem,SemFitandParameterTable? What should be a subtype ofStatsAPI.StatisticalModel? - [ ] ALL other functions defined for
StatisticalModelshould throw a MethodError - [ ] implement
AICc - [ ]
informationmatrix--> return if hessian is availible - [ ]
vcovReturn the variance-covariance matrix for the coefficients of the model. --> return if hessian is availible - [ ]
islinear--> dispatch on imply should suffice - [ ]
isfitted: Was the model fitted to the data? --> implement - [ ]
fitted--> check isfitted and return coef(model) otherwise - [ ]
nullloglikelihoodwhat definition of null model do we use? Is is trivial to autogenerate? - [ ]
score: gradient of the log-likelihood with respect to the coefficients --> implement - [ ]
loglikelihood(model::StatisticalModel, observation)the contribution of a single sample (here called observation). Not sure this is easy. - [ ]
pvalue--> come up with implementation that does not depend onDistributions.jl - [x]
predict--> don't implement because it doesn't have a well-defined meaning for SEMs - [ ] -
confint--> implement CI calculation, maybe only available when Distribtions is loaded via package extention
Other
- [x]
mss--> don't implement because it refers to something different for SEMs than for regression models - [x]
RegressionModel--> dont use. all methods except for fitted dont make any sense (assume a single dependent variable)
How should we handle the distiction between Sem, SemFit and ParameterTable? What should be a subtype of StatsAPI.StatisticalModel ?
I opted to implement params and family only for parameter table, since for Sem and SemFit it was not immediatly clear. However, I have not subtyped ParameterTable as statistical model since we dont gain cool (or even any?) methods for free from that. Also ParameterTable is subtype of AbstractSpecification, complicating matters further.