StructuralEquationModels.jl icon indicating copy to clipboard operation
StructuralEquationModels.jl copied to clipboard

Enhance StatsAPI integration

Open alyst opened this issue 11 months ago • 8 comments

One problem that I have run into is that both SEM and StatsAPI define params(), so if the user is using both in the same session (StatsAPI e.g. via Distributions), there would be a collision. One solution is just to import StatsAPI and use their params() method. (The context is very similar, however params() in StatsAPI is supposed to return the actual values of the parameters, whereas SEM one returns their names. There is also StatsAPI.coefnames() method.)

The other methods of StatsAPI could be reused as well, e.g. StatsAPI.dof() (instead of SEM.df()), pvalue(), "fit()*, etc. Overall, that would make SEM.jl better integrated into the Julia's ecosystem of statistical models.

alyst avatar Jan 15 '25 21:01 alyst

@alyst I want to get a bit more involved/helpfull in SEM.jl and this seems like a issue I could tackle. Any PR of me would probably require a careful eye. Should I take a stab at it?

aaronpeikert avatar Jan 24 '25 11:01 aaronpeikert

@aaronpeikert Sure, that would be great! I can help you if you need my review.

alyst avatar Jan 24 '25 18:01 alyst

OK this seems a bit complicated and some substantive decisions need to be made:

  • [ ] StatisticalModel --> abstract type AbstractSem <: StatsAPI.StatisticalModel end
  • [x] RegressionModel --> dont use. all methods except for fitted dont make any sense (assume a single dependent variable)
  • [x] params: StatsAPI defines this as all parameters, SEM.jl as all identifiers --> conform to StatsAPI
  • [ ] coef: StatsAPI defines this as all parameters that are required for prediction (e.g., it excludes residual variance for a linear model); distinction to params makes no sense in SEM --> define as synonym
  • [ ] coeftable --> synonym to parametertable, requires confint
  • [ ] confint --> implement CI calculation, maybe only availible when Distribtions is loaded via package extention
  • [ ] coefnames: StatsAPI's name for SEM.jl's params --> conform to StatsAPI
  • [x] paramsnames: does not exist in StatsAPI but should exist if the define coef === params --> synonym to coefnames
  • [ ] isfitted: Was the model fitted to the data? --> implement
  • [ ] fitted --> check isfitted and return coef(model) otherwise argerror
  • [ ] predict --> don't implement (see below)
  • [ ] fit what was again the reason for sem_fit instead of fit --> rename sem_fit
  • [ ] pvalue --> come up with implementation that does not depend on Distributions.jl
  • [ ] loglikelihood(model::StatisticalModel, observation) the contribution of a single sample (here called observation). Not sure this is easy.
  • [ ] nullloglikelihood what definition of null model do we use? Is is trivial to autogenerate?
  • [ ] score: gradient of the log-likelihood with respect to the coefficients --> implement
  • [ ] nobs --> Throw argument error, we use nsample
  • [ ] dof --> rename df
  • [ ] mss --> don't implement because it refers to something different for SEMs than for regression models
  • [ ] informationmatrix --> return if hessian is availible
  • [ ] vcov Return the variance-covariance matrix for the coefficients of the model. --> return if hessian is availible
  • [ ] BIC, AIC, AICc --> only the last needs to be newly implemented?
  • [ ] islinear --> dispatch on imply should suffice
  • [ ] ALL other functions defined for StatisticalModel should throw a MethodError

Should I go ahead with this plan?

aaronpeikert avatar Jan 25 '25 11:01 aaronpeikert

It is a nice plan! Some comments:

  1. StatsAPI.jl support could be implemented in phases. We can classify the items according to these phases
    • The phase 1 would be to resolve current name clashes, like params, because if I'm using params from SEM.jl in my script/package, and then I add StatsAPI.jl (or one of the packages that re-export it), the script/package cannot be compiled.
    • The phase 2 would be to rename existing SEM.jl functions to match StatsAPI.jl or add synonyms to the StatsAPI.jl functions (e.g. SEM.df does not clash, but it could be StatsAPI.dof)
    • The phase 3 would be to implement as much of StatsAPI.jl functions as possible (where it makes sense).
  2. Some specific items:
  • AbstractSem <: StatisticalModel -- would be nice if some SEM.jl types are derived from StatsAPI, but there is also Sem and SemFit, so that may require more considerations. It looks like StatsAPI assumes that fitting modifies the existing statistical model (maybe that is inspired by R). That's not ideal from the flow of information PoV and complicates both the internal and the user script logic. I like how SEM.jl distinguishes between model definition Sem and fitted model SemFit, so SEM.jl may diverge from StatsAPI in that aspect.
  • SEM.params -> SEM.paramnames or SEM.coefnames: renaming to paramnames makes sense to me. The internals of SEM.jl (e.g. ParamsMatrix) do not rely on parameter IDs, rather their indices.
  • coef as params synonym: that's "phase 2", so we can decide on it later
  • confint -- would be nice to have, esp. since the manuscript tutorial has the user code to calculate it, but it would be more convenient to have it in the package
  • isfitted: Sem (as the model definition) can return false or throw an exception that one has to run fit(Sem); isfitted(SemFit) would be always true
  • fit: I like the idea that fit(Sem) returns SemFit
  • pvalue: why it should be independent from Distributions.jl?
  • score: phase 3
  • loglikelihood/nullloglikelihood: that's phase 3
  • nobs: could be synonym to nsamples (the reason SEM.jl uses nsamples is to disambiguate from observed variables). SEM.jl tutorials may use nsamples only, but the code may allow nobs for those familiar with StatsAPI
  • AICc: that's phase 3
  • predict: I think predict(SemFit, data) could be used to predict latent variable values for each sample. In fact, the support for predict is in my staging branch. imply/implied is SEM-specific, I think it should not be synonymous to predict.
  • throw MethodError for other functions: that would be necessary if SEM.jl types are derived from StatisticalModel, but this decision could be postponed until phase 3.
  1. It will also be nice to get the feedback from the StatsAPI.jl maintainers. While they may not have considered use-cases like SEM.jl originally, they may have some ideas how StatsAPI.jl could be tweaked to support it.

alyst avatar Jan 25 '25 18:01 alyst

I would implement coeftable as

coeftable(model::SemFit, specification::SemSpecification; level::Real=0.95,  ci_function = se_hessian, kwargs...)
  
end

Maximilian-Stefan-Ernst avatar Feb 10 '25 16:02 Maximilian-Stefan-Ernst

In SEM's, every variable is both predictor and outcome, so for new samples, it is not possible to predict something - I would therefore not implement this.

Maximilian-Stefan-Ernst avatar Feb 10 '25 16:02 Maximilian-Stefan-Ernst

Following Alexey's feedback, I would restructure the open points as:

Resolve Clashes / params renaming

  • [x] #239
  • [x] paramsnames: does not exist in StatsAPI but should exist if the define coef === params --> synonym to coefnames
  • [x] coefnames: StatsAPI's name for SEM.jl's params --> conform to StatsAPI
  • [x] coef: StatsAPI defines this as all parameters that are required for prediction (e.g., it excludes residual variance for a linear model); distinction to params makes no sense in SEM --> define as synonym

Conform to StatsAPI naming

  • [x] nobs --> define as synonym to nsamble
  • [x] dof --> rename df
  • [x] fit what was again the reason for sem_fit instead of fit --> rename sem_fit
  • [x] coeftable --> synonym to parametertable, requires confint -> throws error and refers to ParameterTable

Further integration with StatsAPI

  • [ ] How should we handle the distiction between Sem, SemFit and ParameterTable? What should be a subtype of StatsAPI.StatisticalModel ?
  • [ ] ALL other functions defined for StatisticalModel should throw a MethodError
  • [ ] implement AICc
  • [ ] informationmatrix --> return if hessian is availible
  • [ ] vcov Return the variance-covariance matrix for the coefficients of the model. --> return if hessian is availible
  • [ ] islinear --> dispatch on imply should suffice
  • [ ] isfitted: Was the model fitted to the data? --> implement
  • [ ] fitted --> check isfitted and return coef(model) otherwise
  • [ ] nullloglikelihood what definition of null model do we use? Is is trivial to autogenerate?
  • [ ] score: gradient of the log-likelihood with respect to the coefficients --> implement
  • [ ] loglikelihood(model::StatisticalModel, observation) the contribution of a single sample (here called observation). Not sure this is easy.
  • [ ] pvalue --> come up with implementation that does not depend on Distributions.jl
  • [x] predict --> don't implement because it doesn't have a well-defined meaning for SEMs
  • [ ] - confint --> implement CI calculation, maybe only available when Distribtions is loaded via package extention

Other

  • [x] mss --> don't implement because it refers to something different for SEMs than for regression models
  • [x] RegressionModel --> dont use. all methods except for fitted dont make any sense (assume a single dependent variable)

Maximilian-Stefan-Ernst avatar Mar 13 '25 17:03 Maximilian-Stefan-Ernst

How should we handle the distiction between Sem, SemFit and ParameterTable? What should be a subtype of StatsAPI.StatisticalModel ?

I opted to implement params and family only for parameter table, since for Sem and SemFit it was not immediatly clear. However, I have not subtyped ParameterTable as statistical model since we dont gain cool (or even any?) methods for free from that. Also ParameterTable is subtype of AbstractSpecification, complicating matters further.

aaronpeikert avatar Mar 14 '25 09:03 aaronpeikert