MultivariateStats.jl

RFC: Unified API

Open wildart opened this issue 4 years ago • 7 comments

Following #95, I looked at the multivariate (MV) models and methods implemented in this package to work out a type hierarchy and the corresponding method interfaces.

Here is a table of the models and the function names each of them currently implements.

Function \ Model CCA WHT ICA LDA FA PPCA PCA KPCA MDS
fit x x x x x x x x x
transform x x x x x x x x x
predict x
indim x x x x x x x x
outdim x x x x x x x x x
mean x x x x x x x ?
var x x ? ? ?
cov x ?
cor x
projection x x x x x x
reconstruct x x x x
loadings ? ? x x ? ? ?
eigvals ? ? ? ? x
eigvecs ? ? ? ? ?
length
size

I put ? where a possible implementation is missing or called differently.

So, I propose the following type hierarchy:

  • StatsBase.RegressionModel
    • Methods: CCA, LDA
    • Functions: fit, transform, indim, outdim, mean
    • Subtypes:
      • AbstractDimensionalityReduction
        • Functions: projection, var, reconstruct, loadings
        • Subtypes:
          • LinearDimensionalityReduction
            • Methods: ICA, PCA
          • NonlinearDimensionalityReduction
            • Methods: KPCA, MDS
          • LatentVariableModel or LatentVariableDimensionalityReduction
            • Methods: FA, PPCA
            • Functions: cov
  • StatsBase.AbstractDataTransform
    • Whitening
    • Functions: fit, transform, indim, outdim, mean, size
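
A minimal sketch of how the dimensionality-reduction branch of this hierarchy could look in Julia. Everything here is an assumption for illustration: the two root types are local stand-ins for StatsBase.RegressionModel and StatsBase.AbstractDataTransform so that the block runs without extra packages, and ToyPCA is a hypothetical model, not the package's PCA type.

```julia
import Statistics: mean   # extend the generic `mean` for the toy model below

# Local stand-ins for the StatsBase root types (assumption, see lead-in).
abstract type RegressionModel end
abstract type AbstractDataTransform end

# Proposed abstract layers for dimensionality reduction.
abstract type AbstractDimensionalityReduction <: RegressionModel end
abstract type LinearDimensionalityReduction <: AbstractDimensionalityReduction end
abstract type NonlinearDimensionalityReduction <: AbstractDimensionalityReduction end
abstract type LatentVariableDimensionalityReduction <: AbstractDimensionalityReduction end

# Hypothetical linear model: stores a center and a d × p projection matrix.
struct ToyPCA <: LinearDimensionalityReduction
    center::Vector{Float64}
    proj::Matrix{Float64}
end

# Per-model accessors from the proposed interface.
indim(M::ToyPCA) = size(M.proj, 1)
outdim(M::ToyPCA) = size(M.proj, 2)
mean(M::ToyPCA) = M.center
projection(M::ToyPCA) = M.proj

# Generic fallbacks written once against the abstract type; they only rely on
# `projection` and `mean`, so any subtype providing those gets them for free.
transform(M::AbstractDimensionalityReduction, x::AbstractVecOrMat) =
    projection(M)' * (x .- mean(M))
reconstruct(M::AbstractDimensionalityReduction, y::AbstractVecOrMat) =
    projection(M) * y .+ mean(M)
```

With this layout, code such as a plotting recipe could dispatch on AbstractDimensionalityReduction and call projection or transform without knowing the concrete model.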

@nalimilan @ararslan Thoughts?

wildart avatar Oct 10 '19 19:10 wildart

That makes sense to me. Might be nice to have an abstract dimensionality reduction type in there that linear, nonlinear, and latent variable types can subtype.

ararslan avatar Oct 10 '19 19:10 ararslan

Might be nice to have an abstract dimensionality reduction type in there that linear, nonlinear, and latent variable types can subtype.

That would be AbstractDimensionalityReduction

wildart avatar Oct 10 '19 20:10 wildart

Whoops, don't know how I missed that...

ararslan avatar Oct 10 '19 21:10 ararslan

This seems great to me.

As my primary interest in this is plotting, one thing I'd like to know is whether there's a common method for obtaining the vectors that would be used in a plot. I'm not super knowledgeable about the terminology, but I think different quantities are commonly plotted for different dimensionality reductions. For MDS and PCA (I think), one is supposed to plot the eigenvectors scaled by the square root of the eigenvalue.

But finding information on this has been a bit challenging for me, not knowing all of the jargon.

kescobo avatar Oct 19 '19 15:10 kescobo

Loadings are eigenvectors scaled by the square roots of their eigenvalues. It will be easy to add them to every eigendecomposition-based method.
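
As a rough illustration, loadings can be computed directly from an eigendecomposition of the sample covariance matrix. This sketch uses only the LinearAlgebra standard library and made-up toy data; it assumes the common convention of scaling each eigenvector by the square root of its eigenvalue, and the variable names are hypothetical rather than taken from the package.

```julia
using LinearAlgebra

# Toy data (assumption): 2 variables in rows, 5 observations in columns.
X = [2.0 4.0 6.0 8.0 10.0;
     1.0 3.0 2.0 5.0  4.0]

n = size(X, 2)
Xc = X .- sum(X, dims=2) ./ n      # center each variable
C = Xc * Xc' ./ (n - 1)            # sample covariance matrix, 2 × 2

E = eigen(Symmetric(C))            # eigenvalues are returned in ascending order
order = sortperm(E.values, rev=true)
vals = E.values[order]             # eigenvalues, largest first
vecs = E.vectors[:, order]         # matching eigenvectors as columns

# Loadings: column j of `vecs` scaled by sqrt(vals[j]).
L = vecs .* sqrt.(vals)'
```

The sign of each eigenvector is arbitrary, so the columns of L are only defined up to sign. A quick sanity check is that L * L' reproduces C.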

wildart avatar Oct 20 '19 06:10 wildart

Sounds like a good idea. Is the LinearDimensionalityReduction vs. NonlinearDimensionalityReduction distinction useful? I guess it doesn't hurt, but in your plan it doesn't really make a difference AFAICT.

Also, shouldn't PCA implement loadings?

nalimilan avatar Oct 21 '19 12:10 nalimilan

Fantastic. What about things like LDA and CCA? I've definitely seen those plotted, but your schema above doesn't have loadings for those, cf.

I know this is somewhat orthogonal; I can open a separate issue if that would be useful. In any case, having unified APIs for this stuff will be fantastic.

kescobo avatar Oct 21 '19 16:10 kescobo