Distributions.jl

Separate multivariate distributions into a separate package?

Open ararslan opened this issue 8 years ago • 23 comments

Discussed elsewhere, opening an issue to track it.

Since they have a slightly different interface and could benefit from different optimizations, @andreasnoack has suggested we separate multivariate distributions (and probably mixture models as well) into a separate package and keep Distributions.jl for univariate distributions.

ararslan avatar Aug 21 '17 17:08 ararslan

Do they need to be in separate packages though or would submodules suffice for now?

rofinn avatar Aug 21 '17 17:08 rofinn

Is Pkg set up to fetch submodules when updating?

ararslan avatar Aug 21 '17 17:08 ararslan

Sorry, I meant a Julia submodule, just to help organize things better before making a bunch of repos. I imagine once they're sufficiently isolated we'll be able to make Distributions.jl a wrapper package that re-exports things from the other APIs (e.g., AbstractDistributions.jl, UnivariateDistributions.jl, MultivariateDistributions.jl).

rofinn avatar Aug 21 '17 17:08 rofinn

At least for the Gaussian distribution, I would prefer not to separate the one-dimensional and the n-dimensional case into different objects, nor to hard-code various implementation details (isotropic or not). I'd rather give μ::T and σ or σσ' and provide implementations which dispatch to do the right thing regardless of whether μ is a number, a vector, or a fixed-size vector, as long as norm(σ\(x-μ))^2 is meaningful. Compare with https://github.com/mschauer/Bridge.jl/blob/master/src/gaussian.jl
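
To make the idea concrete, here is a minimal sketch. It is not the code from Bridge.jl; the type name `Gaussian` and the helpers `dim`, `sqmahal`, `logabsdetscale`, `logdensity` and `sample` are hypothetical names chosen for illustration, and the sketch assumes a real floating-point μ with covariance σσ'.

```julia
using LinearAlgebra

# Hypothetical type for illustration: μ is a number or a vector, and σ is a
# matching scale (number or square matrix) such that the covariance is σ*σ'.
struct Gaussian{T,S}
    μ::T
    σ::S
end

# Dimension of the underlying space: 1 for numbers, length(μ) for vectors.
dim(g::Gaussian) = length(g.μ)

# Squared Mahalanobis distance; works whenever σ \ (x - μ) is defined.
sqmahal(g::Gaussian, x) = norm(g.σ \ (x - g.μ))^2

# log|det σ| for scalar and matrix scales.
logabsdetscale(σ::Number) = log(abs(σ))
logabsdetscale(σ::AbstractMatrix) = logabsdet(σ)[1]

# Log-density of N(μ, σσ'); the same method serves numbers and vectors.
logdensity(g::Gaussian, x) =
    -(dim(g) * log(2π) + sqmahal(g, x)) / 2 - logabsdetscale(g.σ)

# Sampling as σ * z + μ, with z standard normal of the same shape as μ.
sample(g::Gaussian{<:Number}) = g.σ * randn(typeof(g.μ)) + g.μ
sample(g::Gaussian{<:AbstractVector}) =
    g.σ * randn(eltype(g.μ), length(g.μ)) + g.μ
```

With this, `logdensity(Gaussian(0.0, 1.0), 0.3)` and `logdensity(Gaussian([0.0, 0.0], [1.0 0.0; 0.5 2.0]), [0.3, -0.1])` go through the same method; only the two small helpers dispatch on whether the parameters are numbers or arrays.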

mschauer avatar Aug 22 '17 15:08 mschauer

@mschauer The mathematician in me is somewhat sympathetic to the idea, but I'm not sure how feasible and practical it actually is. For example, cdf is only really applicable to univariate distributions (yes, it can be defined for multivariate distributions, but it is typically too difficult to compute outside of simple cases).

Can you expand on your use case? I'm not sure what your example is trying to show.

simonbyrne avatar Aug 22 '17 18:08 simonbyrne

I guess that we could just overload cdf for the cases that can be supported?

andreasnoack avatar Aug 22 '17 18:08 andreasnoack

My code to simulate linear stochastic processes is indifferent to the underlying type T, as long as randn(T) gives a Gaussian increment.

In general, if I can write randn(T) and σ*randn(T) + μ, I would like Normal(zero(T)) and Normal(μ, σ*σ') to reflect the known properties of the corresponding distribution. This is not a case where generality conflicts with usefulness.

Example:

using Bridge, StaticArrays
Ts = [Float64, Complex{Float64}, SVector{2,Float64}]
Ps = [Wiener{T}() for T in Ts]
Xs = [sample([0.0, 0.5, 1.0], P)[end] for P in Ps]

gives

 1.0 => -0.36573           
 1.0 => -0.17426-0.234315im
 1.0 => [1.07265, -0.13575]

Suddenly in https://github.com/mschauer/Bridge.jl/blob/master/src/linpro.jl#L32 I cannot rely on Distributions to give the law of the increment.

mschauer avatar Aug 22 '17 19:08 mschauer

Thanks, that was useful, though I'm not sure it is really feasible in the current structure, e.g. we would lose the property that Normal <: UnivariateDistribution.

What would be the downside of simply dispatching on the appropriate distribution type, e.g. Normal, MvNormal, ComplexNormal, MatrixNormal (the latter two don't exist yet, but could in the future)?

simonbyrne avatar Aug 22 '17 22:08 simonbyrne

Actually, ComplexNormal and MatrixNormal are good counter-examples, since both are 3-parameter distributions: ComplexNormal has a mean, a covariance and a pseudo-covariance, and MatrixNormal has a mean and left and right covariance matrices.

simonbyrne avatar Aug 22 '17 22:08 simonbyrne

One would really focus on the case σ*randn(T) + μ, which for complex values makes the mixing pseudo-covariance zero.

mschauer avatar Aug 23 '17 06:08 mschauer

I find it practical to have both in the same package. Is this still an open discussion?

matbesancon avatar Apr 10 '19 10:04 matbesancon

Since nothing has happened, it should be fine to continue the discussion. While it is practical, it comes at a significant price.

  1. Many packages have Arpack as an unnecessary dependency because the multivariate distributions use PDMats which depends on Arpack.
  2. The multivariate distributions don't work with StaticArrays, although that would be much better in most applications. Part of the reason is that we don't want to add another fairly compilation-heavy dependency to Distributions.

andreasnoack avatar Apr 11 '19 07:04 andreasnoack

Static arrays can be supported to a large extent without depending on the StaticArrays package, I suppose.

mschauer avatar Apr 11 '19 09:04 mschauer

Yes, using the right abstractions could let us avoid depending on anything heavy. Lots of PRs lately have been about correct parametrization of types to increase interop. Maybe we could do something similar for PDMats? One pain point here is that the PDMats package defines the interface for PD matrices, but also the heavy Arpack-dependent implementation.

matbesancon avatar Apr 11 '19 10:04 matbesancon

The PDMats dependence could be replaced with an AbstractPDMats, at the cost of users of MvNormals providing actual factorizations for the covariance matrices themselves, which are then required to define unwhiten etc, as in https://github.com/mschauer/GaussianDistributions.jl (I am sorry for linking this repo so often in our discussions, honestly it is mostly a sketch.)

mschauer avatar Apr 11 '19 10:04 mschauer

Agreed, couldn't AbstractPDMats be defined in a separate package from the implementation? Or even better, moved to LinearAlgebra (not sure how possible this would be). Removing a heavy and unnecessary dependency would greatly lighten Distributions.

matbesancon avatar Apr 11 '19 10:04 matbesancon

Actually, I don't think we need a special type for this. PDness is a property of the values anyway, in the same way that σ>0 is value-dependent for Normal. I think we could just have a multivariate normal type that is parameterized on a Cholesky-like factor S, which could be either a matrix or a factorization, as long as S'S=Σ.
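
A minimal sketch of such a type, under stated assumptions: `FactorNormal`, `covariance`, `whiten`, `logdensity` and `draw` are hypothetical names, not part of Distributions.jl or PDMats, and S is assumed to be a square matrix-like object (for example a `Matrix` or the upper-triangular factor from `cholesky`) that supports `S' \ v` and `logabsdet(S)`.

```julia
using LinearAlgebra

# Hypothetical type: a multivariate normal stored via a factor S with S'S = Σ.
struct FactorNormal{M<:AbstractVector,F}
    μ::M
    S::F
end

# Covariance implied by the factor.
covariance(d::FactorNormal) = d.S' * d.S

# Whitening: if x ~ N(μ, S'S), then z = S' \ (x - μ) has identity covariance.
whiten(d::FactorNormal, x) = d.S' \ (x - d.μ)

# Log-density, using log det Σ = 2 log|det S|.
function logdensity(d::FactorNormal, x)
    z = whiten(d, x)
    return -(length(d.μ) * log(2π) + dot(z, z)) / 2 - logabsdet(d.S)[1]
end

# Sampling: x = μ + S'z with z standard normal, so that cov(x) = S'S.
draw(d::FactorNormal) = d.μ + d.S' * randn(length(d.μ))
```

For instance, `FactorNormal([1.0, 2.0], cholesky([4.0 1.0; 1.0 3.0]).U)` stores the triangular factor directly. No positive-definiteness wrapper type is involved; validity is a property of the values passed in.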

andreasnoack avatar Apr 11 '19 10:04 andreasnoack

Yes, one has to be a bit careful, because one would either like to give S or a factorisation object for Σ, which of course informs about S, but is not S.

mschauer avatar Apr 11 '19 10:04 mschauer

Maybe #823 can help us here.

andreasnoack avatar Apr 11 '19 11:04 andreasnoack

To dispatch on covar_mat = ... vs. factorization = ...?

matbesancon avatar Apr 11 '19 11:04 matbesancon

No, as a way to specify that you pass in a covariance matrix instead of the "square root" of the covariance.
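
One way this could look, as a sketch only: the keyword names `cov` and `factor` and the type `FactorNormal` are hypothetical and are not taken from #823 or from Distributions.jl. The constructor stores a factor S with S'S = Σ and uses keywords to say which of the two the caller is passing.

```julia
using LinearAlgebra

# Hypothetical type storing a mean and a factor S with S'S = Σ.
struct FactorNormal{M<:AbstractVector,F}
    μ::M
    S::F
end

# Hypothetical keyword constructor: the caller states explicitly whether the
# argument is the covariance itself or its "square root".
function FactorNormal(μ::AbstractVector; cov=nothing, factor=nothing)
    if (cov === nothing) == (factor === nothing)
        throw(ArgumentError("pass exactly one of `cov` or `factor`"))
    end
    # For a covariance matrix, take the upper Cholesky factor U with U'U = Σ.
    S = factor === nothing ? cholesky(Symmetric(cov)).U : factor
    return FactorNormal(μ, S)
end
```

Both `FactorNormal(μ; cov = Σ)` and `FactorNormal(μ; factor = cholesky(Σ).U)` then describe the same distribution, and a call that supplies neither or both keywords raises an `ArgumentError`.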

andreasnoack avatar Apr 11 '19 11:04 andreasnoack

I had planned to tackle this after #823: my general plan was to get rid of PDMats and just work with factorizations (or Diagonals), and make use of FillArrays.jl, but I'm pretty short on time at the moment...

simonbyrne avatar Apr 11 '19 18:04 simonbyrne

Revisiting this issue, reflecting on the experience with LogExpFunctions.jl. A year after the suggestion (https://github.com/JuliaStats/StatsFuns.jl/issues/46) to move it out of StatsFuns.jl 4 years ago, the code was transferred into a separate package.

Despite its small size, the package got quite a few PRs. For the authors, a smaller package is easier to contribute to. For the maintainers (especially @devmotion, who did a lot of work), a smaller codebase allows PRs to get reviewed and merged relatively quickly. I think this is the key factor; dependencies are kind of secondary, but of course it is always nice to have fewer of them.

Practically, I would recommend addressing #1139 first, then splitting into uni- and multivariate distributions.

tpapp avatar Jul 25 '22 08:07 tpapp