[ENH] `concat` operation on distributions

Open fkiraly opened this issue 1 year ago • 1 comments

There should be a concat operation on distributions.

This will require:

a "concatenated distribution" compositor skpro.concat, similar to pd.concat
a dunder-like method in distributions to implement specific concat operations that allow "flattening" concatenated distributions of the same type

A direct consumer of this interface would be sktime.forecasting's predict_proba vectorization.

May 17 '24 11:05 fkiraly

Hackmd notes from the previous discussion


concat operation

Example usage

d1 = Normal(mu=[[1, 2], [3, 4]], sigma=1)  # 2 x 2
d2 = Normal(mu=0, sigma = [[2, 42]])  # 1 x 2

d = concat([d1, d2], axis=0)  # 3 x 2

This d should then be the same abstract distribution (not necessarily the same object) as if constructed directly by

d = Normal(
    mu = [[1, 2], [3, 4], [0, 0]],
    sigma = [[1, 1], [1, 1], [2, 42]],
)  # 3 x 2

(the repetition of 0s and 1s is due to broadcasting)
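The broadcasting step can be made explicit with a small, dependency-free sketch. The helper names `broadcast_2d` and `concat_rows` are hypothetical and not part of skpro; a real implementation would more likely rely on numpy broadcasting. The sketch only handles 2D parameters and concatenation along axis 0.

```python
def broadcast_2d(param, shape):
    """Expand a scalar or nested list to an explicit rows x cols list of lists."""
    rows, cols = shape
    if not isinstance(param, list):
        param = [[param]]  # treat a scalar as a 1 x 1 table
    out = []
    for i in range(rows):
        src = param[i] if len(param) > 1 else param[0]  # repeat a single row
        out.append([src[j] if len(src) > 1 else src[0] for j in range(cols)])
    return out


def concat_rows(params_and_shapes):
    """Broadcast each parameter to its distribution's shape, then stack rows."""
    stacked = []
    for param, shape in params_and_shapes:
        stacked.extend(broadcast_2d(param, shape))
    return stacked


# d1 is 2 x 2 and d2 is 1 x 2, as in the example above
mu = concat_rows([([[1, 2], [3, 4]], (2, 2)), (0, (1, 2))])
sigma = concat_rows([(1, (2, 2)), ([[2, 42]], (1, 2))])
print(mu)     # [[1, 2], [3, 4], [0, 0]]
print(sigma)  # [[1, 1], [1, 1], [2, 42]]
```

Each parameter is broadcast to the shape of its own distribution first, and only then stacked, which is why the scalar `mu=0` of d2 turns into a full row of zeros.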

We observe the following:

pd.concat([d1.mean(), d2.mean()]) is the same as skpro.concat([d1, d2]).mean()
pd.concat([d1.var(), d2.var()]) is the same as skpro.concat([d1, d2]).var()
pd.concat([d1.pdf(x1), d2.pdf(x2)]) is the same as skpro.concat([d1, d2]).pdf(pd.concat([x1, x2]))

different distributions

SR - what happens if there are two different distributions, e.g., Normal or Laplace?

Example:

d1 = Normal(mu=[[1, 2], [3, 4]], sigma=1)  # 2 x 2
d2 = Laplace(mu=0, scale=[[2, 42]])  # 1 x 2

d = concat([d1, d2], axis=0)  # 3 x 2

What is d?

FK - good question, I think it needs to be the "outer product" distribution, i.e., outer product of probability measures. This could be a separate composite distribution object, and the same as

ConcatDistr([d1, d2], axis=0)

This happens whenever the two distribution types are different, i.e., we are not concatenating Normal with Normal or Laplace with Laplace.

mean and var behave as one would expect, same as above:

pd.concat([d1.mean(), d2.mean()]) is the same as skpro.concat([d1, d2]).mean()
pd.concat([d1.var(), d2.var()]) is the same as skpro.concat([d1, d2]).var()
pd.concat([d1.pdf(x1), d2.pdf(x2)]) is the same as skpro.concat([d1, d2]).pdf(pd.concat([x1, x2]))

Nothing would change here, except that ConcatDistr has to do the concatenations under the hood.
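A rough sketch of that per-component delegation, under strong simplifying assumptions: the classes `ToyNormal`, `ToyLaplace` and `ToyConcatDistr` are hypothetical stand-ins (not skpro classes), parameters are already broadcast to explicit lists of rows, plain lists replace pandas objects, and only axis=0 is handled.

```python
import math


class ToyNormal:
    """Hypothetical stand-in for Normal with explicit 2D parameters."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def mean(self):
        return [list(row) for row in self.mu]

    def pdf(self, x):
        return [
            [math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
             for v, m, s in zip(xr, mr, sr)]
            for xr, mr, sr in zip(x, self.mu, self.sigma)
        ]


class ToyLaplace:
    """Hypothetical stand-in for Laplace with explicit 2D parameters."""

    def __init__(self, mu, scale):
        self.mu, self.scale = mu, scale

    def mean(self):
        return [list(row) for row in self.mu]

    def pdf(self, x):
        return [
            [math.exp(-abs(v - m) / b) / (2 * b) for v, m, b in zip(xr, mr, br)]
            for xr, mr, br in zip(x, self.mu, self.scale)
        ]


class ToyConcatDistr:
    """Row-wise concatenation that delegates each method to its components."""

    def __init__(self, distributions):
        self.distributions = distributions

    def mean(self):
        rows = []
        for d in self.distributions:
            rows.extend(d.mean())  # concatenate per-component results
        return rows

    def pdf(self, x):
        rows, start = [], 0
        for d in self.distributions:
            n = len(d.mu)  # number of rows owned by this component
            rows.extend(d.pdf(x[start:start + n]))
            start += n
        return rows


d1 = ToyNormal(mu=[[1, 2], [3, 4]], sigma=[[1, 1], [1, 1]])
d2 = ToyLaplace(mu=[[0, 0]], scale=[[2, 42]])
d = ToyConcatDistr([d1, d2])

x1, x2 = [[1, 2], [3, 4]], [[0, 0]]
# list "+" plays the role of pd.concat along axis 0 in this toy setting
assert d.mean() == d1.mean() + d2.mean()
assert d.pdf(x1 + x2) == d1.pdf(x1) + d2.pdf(x2)
```

The wrapper has to split the argument of `pdf` by rows before delegating, which is the "concatenation under the hood" in reverse; `mean` needs no splitting because it takes no argument.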

Implementation

FK: I would do two cases

First, detect whether all participating distributions are the same (type/class).

If yes, unwrap and concatenate the parameters, construct again. Perhaps allow only a certain set of distributions to behave like this.

If no, wrap in ConcatDistr. This distribution type has special _mean, _var, _pdf etc, which apply these per component distribution, and then concatenate the results via pd.concat.

Thought: maybe there should be an option in concat, whether we always use ConcatDistr, or not (default?)
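The two cases and the optional switch could be sketched as follows. This is an illustration only: the classes are hypothetical stand-ins, the flag name `always_wrap` and the `FLATTENABLE` allow-list are assumptions, and parameters are taken to be already broadcast to explicit lists of rows.

```python
class ToyDistr:
    """Hypothetical base: parameters are explicit lists of rows."""

    def __init__(self, **params):
        self.params = params


class ToyNormal(ToyDistr):
    pass


class ToyLaplace(ToyDistr):
    pass


class ToyConcatDistr:
    """Generic wrapper that keeps the component distributions as they are."""

    def __init__(self, distributions):
        self.distributions = distributions


# Only these types may be flattened by parameter concatenation (an assumption).
FLATTENABLE = (ToyNormal, ToyLaplace)


def concat(distributions, always_wrap=False):
    """Concatenate along rows; `always_wrap` is a hypothetical option name."""
    first_type = type(distributions[0])
    same_type = all(type(d) is first_type for d in distributions)
    if same_type and first_type in FLATTENABLE and not always_wrap:
        # Case 1: unwrap, concatenate each parameter row-wise, construct again.
        merged = {
            name: [row for d in distributions for row in d.params[name]]
            for name in distributions[0].params
        }
        return first_type(**merged)
    # Case 2: mixed types, or wrapping explicitly requested.
    return ToyConcatDistr(distributions)


n1 = ToyNormal(mu=[[1, 2], [3, 4]], sigma=[[1, 1], [1, 1]])
n2 = ToyNormal(mu=[[0, 0]], sigma=[[2, 42]])
lap = ToyLaplace(mu=[[0, 0]], scale=[[2, 42]])

flat = concat([n1, n2])                      # same type: flattened ToyNormal
mixed = concat([n1, lap])                    # different types: wrapped
forced = concat([n1, n2], always_wrap=True)  # same type, wrapping forced
```

Which value the switch should default to is left open here, as in the note above.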

Nov 03 '24 05:11 SaiRevanth25