
[ENH] design discussion - `pdf` and `pmf` in distributions, discrete, continuous, and mixed

Open fkiraly opened this issue 11 months ago • 1 comments

This is a design discussion on how to handle pdf and pmf in distributions, which can be discrete, continuous (short for "absolutely continuous"), or mixed. Throughout, I assume the domain is the real numbers and that distributions have no singular component.

scipy handles these as follows:

  • pmf is present and pdf is not present, for discrete distributions.
  • pdf is present and pmf is not present, for continuous distributions.
  • no support for mixed distributions.

I think it would be more consistent with composition and unified interfaces a la sklearn if all distributions had both of these methods, and the methods corresponded to the discrete and absolutely continuous measures in the Lebesgue decomposition. That is,

  • pmf and pdf are present in all distributions
  • the sum of measures implied by pmf and pdf is a probability measure
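Written out, the two conditions above amount to the following, as a sketch under the stated assumption of no singular component. The symbols are introduced here for illustration only: f stands for pdf, p for pmf, and S for the countable set of points where p is nonzero.

```latex
% Probability of an event A splits into an absolutely continuous part and a discrete part
P(X \in A) = \int_A f(x)\,\mathrm{d}x + \sum_{x \in A \cap S} p(x)

% Normalisation: the two parts together carry total mass one
\int_{-\infty}^{\infty} f(x)\,\mathrm{d}x + \sum_{x \in S} p(x) = 1
```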

In particular, this would mean:

  • for discrete distributions, pmf sums to one, and pdf is always zero
  • for continuous distributions, pdf integrates to one, and pmf is always zero
  • for mixed distributions, the integral of pdf plus the sum of pmf equals one. In general, neither the pdf integral nor the pmf sum equals one on its own.
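To make the mixed case concrete, here is a minimal pure-Python sketch. The class `ZeroInflatedExponential` and the helper `integrate` are hypothetical names made up for this illustration and are not part of skpro or scipy. The distribution has an atom at 0 with weight 0.3 and an exponential part carrying the remaining 0.7.

```python
import math


class ZeroInflatedExponential:
    """Hypothetical mixed distribution: atom at 0 plus a scaled Exponential(rate) part."""

    def __init__(self, p_zero, rate):
        self.p_zero = p_zero  # probability mass of the atom at x = 0
        self.rate = rate      # rate of the exponential (continuous) part

    def pmf(self, x):
        # discrete part of the Lebesgue decomposition: a single atom at 0
        return self.p_zero if x == 0 else 0.0

    def pdf(self, x):
        # absolutely continuous part, carrying total mass 1 - p_zero
        if x < 0:
            return 0.0
        return (1.0 - self.p_zero) * self.rate * math.exp(-self.rate * x)


def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


d = ZeroInflatedExponential(p_zero=0.3, rate=2.0)
pmf_mass = d.pmf(0.0)                   # sum of pmf over its only atom
pdf_mass = integrate(d.pdf, 0.0, 40.0)  # tail beyond 40 is negligible for rate 2
total = pmf_mass + pdf_mass             # approximately one; neither part is one alone
```

Under this convention, neither `pmf_mass` nor `pdf_mass` is one by itself, but their sum is, up to numerical integration error.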

Being faithful to the Lebesgue decomposition also has an advantage in mixtures: a mixture m = Mixture([d1, d2], [w1, w2]) has m.pdf = w1 * d1.pdf + w2 * d2.pdf and m.pmf = w1 * d1.pmf + w2 * d2.pmf, irrespective of the components d1, d2 being continuous, discrete, or mixed (assuming w1 + w2 == 1).
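The mixture rule can be sketched in a few lines of pure Python. All three classes below (`PointMass`, `Exponential`, and this `Mixture`) are hypothetical stand-ins written for this illustration; they are not skpro's actual classes or signatures. The point is that `Mixture` never needs to ask whether a component is discrete, continuous, or mixed.

```python
import math


class PointMass:
    """Hypothetical discrete distribution: all mass at a single point."""

    def __init__(self, loc):
        self.loc = loc

    def pmf(self, x):
        return 1.0 if x == self.loc else 0.0

    def pdf(self, x):
        return 0.0  # no absolutely continuous part


class Exponential:
    """Hypothetical continuous distribution with the given rate."""

    def __init__(self, rate):
        self.rate = rate

    def pmf(self, x):
        return 0.0  # no atoms

    def pdf(self, x):
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0


class Mixture:
    """Hypothetical mixture: pdf and pmf are the same weighted sums of the components'."""

    def __init__(self, components, weights):
        if not math.isclose(sum(weights), 1.0):
            raise ValueError("weights must sum to one")
        self.components = components
        self.weights = weights

    def pdf(self, x):
        return sum(w * d.pdf(x) for w, d in zip(self.weights, self.components))

    def pmf(self, x):
        return sum(w * d.pmf(x) for w, d in zip(self.weights, self.components))


# discrete + continuous components give a mixed distribution
m = Mixture([PointMass(0.0), Exponential(2.0)], [0.3, 0.7])
# a mixed distribution can itself be a component, with no special casing
m2 = Mixture([m, PointMass(1.0)], [0.5, 0.5])
```

Because every component exposes both methods, nesting works without type checks: `m2.pmf(0.0)` is 0.5 * 0.3 and `m2.pdf(x)` is 0.5 * 0.7 times the exponential density.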

In a sense, this seems to be the convention that treats all edge cases consistently.

Thoughts?

fkiraly, Mar 31 '24 16:03

> Being faithful to the Lebesgue decomposition also has an advantage in mixtures: the pdf and pmf of a m = Mixture([d1, d2], [w1, w2]) has m.pdf = w1 * d1.pdf + w2 * d2.pdf, and m.pmf = w1 * d1.pmf + w2 * d2.pmf, irrespective of components d1, d2 being continuous, discrete, or mixed. (assuming w1 + w2 == 1). In a sense, this seems to be the convention that treats all edge cases consistently.

Yes, that is correct. It will handle all edge cases, irrespective of d1 and d2 being continuous, discrete, or mixed: wherever a distribution is purely discrete on an interval, its pdf integrates to 0 there and only the pmf contributes; wherever it is purely continuous on an interval, its pmf sums to 0 there and only the pdf contributes. So for a mixed distribution, m.pdf = w1 * d1.pdf + w2 * d2.pdf and m.pmf = w1 * d1.pmf + w2 * d2.pmf will still hold. And the integral of m.pdf plus the sum of m.pmf will equal 1 when taken over the whole real line, i.e. (-inf, inf).

ShreeshaM07, Mar 31 '24 17:03