
[ENH] Zero-inflated distributions

Open fkiraly opened this issue 6 months ago • 11 comments

Zero-inflated distributions are important in intermittent demand forecasting, so it would be nice to have these as explicit distributions, and/or a compositor.

Based on discussion with @tingiskhan in https://github.com/sktime/sktime/pull/8438.

I can think of three high-priority ones, for discussion:

  • [ ] zero-inflated Poisson
  • [ ] zero-inflated Negative Binomial
  • [ ] zero-inflated compositor, i.e., inflates any distribution with zeros, e.g., ZeroInflated(any_distribution, zero_rate=0). This should have analytic formulae for all of the distribution defining methods.

A possible approach - though challenging - might be to write the zero-inflated compositor first, and then define zero-inflated Poisson and NB by delegation, using _DelegatedDistribution.
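To make the compositor idea concrete, here is a minimal pure-Python sketch. It is not skpro code: `ZeroInflatedSketch`, `poisson_pmf`, `poisson_cdf` and `zero_inflated_poisson` are hypothetical names used for illustration, and only `zero_rate` is taken from the proposal above. It assumes an integer-valued base distribution and shows that the pmf and cdf follow analytically from those of the base distribution.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ZeroInflatedSketch:
    """Hypothetical compositor: a point mass at 0 mixed with a base distribution."""

    base_pmf: Callable[[int], float]
    base_cdf: Callable[[float], float]
    zero_rate: float = 0.0  # probability of a structural (inflated) zero

    def pmf(self, k: int) -> float:
        # Zeros can come from the point mass and from the base distribution.
        inflated = self.zero_rate if k == 0 else 0.0
        return inflated + (1.0 - self.zero_rate) * self.base_pmf(k)

    def cdf(self, x: float) -> float:
        # The point mass contributes its full weight as soon as x >= 0.
        step = self.zero_rate if x >= 0 else 0.0
        return step + (1.0 - self.zero_rate) * self.base_cdf(x)


def poisson_pmf(k: int, mu: float) -> float:
    if k < 0:
        return 0.0
    return math.exp(-mu) * mu**k / math.factorial(k)


def poisson_cdf(x: float, mu: float) -> float:
    if x < 0:
        return 0.0
    return sum(poisson_pmf(k, mu) for k in range(math.floor(x) + 1))


def zero_inflated_poisson(mu: float, zero_rate: float) -> ZeroInflatedSketch:
    """Zero-inflated Poisson, obtained by delegating to the compositor."""
    return ZeroInflatedSketch(
        base_pmf=lambda k: poisson_pmf(k, mu),
        base_cdf=lambda x: poisson_cdf(x, mu),
        zero_rate=zero_rate,
    )
```

With `zero_rate=0` the sketch reduces to the base distribution, which matches the default suggested in the proposal.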

fkiraly avatar Jun 23 '25 20:06 fkiraly

Come to think of it, ZeroInflated is just a special case of Mixture, between any_distribution and the Delta distribution (supported at zero) - still might make sense to implement it separately.

We could use the _DelegatedDistribution to delegate to a Mixture with Delta, though there may be an efficiency loss in doing so. Also, Mixture distributions in general have no explicit formula for the inverse cdf, while mixtures with Delta do, hence I do think we should still implement ZeroInflated from scratch.
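As a rough illustration of the explicit inverse cdf, the sketch below assumes a base distribution supported on the non-negative numbers (as for counts) and uses a Poisson base for concreteness. All function names are hypothetical and none of this is skpro API. With zero rate $\pi$ and base cdf $F$, the inflated cdf at zero is $\pi + (1 - \pi) F(0)$; quantiles at or below that level are 0, and the rest are base quantiles at the rescaled level $(q - \pi) / (1 - \pi)$.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    return math.exp(-mu) * mu**k / math.factorial(k)


def poisson_ppf(q: float, mu: float) -> int:
    """Smallest k with Poisson cdf(k) >= q, found by summing the pmf."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    k, total = 0, poisson_pmf(0, mu)
    while total < q:
        k += 1
        total += poisson_pmf(k, mu)
    return k


def zero_inflated_ppf(q, zero_rate, base_cdf_at_zero, base_ppf):
    """Quantile of a zero-inflated distribution.

    Assumes the base distribution has no mass below zero.
    """
    # Value of the inflated cdf at zero.
    cdf_at_zero = zero_rate + (1.0 - zero_rate) * base_cdf_at_zero
    if q <= cdf_at_zero:
        return 0
    # Above that level, remove the point mass and rescale to the base cdf.
    return base_ppf((q - zero_rate) / (1.0 - zero_rate))


def zip_ppf(q: float, mu: float, zero_rate: float) -> int:
    """Zero-inflated Poisson quantile built from the pieces above."""
    return zero_inflated_ppf(
        q, zero_rate, poisson_pmf(0, mu), lambda u: poisson_ppf(u, mu)
    )
```

A base distribution with mass below zero would need a third branch for levels below $(1 - \pi) F(0^-)$, which is omitted here.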

fkiraly avatar Jun 23 '25 20:06 fkiraly

I think we're in luck regarding the Hurdle distribution: https://www.statsmodels.org/dev/examples/notebooks/generated/count_hurdle.html .

Edit: Or perhaps not since statsmodels treats the distribution as a model rather than distribution.

tingiskhan avatar Jun 27 '25 07:06 tingiskhan

good first issue, the Hurdle distribution can be taken as a template for this

fkiraly avatar Aug 17 '25 16:08 fkiraly

Hey @fkiraly, I would love to work on this issue; I have reviewed the existing Hurdle implementation. For zero-inflated distributions we have to allow zeros coming from the base distribution as well, unlike Hurdle. I was initially thinking of using TruncatedDistribution with a bound slightly below zero (e.g. -np.finfo(float).eps) to ensure 0 remains included, but would it make more sense to extend TruncatedDistribution with a parameter like inclusive=True (defaulting to False to preserve current behavior) so that we can properly model the lower bound?

Khushmagrawal avatar Nov 22 '25 17:11 Khushmagrawal

Side note: While going through the Hurdle distribution code, I noticed that the CDF code would return 1 - p for x < 0, but shouldn't it be zero? I might be missing something here.

Khushmagrawal avatar Nov 24 '25 11:11 Khushmagrawal

Side note: While going through the Hurdle distribution code, I noticed that the CDF code would return 1 - p for x < 0, but shouldn't it be zero? I might be missing something here.

@tingiskhan, is this a bug?

fkiraly avatar Nov 24 '25 18:11 fkiraly

Hey @fkiraly, I would love to work on this issue; I have reviewed the existing Hurdle implementation. For zero-inflated distributions we have to allow zeros coming from the base distribution as well, unlike Hurdle. I was initially thinking of using TruncatedDistribution with a bound slightly below zero (e.g. -np.finfo(float).eps) to ensure 0 remains included, but would it make more sense to extend TruncatedDistribution with a parameter like inclusive=True (defaulting to False to preserve current behavior) so that we can properly model the lower bound?

Interesting idea - yes, I agree that this is an approach to consider, but please note that TruncatedDistribution is not the same as a min/max distribution (aka clipped distribution).

There are three different concepts here. For an interval $I = [a, b]$ and a vanilla distribution $d$, let's assume a random variable $X$ is distributed according to $d$.

  • the truncated distribution, which is $X$ conditional on $X\in I$. Note that for a continuous distribution $X$, this has no mass on $a$ and $b$, and is continuous itself.
  • the clipped distribution, which is $\max(\min(X, b), a)$. In general this has mass on $a$ and $b$. The Hurdle distribution is a special case of this.
  • the zero-inflated distribution, which is a mixture of a draw from $X$, and the value 0, with respective masses. A generalization of this can be the conditional from the first bullet point inflated by masses on $a$ and $b$.

In particular, the first two are distinct, but could be covered in a single distribution. However, this may confuse the user, so different classes may be of benefit - a clipped distribution, and a boundary inflated one.

To complicate things, all of these may produce special cases of clipped distributions inflated by boundary masses, but even if the distribution is the same, the parameterization will be different, depending on which one of the three you pick!
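A small worked example may help to keep the three concepts apart. The sketch below uses a toy discrete distribution with exact fractions; the helper names `truncate`, `clip` and `zero_inflate` and the toy pmf are made up for illustration and are not skpro API.

```python
from fractions import Fraction as F


def truncate(pmf, a, b):
    """X conditional on a <= X <= b: drop outside mass, then renormalise."""
    kept = {x: p for x, p in pmf.items() if a <= x <= b}
    total = sum(kept.values())
    return {x: p / total for x, p in kept.items()}


def clip(pmf, a, b):
    """max(min(X, b), a): outside mass is moved onto the boundaries."""
    out = {}
    for x, p in pmf.items():
        y = max(min(x, b), a)
        out[y] = out.get(y, 0) + p
    return out


def zero_inflate(pmf, zero_rate):
    """Mixture of a draw from X (weight 1 - zero_rate) and the value 0."""
    out = {x: (1 - zero_rate) * p for x, p in pmf.items()}
    out[0] = out.get(0, 0) + zero_rate
    return out


# Toy base distribution on {-1, 0, 1, 2}.
base = {-1: F(1, 10), 0: F(2, 10), 1: F(3, 10), 2: F(4, 10)}

truncated = truncate(base, 0, 1)       # {0: 2/5, 1: 3/5}
clipped = clip(base, 0, 1)             # {0: 3/10, 1: 7/10}
inflated = zero_inflate(base, F(1, 2))  # {-1: 1/20, 0: 3/5, 1: 3/20, 2: 1/5}
```

Truncation removes the outside mass and renormalises, clipping moves it onto the boundaries, and zero inflation keeps the whole support while adding mass at 0, so the three results differ even for the same base distribution.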

fkiraly avatar Nov 24 '25 18:11 fkiraly

Interesting idea - yes, I agree that this is an approach to consider, but please note that TruncatedDistribution is not the same as a min/max distribution (aka clipped distribution).

Yes, and since the base distribution is conditional on $X \geq 0$ it should be truncated, right? #648 implements it with that assumption.

the clipped distribution, which is $\max(\min(X, b), a)$. In general this has mass on $a$ and $b$. The Hurdle distribution is a special case of this.

I may be missing something here, but currently I view the Hurdle distribution as being closer to a truncated distribution than a clipped one. My understanding is that for Hurdle, since the values $x \leq 0$ are not possible for the base distribution, the base distribution is conditioned on $x > 0$, which feels aligned with truncation logic. For a clipped distribution, by contrast, values outside the interval are still possible but are projected onto the boundary, meaning no probability mass is removed.

Khushmagrawal avatar Nov 25 '25 08:11 Khushmagrawal

Hm, from what I can see it's defined as follows:

    def _cdf(self, x):
        is_positive = x > 0.0
        prob_positive = self._truncated_distribution.cdf(x)

        return np.where(
            is_positive, (1.0 - self.p) + self.p * prob_positive, 1.0 - self.p
        )

Meaning that x > 0 corresponds to p and x <= 0 is 1 - p, no?

tingiskhan avatar Nov 25 '25 10:11 tingiskhan

Meaning that x > 0 corresponds to p and x <= 0 is 1 - p, no?

Yes, but I think it should be 1 − p only at x = 0? For x < 0 the CDF should be 0.

Khushmagrawal avatar Nov 25 '25 11:11 Khushmagrawal

Right, that's correct - good catch!

tingiskhan avatar Nov 25 '25 11:11 tingiskhan
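For reference, the behaviour agreed on above can be sketched as a scalar function. This is an illustrative stand-alone version, not a patch to the skpro `_cdf` method: `hurdle_cdf` and `zero_truncated_poisson_cdf` are hypothetical names, `p` is the probability of a positive value as in the snippet quoted earlier, and a zero-truncated Poisson stands in for the truncated base distribution.

```python
import math


def zero_truncated_poisson_cdf(x: float, mu: float) -> float:
    """CDF of a Poisson(mu) variable conditioned on being >= 1."""
    if x < 1:
        return 0.0
    p0 = math.exp(-mu)
    cdf = sum(
        math.exp(-mu) * mu**k / math.factorial(k)
        for k in range(math.floor(x) + 1)
    )
    return (cdf - p0) / (1.0 - p0)


def hurdle_cdf(x: float, p: float, truncated_cdf) -> float:
    """Hurdle CDF with mass 1 - p at zero and mass p on positive values."""
    if x < 0:
        # No mass below zero, so the CDF is 0 here rather than 1 - p.
        return 0.0
    # truncated_cdf(x) is 0 at x = 0, so this evaluates to 1 - p there.
    return (1.0 - p) + p * truncated_cdf(x)
```

The only difference from the quoted snippet is the extra branch for x < 0; for x >= 0 the values are unchanged.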