[ENH] Zero-inflated distributions
Zero-inflated distributions are important in intermittent demand forecasting, so it would be nice to have these as explicit distributions, and/or a compositor.
Based on discussion with @tingiskhan in https://github.com/sktime/sktime/pull/8438.
I can think of three high-priority ones, for discussion:
- [ ] zero-inflated Poisson
- [ ] zero-inflated Negative Binomial
- [ ] zero-inflated compositor, i.e., inflates any distribution with a point mass at zero, e.g.,
ZeroInflated(any_distribution, zero_rate=0). This should have analytic formulae for all of the distribution-defining methods.
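For reference, the defining formulae are simple to write down. With an extra zero mass $\pi$ (the zero_rate above) and a discrete base distribution with pmf $f$, cdf $F$, mean $\mu$, and variance $\sigma^2$, a sketch of the analytic forms would be:

$$
p_{\mathrm{ZI}}(x) = \pi\,\mathbf{1}[x = 0] + (1 - \pi)\,f(x), \qquad
F_{\mathrm{ZI}}(x) = \pi\,\mathbf{1}[x \geq 0] + (1 - \pi)\,F(x),
$$

$$
\mathbb{E}[X_{\mathrm{ZI}}] = (1 - \pi)\,\mu, \qquad
\mathrm{Var}[X_{\mathrm{ZI}}] = (1 - \pi)\,\sigma^2 + \pi (1 - \pi)\,\mu^2 .
$$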
A possible approach - though challenging - might be to write the zero-inflated compositor first, and then define zero-inflated Poisson and NB by delegation, using _DelegatedDistribution.
Come to think of it, ZeroInflated is just a special case of Mixture, between any_distribution and the Delta distribution (supported at zero) - still might make sense to implement it separately.
We could use _DelegatedDistribution to delegate to a Mixture with Delta, though there may be an efficiency loss in doing so. Also, Mixture distributions in general have no explicit formula for the inverse cdf, while mixtures with Delta do, hence I do think we should still implement ZeroInflated from scratch.
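To illustrate the point about the inverse cdf: because all of the extra mass sits at a single point, the quantile function has a closed form. A standalone numpy/scipy sketch (function and parameter names are illustrative, not the skpro API), assuming a nonnegative base distribution:

```python
import numpy as np
from scipy import stats


def zero_inflated_ppf(q, base, pi):
    """Quantile function of a zero-inflated version of a nonnegative base.

    The cdf is G(x) = pi + (1 - pi) * F(x) for x >= 0, so for level q:
    - if q <= pi + (1 - pi) * F(0), the quantile is 0,
    - otherwise it is F^{-1}((q - pi) / (1 - pi)).
    """
    q = np.asarray(q, dtype=float)
    threshold = pi + (1 - pi) * base.cdf(0)
    inner = np.clip((q - pi) / (1 - pi), 0.0, 1.0)  # numerical safeguard
    return np.where(q <= threshold, 0.0, base.ppf(inner))


# example: zero-inflated Poisson with extra zero mass 0.3
print(zero_inflated_ppf([0.1, 0.5, 0.95], stats.poisson(mu=2.0), pi=0.3))
```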
I think we're in luck regarding the Hurdle distribution: https://www.statsmodels.org/dev/examples/notebooks/generated/count_hurdle.html .
Edit: or perhaps not, since statsmodels treats it as a model rather than a distribution.
Good first issue - the Hurdle distribution can be taken as a template for this.
Hey @fkiraly, I would love to work on this issue. I have reviewed the existing Hurdle implementation.
For zero-inflated distributions we have to allow zeros coming from the base distribution as well, unlike Hurdle. I was initially thinking of using TruncatedDistribution with a bound slightly below zero (e.g., -np.finfo(float).eps) to ensure 0 remains included, but would it make more sense to extend TruncatedDistribution with a parameter like inclusive=True (defaulting to False to preserve current behavior), so that we can properly model an inclusive lower bound?
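To make the concern concrete, here is a standalone scipy sketch (not the skpro TruncatedDistribution API) of the difference between conditioning a discrete base on X > 0 versus X >= 0 - only the latter keeps the zeros coming from the base itself:

```python
from scipy import stats

base = stats.poisson(mu=1.5)

# exclusive lower bound at 0: condition on X > 0, all base mass at 0 is removed
p0_exclusive = 0.0

# inclusive lower bound at 0: condition on X >= 0, base zeros are kept
# (for a Poisson this conditioning is a no-op, since the support is already >= 0)
p0_inclusive = base.pmf(0) / (1.0 - base.cdf(-1))

print(p0_exclusive, p0_inclusive)  # 0.0 vs exp(-1.5) ~ 0.223
```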
Side note: While going through the Hurdle distribution code, I noticed that the CDF code would return 1 - p for x < 0, but shouldn't it be zero? I might be missing something here.
> Side note: While going through the Hurdle distribution code, I noticed that the CDF code would return 1 - p for x < 0, but shouldn't it be zero? I might be missing something here.
@tingiskhan, is this a bug?
> Hey @fkiraly, I would love to work on this issue. I have reviewed the existing Hurdle implementation. For zero-inflated distributions we have to allow zeros coming from the base distribution as well, unlike Hurdle. I was initially thinking of using TruncatedDistribution with a bound slightly below zero (e.g., -np.finfo(float).eps) to ensure 0 remains included, but would it make more sense to extend TruncatedDistribution with a parameter like inclusive=True (defaulting to False to preserve current behavior), so that we can properly model an inclusive lower bound?
Interesting idea - yes, I agree that this is an approach to consider, but please note that TruncatedDistribution is not the same as a min/max distribution (aka clipped distribution).
There are three different concepts here. For an interval $I = [a, b]$ and a vanilla distribution $d$, let a random variable $X$ be distributed according to $d$.
- the truncated distribution, which is $X$ conditional on $X\in I$. Note that for a continuous distribution $X$, this has no mass on $a$ and $b$, and is continuous itself.
- the clipped distribution, which is $\max(\min(X, b), a)$. In general this has mass on $a$ and $b$. The Hurdle distribution is a special case of this.
- the zero-inflated distribution, which is a mixture of a draw from $X$ and the constant value 0, with respective mixture weights. A generalization of this is the conditional from the first bullet point, inflated by point masses on $a$ and $b$.
In particular, the first two are distinct, but could be covered in a single distribution. However, this may confuse the user, so different classes may be of benefit - a clipped distribution, and a boundary-inflated one.
To complicate things, all of these may produce special cases of clipped distributions inflated by boundary masses, but even if the resulting distribution is the same, the parameterization will be different depending on which of the three you pick!
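A small standalone scipy sketch (illustrative only, not skpro code) of how the three differ at the lower boundary $a$, for a Normal base on $[a, b] = [0, 2]$:

```python
from scipy import stats

X = stats.norm(loc=1.0, scale=1.0)
a, b = 0.0, 2.0

# truncated: X conditional on X in [a, b] -- continuous, no mass at a or b
mass_at_a_truncated = 0.0

# clipped: max(min(X, b), a) -- all base mass below a is moved onto a
mass_at_a_clipped = X.cdf(a)

# boundary-inflated (zero-inflated for a = 0): an extra point mass pi is
# mixed in at a, regardless of where the base puts its mass
pi = 0.3
mass_at_a_inflated = pi

print(mass_at_a_truncated, mass_at_a_clipped, mass_at_a_inflated)
```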
> Interesting idea - yes, I agree that this is an approach to consider, but please note that TruncatedDistribution is not the same as a min/max distribution (aka clipped distribution).
Yes, and since the base distribution is conditional on $X \geq 0$, it should be truncated, right? #648 implements it with that assumption.
> the clipped distribution, which is $\max(\min(X, b), a)$. In general this has mass on $a$ and $b$. The Hurdle distribution is a special case of this.
I may be missing something here, but I currently view the Hurdle distribution as being closer to a truncated distribution than a clipped one. My understanding is that for Hurdle, since values $x \leq 0$ are not possible for the base distribution, the base distribution is conditioned on the positive part, which feels aligned with truncation logic. For a clipped distribution, on the other hand, values outside the interval are still possible but are projected onto the boundary, meaning no probability mass is removed.
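For reference, writing the Hurdle cdf in terms of a zero-truncated base cdf $F_{>0}$ and a hurdle-crossing probability $p$ (matching the snippet quoted below) would give something like

$$
G(x) =
\begin{cases}
0 & x < 0, \\
(1 - p) + p\,F_{>0}(x) & x \geq 0,
\end{cases}
$$

i.e., the positive part does come from a truncated base, but the construction still places a point mass $1 - p$ at zero.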
Hm, from what I can see it's defined as the following:

```python
def _cdf(self, x):
    is_positive = x > 0.0
    prob_positive = self._truncated_distribution.cdf(x)
    return np.where(
        is_positive, (1.0 - self.p) + self.p * prob_positive, 1.0 - self.p
    )
```
Meaning that x > 0 corresponds to p and x <= 0 is 1 - p, no?
> Meaning that x > 0 corresponds to p and x <= 0 is 1 - p, no?
Yes, but I think it should be 1 - p only at x = 0? For x < 0 the CDF should be 0.
Right, that's correct - good catch!
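For concreteness, a corrected version of the branch could look something like this (a sketch against the snippet above, not a reviewed patch):

```python
import numpy as np


def _cdf(self, x):
    # zero for x < 0; 1 - p at x = 0; hurdle-scaled truncated cdf for x > 0
    is_positive = x > 0.0
    is_nonnegative = x >= 0.0
    prob_positive = self._truncated_distribution.cdf(x)
    return np.where(
        is_positive,
        (1.0 - self.p) + self.p * prob_positive,
        np.where(is_nonnegative, 1.0 - self.p, 0.0),
    )
```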