
CDFs for discrete variables and PMFs only allow integer values

Open ethan-alt opened this issue 3 years ago • 20 comments

Description

Functions such as bernoulli_cdf and bernoulli_pmf only accept integer values. This is mathematically imprecise, as CDFs and PMFs should be defined for all real numbers (e.g., pbinom and dbinom in R accept non-integer values).

This is a problem, e.g., if you have a matrix Y of values where each column has a different distribution (e.g., first column is Gaussian, second column is Bernoulli). Such constructs are necessary to create a general Gaussian copula function.

Example

bernoulli_cdf(0.5, 0.75);

Expected Output

0.25

Currently, no Stan program will compile if a non-integer value is passed as the first argument to bernoulli_cdf(0.5, 0.75).

Current Version:

v4.0.1

ethan-alt avatar Apr 11 '21 18:04 ethan-alt

Thanks for reporting.

This is mathematically imprecise, as CDFs and PMFs should be defined for all real numbers (e.g., pbinom and dbinom in R accept non-integer values).

Stan's definition is decidedly non-standard. I'll take the blame with my obsession for type checking. I wasn't expecting anyone to check the values of a cdf outside the support of the pmf (or pdf), but this isn't the first time it's come up.

We could relax this with some work on the parser, code generator, math library, and documentation. We'd either have to make the cdf arguments non-differentiable or define derivatives for all of them (constant between support points, undefined at the support points). That technically breaks the assumptions behind HMC, which requires smooth functions, but it can still be workable. And given that the derivatives are constant at zero, they never play a role in guiding parameters.

Could you explain more about the use case where you expect the cdfs to be defined outside of the support of the pmf? An example you'd like to write in Stan that doesn't work because of this constraint would be most compelling. Do you need derivatives? (They'll all be zero, and they're not defined at the support of the pmf because of the discontinuity.)

bob-carpenter avatar Apr 11 '21 18:04 bob-carpenter

The specific use case is that maybe we have a matrix, Y, containing Gaussian, Binary, and Poisson values, so Y has to be a real matrix. For the discrete variables, one must evaluate the CDFs at y and y-1 for a Gaussian copula.
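To sketch the idea (a hypothetical Python illustration, not Stan; the names poisson_cdf and latent_interval are made up for this example): for a discrete margin, the Gaussian copula needs the probability that the latent uniform falls in the interval (F(y-1), F(y)], which is exactly why the CDF gets evaluated at both y and y-1.

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation over the pmf."""
    if k < 0:
        return 0.0
    return sum(exp(-lam) * lam ** i / factorial(i) for i in range(int(k) + 1))

def latent_interval(y, lam):
    """For a discrete margin in a Gaussian copula, observation y corresponds
    to the latent-uniform interval (F(y - 1), F(y)]."""
    return poisson_cdf(y - 1, lam), poisson_cdf(y, lam)

lo, hi = latent_interval(3, 2.0)  # bounds on the latent u when y = 3, lambda = 2
```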

It's possible to work around it by having two input matrices, but sadly the solution for tying it all together becomes rather unclean.

Derivatives would not be necessary.

ethan-alt avatar Apr 11 '21 19:04 ethan-alt

Thanks for the elaboration. I'm still trying to imagine what the Stan code would look like. The obstacle to encoding binary, count, and real-valued entries is that Stan doesn't have heterogeneous data structures like this. All matrices are real-valued in Stan and there's no way to convert real values to integers without losing the gradient information that drives Stan's HMC sampler.

Sorry to keep asking, but do you have a sketch of what the code would look like if Stan did extend the range of the CDFs to real values?

Depending on the distribution, a workaround might be to directly implement the cdf as a function in Stan. It's unlikely we'd be able to do this quickly even if the decision is made to change the support of cdfs.

After many years, I think you're right---we should've had the cdfs defined on all of R. It was my fault for defining them the way they are defined (and my excuse is that I was trained as a computer scientist, not a statistician). That still wouldn't solve problems with derivatives, but at least you'd be able to get the values.

bob-carpenter avatar Apr 14 '21 20:04 bob-carpenter

I think a function that allows for the evaluation of CDFs over the real line would work fine.

For background, I have attached some code that will not work, but what I was attempting to implement.

Basically, the user would input, say, a N x 3 matrix of normal, Bernoulli, and Poisson dependent variables (along with covariates, if applicable) and have an indicator of distribution type (e.g., 1 = normal, 2 = Bernoulli, 3 = Poisson).

For the purposes of this specific application, I've developed a workaround where the user must input separate arrays for each distribution type, but the code is not as clean. Still, I feel the option to at least evaluate the CDF over the real line would be a good implementation.

copula_glm.zip

ethan-alt avatar Apr 15 '21 13:04 ethan-alt

@bob-carpenter this PR to my helpful stan functions repo has the code and a reproducible R example. Same as attached by Ethan but possibly a bit cleaner.

https://github.com/spinkney/helpful_stan_functions/pull/13/commits/e603fb9c815186ce628cd441371c5270877afe55

spinkney avatar Apr 19 '21 20:04 spinkney

This is mathematically imprecise, as CDFs and PMFs should be defined for all real numbers (e.g., pbinom and dbinom in R accept non-integer values).

This is incorrect. Cumulative distribution functions and probability mass functions for distributions defined on discrete spaces are defined on only the discrete elements of the set. Even if one embeds those discrete spaces into the real numbers then the discrete probability mass function and cumulative distribution functions do not specify real-valued functions. Indeed there are an infinite number of real-valued functions that interpolate between any PMF/CDF at the discrete values for which they are defined.

Many software implementations of common PMF/CDF families, such as those in R, exploit analytic interpolations of discrete functions, such as the gamma function interpolation of the discrete factorial function. Admitting real-valued inputs is an artifact of the implementation, not a mathematical consequence of discrete probability theory.
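For example, in Python's standard library (a minimal illustration of the interpolation point, nothing Stan-specific):

```python
import math

# The gamma function interpolates the factorial: Gamma(n + 1) = n!
for n in range(6):
    assert math.gamma(n + 1) == math.factorial(n)

# But it is also defined between the integers; accepting real inputs there
# is a property of the interpolation, not of the discrete factorial itself.
value = math.gamma(3.5)  # lies between Gamma(3) = 2 and Gamma(4) = 6
```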

A program that wants to overload discrete PMF/CDF families can just use the analytic real-valued interpolations directly — for example gamma_p and gamma_q are both exposed in the Stan modeling language.

betanalpha avatar May 03 '21 17:05 betanalpha

To @betanalpha's point, this is pretty straightforward. Here are the overloaded binomial and Poisson cdfs in R:

> library(gsl)
> cpoisson_cdf <- function(x, lambda) {
+   gamma_inc_Q(x + 1, lambda) 
+ }
> 
> cpoisson_cdf(2, 5)
[1] 0.124652
> ppois(2, 5)
[1] 0.124652
> 
> cbinomial_cdf <- function(x, N, p) {
+   1 - beta_inc(x + 1, N - x, p)
+ }
> 
> cbinomial_cdf(2, 10, 0.3)
[1] 0.3827828
> pbinom(2, 10, 0.3)
[1] 0.3827828

spinkney avatar May 10 '21 10:05 spinkney

This is incorrect. Cumulative distribution functions and probability mass functions for distributions defined on discrete spaces are defined on only the discrete elements of the set.

It is not incorrect. The definition of a mass function for a random variable X evaluated at value x is "the probability that X is equal to x," i.e., p(x) = P(X = x). For example, if X is a Bernoulli random variable with success probability 0.25, then p(0) = 0.75, p(1) = 0.25, and p(0.5) = 0 because P(X = 0.5) = 0, since X can only be 0 or 1.

For CDFs, F(x) = P(X <= x) is similarly always well defined. In the above example, F(-1) = 0 because X has to be at least 0, F(0) = 0.75 because P(X <= 0) = P(X = 0) = 0.75 and F(1.5) = P(X <= 1.5) = P(X <= 1) = 1 since X cannot be larger than 1.
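This step-function definition is easy to write down directly; a minimal sketch (hypothetical Python mirroring the example's numbers, not Stan code):

```python
def bernoulli_cdf(x, theta):
    """F(x) = P(X <= x) for X ~ Bernoulli(theta), defined for every real x."""
    if x < 0:
        return 0.0          # X is at least 0
    if x < 1:
        return 1.0 - theta  # P(X <= x) = P(X = 0) for 0 <= x < 1
    return 1.0              # X cannot exceed 1
```

With theta = 0.25 this reproduces F(-1) = 0, F(0) = 0.75, and F(1.5) = 1 from the example above.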

ethan-alt avatar May 10 '21 19:05 ethan-alt

The Wikipedia entry for probability mass functions has two definitions in the first two paragraphs. Their simple definition matches @ethan-alt's (although it restricts to real numbers, not any old x) and their more measure-theoretic one matches @betanalpha's. Their definition of the Bernoulli PMF has exactly the domain {0, 1} and isn't extended to the real numbers.

Setting all the mathematical generalizations aside, the question for our API isn't whether it's possible to extend the definition of a discrete pmf (or cdf) to all real numbers, but whether it's more helpful to extend the domain or to restrict it and catch potential errors early. As a type-sensitive language designer, I went with the catch-errors-early approach. I'm not trying to argue it's the only way things could have been designed or even that it's the best design. I'm just explaining why I defined things that way to begin with.

Stan's underlying typing also requires an integer argument for a Bernoulli pmf. It won't compile if you try to give it a real number and will throw a runtime exception if it tries to evaluate the probability of an integer other than 0 or 1. I made that decision before Stan 1.0 to err on the side of catching errors as quickly as possible. Nevertheless, it's possible to define even a mixed continuous/discrete variable in Stan using mixtures (e.g., spike and slab prior with proper delta function spike).

bob-carpenter avatar May 10 '21 20:05 bob-carpenter

(from @ethan-alt) Basically, the user would input, say, a N x 3 matrix of normal, Bernoulli, and Poisson dependent variables (along with covariates, if applicable) and have an indicator of distribution type (e.g., 1 = normal, 2 = Bernoulli, 3 = Poisson).

This is how R works and I can see where you're trying to go with this. I decided to make Stan strongly typed and to distinguish integer and real values like in C++ and Java (more like Java in that you can't assign real values to integers). You're not the only one who wished we'd typed Stan like R.

My motivation was that it's generally considered bad form in numerical programming to use floating point for discrete problems or to even try to compare floating point numbers with == (for example, there are two zero values in standard IEEE floating point, which are == but do not have the same behavior everywhere).
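A small Python illustration of the two-zeros point (standard library only):

```python
import math

pos, neg = 0.0, -0.0

# The two IEEE 754 zeros compare equal...
assert pos == neg

# ...but they are observably different values:
assert math.copysign(1.0, pos) == 1.0
assert math.copysign(1.0, neg) == -1.0
assert math.atan2(0.0, pos) == 0.0      # atan2 distinguishes the sign of zero
assert math.atan2(0.0, neg) == math.pi
```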

bob-carpenter avatar May 10 '21 20:05 bob-carpenter

@bob-carpenter can't we get around that by defining new functions for the continuous/relaxed version of discrete cdfs? So we could have a "relaxed_poisson_cdf" or "continuous_approx_poisson_cdf" or whatever name people like the most.

spinkney avatar May 10 '21 20:05 spinkney

@ethan-alt: The definition of a mass function for a random variable X evaluated at value x is "the probability that X is equal to x, i.e., p(x) = P(X = x).

Technically true, but @betanalpha's point was that there are alternative "interpolations" that are commonly used. See for instance @spinkney's post right above yours that begins "this is pretty straightforward" and then gives the wrong definition. That sort of confusion can be avoided only by leaving the functions undefined outside of their natural domain.

The Stan language has enough combinatorial functions that you can build your own discontinuous CDFs if you need them.

real real_binomial_cdf(real x, int N, real p) {
  // P(X <= floor(x)) via the regularized incomplete beta function
  return 1 - inc_beta(floor(x) + 1, N - floor(x), p);
}
real real_poisson_cdf(real x, real lambda) {
  // P(X <= floor(x)) = Q(floor(x) + 1, lambda); note the + 1
  return 1 - gamma_p(floor(x) + 1, lambda);
}
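These floor-based step functions can also be checked outside Stan against direct pmf sums; a pure-Python sketch (the function names mirror the Stan ones above but the implementations here just sum the pmfs, using the identity P(X <= floor(x)) for real x):

```python
from math import comb, exp, factorial, floor

def real_binomial_cdf(x, N, p):
    """Floor-based extension of the binomial CDF to real x,
    computed by direct summation of the pmf."""
    k = min(floor(x), N)
    if k < 0:
        return 0.0
    return sum(comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(k + 1))

def real_poisson_cdf(x, lam):
    """Floor-based extension of the Poisson CDF to real x."""
    k = floor(x)
    if k < 0:
        return 0.0
    return sum(exp(-lam) * lam ** i / factorial(i) for i in range(k + 1))
```

The result is constant between integers: real_binomial_cdf(2.0, 10, 0.3) and real_binomial_cdf(2.9, 10, 0.3) both equal pbinom(2, 10, 0.3) from earlier in the thread.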

nhuurre avatar May 11 '21 09:05 nhuurre

It's not "wrong"; it's the "continuous" or "interpolated" version that doesn't use floor.

spinkney avatar May 11 '21 11:05 spinkney

@bob-carpenter I agree that whether PMFs should be extended along the real line is a matter of some debate (although the article you posted does mention that the Radon-Nikodym derivative is typically extended to the real line, so even the measure-theoretic definition might include all real numbers), but it is not a debate for CDFs. CDFs are the only thing that discrete and continuous random variables have in common, and it's fundamentally important that CDFs are defined on the entire real line. I will also note that the C code that powers dbinom and pbinom in R takes its arguments as doubles. Nevertheless, I understand the discontinuity (no pun intended), and I'm not saying that what is currently done in Stan is "wrong," just that it "could be more accurate."

I certainly can understand that you don't want to break existing code, which is why I think @spinkney's solution is a good one--perhaps have a separate function like binomial_cdf_real or something.

@nhuurre Of course, one can always build their own functions. But that's true of anything. Stan is not intended to be a general purpose language. It is intended to be utilized as a software package for people in applied statistics. As such, it makes sense that, at a minimum, the precise mathematical definitions of statistical functions are available.

I'm not on the dev team of course, so perhaps my vote counts the least in this discussion. But as an avid user and fan of Stan, who dabbles in statistical package development, I hope it counts for something.

ethan-alt avatar May 11 '21 12:05 ethan-alt

(@spinkney): It's not "wrong"; it's the "continuous" or "interpolated" version that doesn't use floor.

Is it intended as a pdf rather than pmf, and if so, is it normalized? I've never quite understood these tactics, though there's a really neat paper by Will Grathwohl et al. that leans on this idea to add gradient-like information to discrete distributions that looks very promising for things like count variables: https://arxiv.org/abs/2102.04509

(@ethan-alt) CDFs are the only thing that discrete and continuous random variables have in common. It's fundamentally important that CDFs are defined on the entire real line.

(@ethan-alt) I will also say that the C code that powers dbinom and pbinom in R is set to be a double type.

dbinom() returns NaN with a warning if the size parameter is not an integer, and warns and returns 0 if the outcome is not an integer. Arguably this is all just to get around the fact that R isn't big on exceptions (aka conditions in R) and tries its hardest not to "fail" (which I consider a huge drawback to writing robust code in R, but that's another topic). pbinom() behaves like a usual CDF, and thus also produces NaN and prints a warning if the size parameter isn't an integer.

(@ethan-alt) I certainly can understand that you don't want to break existing code,

It'd break existing unit tests and doc, but I doubt changing either the pmfs (which I'm opposed to) or the cdfs (which I'm OK with) would have much effect on user code. Behavior wouldn't change for existing programs that worked; the difference is that, rather than an exception out of domain, you'd get a 0 or 1 depending on which direction you left the support. I don't think I've ever seen a user program that relies on failure out of domain to drive anything. The reason I put it there in the first place is that an out-of-domain argument is usually a coding mistake.

(@ethan-alt) I'm not on the dev team of course, so perhaps my vote counts the least in this discussion.

We try to listen to everyone. The devs get votes on changes like this, but I'm not even sure who counts as a dev any more. Our governing body is talking about extending that to users somehow, but I don't think that's in place yet.

(@ethan-alt). Of course, one can always build their own functions. But that's true of anything.

Our main concern is not adding so many functions that it becomes impossible to find the one you need or to decide among the ones that are there, and that it doesn't become a maintenance nightmare. We have a lot of distributions, so changes like this can be a programming nightmare and a maintenance burden long term. I even wrote a blog post to try to explain maintenance costs to statisticians. Normally we could introduce new functions and deprecate old ones, but I don't think we're going to change the names of our cdf functions at this point. So it'd have to be new function names or modification/overload of existing ones.

bob-carpenter avatar May 11 '21 16:05 bob-carpenter

(@spinkney): It's not "wrong"; it's the "continuous" or "interpolated" version that doesn't use floor.

Is it intended as a pdf rather than pmf, and if so, is it normalized? I've never quite understood these tactics, though there's a really neat paper by Will Grathwohl et al. that leans on this idea to add gradient-like information to discrete distributions that looks very promising for things like count variables: https://arxiv.org/abs/2102.04509

It is a cdf, not a pdf. A Google search turned up this paper; see the pdfs on page 4 of Continuous Counterparts of Poisson and Binomial Distributions and their Properties by Andrii Ilienko.

spinkney avatar May 11 '21 16:05 spinkney

I agree whether PMFs should be extended along the real line is a matter of some debate (although it does mention in the article you posted that the Radon-Nikodym derivative is typically extended to the real line--so that even the measure-theoretic definition might include all real numbers),

This is not how Radon-Nikodym derivatives work.

A measure defined over the integers has no well-defined meaning for elements of a sigma-algebra defined over a real line. In particular such a measure is absolutely continuous with respect to the canonical counting measure, and it is the Radon-Nikodym derivative with respect to that counting measure that defines a unique probability mass function.

In order to construct some notion of a probability density function one first needs to map the integers into a real line, push the distribution forward along that map, and then take the Radon-Nikodym derivative of that pushforward distribution relative to the Lebesgue measure on that line. The critical part, however, is that there is no unique map from the integers into a real line and hence no unique "extension" of a probability distribution defined over the integers to a probability distribution defined over a real line.

This all goes back to what has been said now multiple times — the “continuous” versions being requested are not uniquely defined by the discrete probabilities distributions. They are one of an infinite number of functions that are consistent with a discrete probability distribution.

Sloppier statistical pedagogy often ignores these distinctions, but they matter. One of the most beautiful features of the Stan modeling language is its rigid type system, which enforces these careful mathematical considerations.

but it is not a debate for CDFs. CDFs are the only thing that discrete and continuous random variables have in common. It's fundamentally important that CDFs are defined on the entire real line. I will also say that the C code that powers dbinom and pbinom in R is set to be a double type. Nevertheless, I understand the discontinuity (no pun intended), and I'm not saying that what is currently done in Stan is "wrong," just that it "could be more accurate."

Again, incorrect.

A cumulative distribution function is defined for a probability distribution defined on any ordered, one-dimensional space mapping the ambient space on which the probability distribution is defined to the real unit interval, X -> [0, 1]. In particular the cumulative distribution function defined over the integers maps the integers to the real unit interval, not real numbers.

All definitions of the cumulative distribution function require using the particular structure of the ambient space. For example, consider the definition in terms of interval probabilities: for any x \in X define the interval I_x = [x_min, x] and the cumulative distribution function as Pi(x) = Prob[I_x]. The construction of the interval I_x depends on the particular ordering of the ambient space. Likewise, in the sloppier probabilistic notation Pi(x) = Prob[X <= x] the comparison operator is defined on the particular ambient space.

This structure, however, is not preserved under general pushforwards and hence the structure of a discrete cumulative distribution function cannot be uniquely compared to the structure of a real cumulative distribution function.

Even under restrictions, ambiguity remains. If we restrict consideration only to integer-to-real embeddings that preserve ordering and distance, there is still excess freedom in how to map discrete distributions into real distributions, and hence in how to relate discrete cumulative distribution functions to continuous cumulative distribution functions.

The practical consequence of this is that there is an infinite number of ways to "interpolate" both probability mass functions and discrete cumulative distribution functions into real equivalents. For the Poisson distribution, for example, Gamma(x + 1, lambda) / Gamma(x + 1) and Gamma(floor(x + 1), lambda) / Gamma(floor(x + 1)) are just two of the infinitely many possible choices.
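To make that concrete, here is a pure-Python sketch comparing the two interpolations (gamma_q below is a hand-rolled series implementation for illustration, not a Stan or GSL function): they agree at integer x but differ between integers.

```python
from math import exp, floor, gamma

def gamma_q(s, x, terms=500):
    """Upper regularized incomplete gamma Q(s, x) = Gamma(s, x) / Gamma(s),
    via the power series for the lower incomplete gamma function:
    gamma(s, x) = x^s e^{-x} sum_{n>=0} x^n / (s (s+1) ... (s+n))."""
    if x <= 0:
        return 1.0
    term = 1.0 / s
    total = term
    for n in range(1, terms):
        term *= x / (s + n)
        total += term
    return 1.0 - (x ** s) * exp(-x) * total / gamma(s)

lam = 5.0
# At integer x the two interpolations agree: both give the Poisson CDF.
at_integer_smooth = gamma_q(2 + 1, lam)
at_integer_floor = gamma_q(floor(2 + 1), lam)

# Between integers they differ: one is smooth, the other a step function.
between_smooth = gamma_q(2.5 + 1, lam)
between_floor = gamma_q(floor(2.5 + 1), lam)
```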

perhaps have a separate function like binomial_cdf_real or something.

Again, this naming presumes that there is a unique real extension of the binomial cumulative distribution function, which there is not. If a user wishes to use the incomplete beta function then awesome: they can use inc_beta, which is defined for real x, in their Stan program, indicating that they have made an explicit choice of extension.

betanalpha avatar May 19 '21 01:05 betanalpha

@betanalpha thank you for providing that detailed explanation! Tbh I didn't quite understand how the original claim could be true, though I encountered the same ambiguity as @bob-carpenter on wikipedia. This helps clear things up for me. I believe the issue of adding this functionality to current CDFs is closed (though anyone can feel free to disagree).

There is a practical question of whether to include real interpolated cdfs as additional functions or not. Though, as @betanalpha points out, it raises the question of which of the infinitely many interpolations we choose. For the Poisson and binomial, a natural choice is letting the gamma function take in reals (see, e.g., cbinomial above, where the c stands for "continuous", though now I'd just say interpolated). But do we really want to add this? I don't know. I do have a helpful functions repo where users can add their interpolated CDFs and others can access them there.

There's an open issue about adding interpolations to Stan (as in https://github.com/stan-dev/design-docs/pull/18). If or when those are added, one could also use those to define the interpolated cdf.

spinkney avatar May 19 '21 11:05 spinkney

@betanalpha you are incorrect.

A CDF is literally defined to be Pr(X <= x). In measure theory, they define a distribution function, which is not the same as a CDF in general. For example, a CDF is defined to be a nondecreasing function F such that F(x) --> 0 as x --> -infinity and F(x) --> 1 as x --> infinity. These boundary conditions make the probability function well defined.

While a distribution function in measure theory generalizes CDFs, practically speaking, they present problems. For example, let mu be the Lebesgue measure. Then mu(-inf, 0) doesn't make sense because mu(-inf, 0) = inf, so that the CDF is a constant function at infinity.

Similarly, a mass function is not defined to be the Radon-Nikodym derivative--that's the density in measure theory. A mass function has a different definition, namely p(x) = P(X = x).

So if you really wanted measure theory to rule Stan, then you should change _cdf to _df (for "distribution function") and _pmf to _rnd (for "Radon-Nikodym derivative").

ethan-alt avatar May 20 '21 13:05 ethan-alt

There is a practical question to include real interpolated cdfs as additional functions or not. Though, as @betanalpha points out, it begs the question of which of the infinite interpolations we chose.

To be clear, all of the "nice" interpolations that were being assumed have been available in Stan for some time under their mathematical names. We have tgamma for the real-valued gamma function, gamma_p and gamma_q for the real-valued regularized incomplete gamma functions, etc. By using these function names one makes the assumed interpolation, and any consequences thereof, explicit.

There might very well be other interpolations worth considering, but that’s a huge research question in its own right. The critical thing about this discussion is to not hide that question behind unstated assumptions!

A CDF is literally defined to be Pr(X <= x). In measure theory, they define a distribution function, which is not the same as a CDF in general. For example, a CDF is defined to be a nondecreasing function F such that F(x) --> 0 as x --> -infinity and F(x) --> 1 as x --> infinity. These boundary conditions make the probability function well defined.

I’ll try this one more time.

Yes, a CDF on a one-dimensional, ordered space is defined as Pr(X <= x), but only for the definition of "<=" appropriate to that ordering. In particular, an ordering of the integers does not imply an ordering of the real numbers, even if one assumes a proper embedding.

No matter how you look at it one cannot uniquely map a distribution over the integers into a continuous distribution on the real line. Discrete ordering doesn’t define a continuous ordering, the dimensions aren’t the same so we can’t construct 1-1 maps, the topologies are fundamentally different so that the Borel sigma-algebras aren’t compatible, there’s infinitely less “information” in a discrete distribution than a continuous distribution, etc, etc.

There is no unique “generalization” of discrete distributions, or any representation thereof including cumulative distribution functions, to continuous distributions.

One can project continuous distributions into discrete distributions, but there will be many continuous distributions that map into the same discrete distribution. The multi-valued nature of comparing integers to real numbers cannot be avoided.

While a distribution function in measure theory generalizes CDFs, practically speaking, they present problems. For example, let mu be the Lebesgue measure. Then mu(-inf, 0) doesn't make sense because mu(-inf, 0) = inf, so that the CDF is a constant function at infinity.

The Lebesgue measure is not a finite measure, and hence not amenable to probabilistic objects like CDFs. This is not a problem with "measure theory" or "probability theory" or "(generalized) distribution functions" but rather with trying to force a non-finite measure where it doesn't belong. Indeed, it's only with formal measure theory that we can avoid the many probability "paradoxes" that arise when trying to force inappropriate measures into places where they don't belong.

Similarly, a mass function is not defined to be the Radon-Nikodym derivative--that's the density in measure theory. A mass function has a different definition, namely p(x) = P(X = x).

Which are exactly the same thing on discrete spaces, where the dominating measure is the counting measure.

In any mathematical theory one can choose various different defining axioms provided that they are consistent, but some will be more useful than others. On a discrete space defining probability mass functions as the probability assigned to atomic sets is sufficient, but that falls apart on other spaces. It’s only with the more formal Radon-Nikodym derivative that we can construct a coherent theory of probability that spans discrete and continuous spaces and even make sense of questions like “does a discrete distribution uniquely define a continuous distribution” and vice-versa.

betanalpha avatar Jun 14 '21 20:06 betanalpha