math icon indicating copy to clipboard operation
math copied to clipboard

Generate random 2d contingency table with given marginal totals

Open HDembinski opened this issue 2 years ago • 6 comments

Motivation

Sometimes you have pairs of values x, y and you want to know whether y is independent of x or not. People sometimes compute the correlation of x and y, but independence is a stronger statement than lack of correlation. It is possible to have distributions with zero correlation that are nevertheless not independent, e.g. when y = x^2 for x distributed symmetrically around 0.

Some tests, like the popular USP test, compute a test statistic and then use Monte-Carlo to compute the distribution of the test statistic under the null hypothesis (independence). For that, one needs to generate random 2D tables ("histograms") that have the same row and column totals as the original data. The probability distribution for such a table has been worked out in the 80'ies, and Patefield found an elegant non-trivial algorithm to randomly sample from this distribution.

Candidate implementation

I searched the net for implementations in C/C++ of Patefield's algorithm, but only found implementations under GPL license. I want to have the code under a less restrictive license and so I translated the published FORTRAN code to C from scratch and based my implementation on that. The code is currently licensed under BSD 3-clause. I would also like to release it under BSL and contribute it to Boost.Math.

My implementation already evolved from the original and has cool features that the original implementation lacks. The changes are documented in the code. Sorry for all the gotos, it is how the original code was written. In time, I am interested in removing the gotos step by step, but for now it is more important to have a baseline code that works.

In addition to Patefield's algorithm, I also want to add a simple shuffling algorithm. The shuffling algorithm requires additional memory, but is faster than Patefield in certain scenarios.

Implementation discussion

I am not yet intimately familiar with the distribution infrastructure in Boost.Math. On a quick glance, I could not find a distribution that returns more than a single number. For this algorithm, the random variate is a 2d array. We should allow the user to allocate the memory for this array only once and then run the algorithm as many times as they want, which should fill the user-provided object with new values. That does not seem to fit into the current design. How should this be approached?

HDembinski avatar Feb 15 '22 09:02 HDembinski