bayeslite

Docs say CORRELATION returns Pearson coeff for numeric columns, but returns R^2

Open versar opened this issue 8 years ago • 4 comments

Documentation is here: http://probcomp.csail.mit.edu/bayesdb/doc/bql.html#bql-expressions

Either functionality or doc should be changed.

versar avatar Oct 19 '16 21:10 versar

I wonder why we first compute r and then yield r^2 instead of just yielding r. I'm sure when I wrote that code I was simply mimicking past behaviour but I have no idea why it was that way.

riastradh-probcomp avatar Oct 19 '16 22:10 riastradh-probcomp

Pearson's coeff is useful b/c it states the direction of the effect, i.e. whether one variable is directly or inversely correlated with the other.
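To make the point about direction concrete, here is a minimal sketch in plain Python (not bayeslite's code; the function name `pearson_r` and the sample data are made up for illustration). It computes r for an increasing and a decreasing relationship and shows that squaring makes the two indistinguishable.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one column rises with xs, the other falls.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = [2.0, 4.1, 5.9, 8.2, 9.8]
decreasing = [-y for y in increasing]

r_up = pearson_r(xs, increasing)
r_down = pearson_r(xs, decreasing)

print(r_up, r_down)            # opposite signs
print(r_up ** 2, r_down ** 2)  # same value: the direction is gone
```

With r you can tell the two cases apart; with r^2 you cannot.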

That said, one reason to yield r^2 is that there is no "r" for categorical variables. A set of columns in the types of datasets we analyze will often include both numeric and categorical variables. If you want to plot both correlations on the same heatmap, then the correlation metric should be on the same scale for both stattypes. If numerical variables' coefficients can be expressed as r but categorical variables' can only be expressed as r^2, maybe that's why we chose to use r^2 for both.
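As an illustration of why a categorical column has no signed "r" but can still get an r^2-like number, here is a sketch of the squared correlation ratio (the fraction of variance in a numeric column explained by group membership, as in one-way ANOVA). This is one possible choice of measure, picked here as an assumption for the example; the function name and data are hypothetical and this is not necessarily what bayeslite computes.

```python
from collections import defaultdict


def correlation_ratio_squared(categories, values):
    """Between-group sum of squares divided by total sum of squares.

    The result lies in [0, 1] and has no sign, because category labels
    have no order that could define a direction.
    """
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


# Hypothetical data: the label strongly predicts the value ...
related = correlation_ratio_squared(["a", "a", "b", "b"], [1.0, 2.0, 10.0, 11.0])
# ... versus a label that carries no information about the value.
unrelated = correlation_ratio_squared(["a", "b", "a", "b"], [1.0, 1.0, 2.0, 2.0])

print(related, unrelated)
```

Because this quantity is on the same 0-to-1 scale as the squared Pearson coefficient, both can share one heatmap color scale.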

My vote is for the Pearson's coeff to be available somehow, in some circumstance, b/c it is what most researchers use and is what we would compare our Bayesian dependencies to. I'm not sure what the best way to set this up is, though, considering variable types are mixed.

Also, there is a judgment call to make on what to use for categorical variables, because there are multiple flavors of correlation coefficients. Once we make the judgment call, maybe it can be documented somewhere so the next person understands the rationale the previous person had. I don't know enough about this to have a strong preference myself.

versar avatar Oct 19 '16 23:10 versar

A simple solution is to implement both CORRELATION and CORRELATION2, which should really be named PEARSON R and PEARSON R2, because "correlation" is quite a general term.

fsaad avatar Nov 05 '16 18:11 fsaad

The main purpose of CORRELATION is to make heat maps which we contrast with the much better-looking DEPENDENCE PROBABILITY heat maps, as a tool for finding plausibly related variables. In that respect, r^2 is more applicable than r, since we need some kind of consistent measure between all pairs of columns, and the orientation of the correlation is of less interest in these heat maps than the magnitude of the correlation.
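A rough sketch of the kind of pairwise matrix such a heat map would be drawn from, restricted to numeric columns for simplicity. The column names, data, and the helper names `pearson_r2` and `r2_matrix` are all hypothetical; this is not bayeslite's implementation.

```python
def pearson_r2(xs, ys):
    """Squared sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return (cov * cov) / (var_x * var_y)


def r2_matrix(columns):
    """Map each ordered pair of column names to its r^2 (nested dicts)."""
    names = list(columns)
    return {
        a: {b: pearson_r2(columns[a], columns[b]) for b in names} for a in names
    }


# Hypothetical table: speed falls as height rises, weight rises with height.
table = {
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 58.0, 69.0, 80.0],
    "speed": [9.0, 7.5, 6.0, 4.0],
}
matrix = r2_matrix(table)

for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in table])
```

The height/speed cell comes out close to 1 even though the relationship is negative, which is the behaviour wanted when the map is only meant to flag strongly related pairs.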

riastradh-probcomp avatar Dec 28 '16 21:12 riastradh-probcomp