pingouin Pearson's r from compute_effsize =|= Pearson's r from convert

Pearson's r from compute_effsize =|= Pearson's r from convert_effsize

Open MRTanja opened this issue 1 year ago • 3 comments

I noticed that Pearson's r when calculated with compute_effsize is different from when it is converted from Cohen's d via convert_effsize:

df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
d = pg.compute_effsize(df['A'],df['B'], paired=False, eftype='cohen')
r_calc = pg.compute_effsize(df['A'],df['B'], paired=False, eftype='r')
r_conv = pg.convert_effsize(d,'cohen','r')

-> r_calc =|= r_conv, not even close

Am I missing something in how the formula works or is this a bug?

Sep 17 '22 19:09 MRTanja

Hi @TanjaS91,

Thanks for opening the issue. This is a normal behavior since the conversion from r <--> d is an approximation and not an exact science. Furthermore, instead of using randint (which will not produce normally-distributed arrays), I suggest the following example:

import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

np.random.seed(42)
x, y = np.random.multivariate_normal(mean=[1, 2], cov=[[1, 0.5], [0.5, 1]], size=100).T

pearsonr(x, y)[0] # 0.37611
r = pg.compute_effsize(x, y, paired=True, eftype="r")   # 0.37611
d = pg.compute_effsize(x, y, paired=True, eftype="cohen")  # -1.15651

# d to r
pg.convert_effsize(abs(d), "cohen", "r")  # 0.50058

# r to d
pg.convert_effsize(r, "r", "cohen")  # 0.8118

You can see in this example with a multivariate-normal distribution that the conversion is more accurate, even if still not perfect. Pingouin uses standard formulas for conversion. You will get similar results from this website: https://www.escal.site/

Please also consider that the Pearson correlation coefficient is a measure of effect size in itself, so most of the time there's no need to convert to a Cohen d effect size.

Hope this helps,

Thanks Raphael

Sep 19 '22 14:09 raphaelvallat

Hi Raphael,

thank you for the answer and example! Actually in the meantime I figured out what is happening: When r is used as an effect size, it is the correlation between group membership and value, so e.g. when conducting a t-test between a treatment and a control group, the effect size r should be calculated as the correlation between treatment and control values concatenated on the one hand and dummy variables for group membership on the other hand. This is exactly the value that results from pg.convert_effsize(abs(d), "cohen", "r") and what the conversion formula does, as it was stated by Cohen (1988) in the original book with the conversion formula:

"Each member in the two populations may be characterized by a pair of variables, the "score" on population membership (X) and the value of the other variable (Y), and the r between X and Y can then be found by any of the usual computing formulas for r (Hays, 1973, p. 63lf; Cohen & Cohen,1975, pp. 32-35), or more readily as the point biserial r (Cohen & Cohen,1975, p. 35ff). Investigators may prefer to think of effect sizes for mean differences in terms of r's, rather than d's, and they are related by r = d/(sqrt(d^2 + 4))".

However r = pg.compute_effsize(x, y, paired=True, eftype="r") calculates the "regular" correlation (e.g. the correlation of treatment and control values) which, as I understand, is not the one used as an effect size. Indeed it does not seem to make much sense to use the correlation to measure difference, but on the other hand the point-biserial correlation between value and group membership does indicate the difference.

So if anything, I believe pg.compute_effsize(x, y, paired=True, eftype="r") might need to be changed so that it calculates the point-biserial instead of the correlation between the two sets of values.

Additional info: This is also mentioned by Howell, 2006, in section 10.1 under "r_pb and effect size", by Mathur & van der Weele (2020) "An important, yet infrequently discussed, point is that this conversion was derived for a Pearson correlation computed between a binary exposure X and a continuous outcome Y , also called a “point-biserial” correlation.", in Lakens (2013), in McGrath & Meyer (2006) and discussed here: https://stats.stackexchange.com/questions/390549/converting-between-correlation-and-effect-size-cohens-d https://stats.stackexchange.com/questions/589128/converting-cohens-d-to-pearsons-r-ne-calculated-pearsons-r https://stats.stackexchange.com/questions/295570/why-does-pearsons-r-differ-from-the-converted-value-of-r-from-cohens-d

Sep 19 '22 15:09 MRTanja

@TanjaS91, that's right — thank you for catching this mistake! Going back to my previous example:

xy = np.hstack((x, y))
xy_bool = np.repeat([0, 1], 100)
r_biserial = pearsonr(xy_bool, xy)[0]  # 0.50247
d = pg.compute_effsize(x, y, paired=True, eftype="cohen")  # -1.15651
pg.convert_effsize(abs(d), "cohen", "r", nx=100, ny=100)  # 0.50247
pg.convert_effsize(r_biserial, "r", "cohen", nx=100, ny=100)  # 1.16234

wherescipy.stats.pearsonr(xy_bool, xy) gives the same output as scipy.stats. pointbiserialr(xy_bool, xy). Indeed, from Wikipedia:

The point-biserial correlation is mathematically equivalent to the Pearson (product moment) correlation; that is, if we have one continuously measured variable X and a dichotomous variable Y, rXY = rpb. This can be shown by assigning two distinct numerical values to the dichotomous variable.

Action items:

First, I don't think we should modify the compute_effsize function -- it should continue to calculate a simple Pearson correlation between x and y when calling compute_effsize(.., eftype="r"), which is what most users will expect. We could add however a eftype="r_pointbiserial".

Second, we should definitely add an explicit warning and improve the documentation in the convert_effsize function to make sure that users are made aware that the r in that example refers to a point-biserial correlation. Actually even better, we should probably change the input_type="r" to a more explicit name like input_type="r_pointbiserial".

Sep 22 '22 01:09 raphaelvallat

pingouin pingouin copied to clipboard

Pearson's r from compute_effsize =|= Pearson's r from convert_effsize

pingouin
pingouin copied to clipboard