partial_corr documentation / assertion wrong
The following minimum code example straight from documentation
https://pingouin-stats.org/build/html/generated/pingouin.partial_corr.html#pingouin.partial_corr Example 1
import pingouin as pg
df = pg.read_dataset('partial_corr')
pg.partial_corr(data=df, x='x', y='y', covar='cv1').round(3)
n r CI95% p-val
pearson 30 0.568 [0.25, 0.77] 0.001
returns:
AssertionError: columns are not in dataframe.
on line 843 from correlation.py:
assert all([c in data for c in col]), "columns are not in dataframe."
which kind of makes sense, because just before in line 842 we have:
col = _flatten_list([x, y, covar, x_covar, y_covar])
and since we don't supply x_covar y_covar when calling the function in the example, they default to None in function header, and None in not in the DataFrame by default:
In[664]: None in df Out[664]: False
Is this expected behaviour or something wrong with my Pandas? I'm using pandas=2.1.2,
partial_corr() assertion checks seem wobbly in general, e.g.:
assert x != covar, "x and covar must be independent"
assert y != covar, "y and covar must be independent"
These come where we yet not know whether covar is a list or a string. If it's a list, truth value is ambiguous.
I'd suggest following small changes: In 835:
assert x != y, "x and y must be independent"
if isinstance(covar, list):
assert x not in covar, "x and covar must be independent"
assert y not in covar, "y and covar must be independent"
else:
assert x != covar, "x and covar must be independent"
assert y != covar, "y and covar must be independent"
And then in 842:
# Check that columns exist
col = _flatten_list([x, y, covar, x_covar, y_covar])
col = [i for i in col if i is not None] # <-- add this
assert all([c in data for c in col]), "columns are not in dataframe."
without trimming the col list from None values, the assertion is triggered each time we don't have x_covar and y_covar - which we should not have at the same time, anyway.
Hope this helps somebody.
Also, note that in partial_corr() you drop NA's from your data, while in pcorr() you don't, which at best throws an error / some NA's in the result, and at worst is leading to different results in a situation which should lead to the same results.
Hi @mkeds
- The AssertionError occurs in Python 3.12 and has already been fixed in https://github.com/raphaelvallat/pingouin/pull/370, however I did not have the time to release a new version of Pingouin yet.
- The proposed update to the assertion check is a good idea. I'll implement it now.
- Feel free to open a PR to drop nan values in pingouin.pcorr as well, ideally with a minimal example showing that the output is different between the two functions.
Thanks, Raphael