polars
The handling of Null and NaN in `polars.spearman_rank_corr` is wrong in both cases.
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of polars.
Issue Description
polars.pearson_corr drops Null and propagates NaN, but polars.spearman_rank_corr currently gives wrong results in both cases.
Reproducible Example
import polars as pl
import numpy as np
df1 = pl.DataFrame({'a': [None,1,2],'b':[None,2,1]})
df2 = pl.DataFrame({'a': [np.nan,1,2],'b':[np.nan,2,1]})
print('spearman_rank_corr:')
print(df1.select(pl.spearman_rank_corr('a','b')))
print(df2.select(pl.spearman_rank_corr('a','b')))
print('\n')
print('pearson_corr:')
print(df1.select(pl.pearson_corr('a','b')))
print(df2.select(pl.pearson_corr('a','b')))
"""
The output is:
spearman_rank_corr:
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ f64 │
╞═════╡
│ 0.5 │
└─────┘
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ f64 │
╞═════╡
│ 0.5 │
└─────┘
pearson_corr:
shape: (1, 1)
┌──────┐
│ a │
│ --- │
│ f64 │
╞══════╡
│ -1.0 │
└──────┘
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ f64 │
╞═════╡
│ NaN │
└─────┘
"""
Expected Behavior
The code snippet
print('spearman_rank_corr:')
print(df1.select(pl.spearman_rank_corr('a','b')))
print(df2.select(pl.spearman_rank_corr('a','b')))
should print
spearman_rank_corr:
shape: (1, 1)
┌──────┐
│ a │
│ --- │
│ f64 │
╞══════╡
│ -1.0 │
└──────┘
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ f64 │
╞═════╡
│ NaN │
└─────┘
Installed Versions
Could you explain the logic that should lead to those results? Why should one return NaN and the other -1?
The pearson correlation and spearman rank correlation of [1,2] and [2,1] are both -1:
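To make that claim easy to check, here is a minimal sketch in plain Python (standard library only). The helper names `pearson`, `ordinal_ranks` and `spearman` are made up for this illustration and are not Polars or pandas functions; the ranking helper assumes there are no ties.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ordinal_ranks(values):
    """1-based ranks; assumes no ties (illustration only)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ordinal_ranks(x), ordinal_ranks(y))

a = [1, 2]
b = [2, 1]
print(pearson(a, b))   # approximately -1
print(spearman(a, b))  # approximately -1
```

With only two points in opposite order, the values and their ranks are both perfectly anti-correlated, so both measures come out at -1 (up to floating point rounding).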
pandas.DataFrame.corr excludes NA/null values when computing the Pearson/Spearman correlation. The descriptive statistics methods in pandas are all written to account for missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html#calculations-with-missing-data
I am new to Polars. It seems that polars.pearson_corr excludes null values from the calculation, but returns NaN as soon as the input contains a NaN. This is also the default behavior for many statistics methods (sum, mean) in Polars: https://pola-rs.github.io/polars-book/user-guide/howcani/missing_data.html.
Right, I understand now. It seems we must drop the null before computing the rank values. For NaN I think we should poison/propagate indeed.
Yes, that is because NaN is not missing data and we try to follow the floating point spec.
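The semantics agreed on here (drop rows containing a null before ranking, propagate NaN) could be sketched roughly as follows. This is plain Python, not Polars' actual implementation; `spearman_null_aware` and `average_ranks` are hypothetical names, `None` stands in for a Polars null, and returning NaN for fewer than two pairs or constant input is an assumption of this sketch.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)

def spearman_null_aware(a, b):
    # 1. Drop every pair in which either side is null (None).
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    # 2. NaN is a float value, not missing data: it poisons the result.
    if any(math.isnan(x) or math.isnan(y) for x, y in pairs):
        return math.nan
    if len(pairs) < 2:
        return math.nan  # assumption: too few observations gives NaN
    xs, ys = zip(*pairs)
    # 3. Rank only the remaining values, then take Pearson of the ranks.
    return pearson(average_ranks(xs), average_ranks(ys))

print(spearman_null_aware([None, 1, 2], [None, 2, 1]))
print(spearman_null_aware([math.nan, 1, 2], [math.nan, 2, 1]))
```

The first call should come out at approximately -1 and the second at NaN, matching the expected behavior described in the issue.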
Thanks for the explanations. The handling of Null and NaN in polars.Expr.rank is also confusing:
print(df1.select(pl.all().rank()))
print(df2.select(pl.all().rank()))
"""
the output is:
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f32 ┆ f32 │
╞═════╪═════╡
│ 1.0 ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.0 ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.0 ┆ 2.0 │
└─────┴─────┘
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f32 ┆ f32 │
╞═════╪═════╡
│ 3.0 ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1.0 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.0 ┆ 1.0 │
└─────┴─────┘
"""
The rank method in pandas has the na_option parameter for handling NA/null values; Polars may need a similar option.
It seems that Null is treated as the smallest number and NaN as the biggest.
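As a rough model of that observation (a guess from the outputs above, not Polars' actual code), the ordering can be imitated in plain Python with a sort key that puts `None` before every number and NaN after every number. The names `rank_key`, `observed_rank` and `pearson` are hypothetical, and ties are not handled.

```python
import math

def rank_key(value):
    """Sort key modelling the observed order: null < numbers < NaN."""
    if value is None:
        return (0, 0.0)
    if math.isnan(value):
        return (2, 0.0)
    return (1, value)

def observed_rank(values):
    """1-based ranks under the ordering above (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: rank_key(values[i]))
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

nan = math.nan
print(observed_rank([None, 1, 2]), observed_rank([None, 2, 1]))
# [1.0, 2.0, 3.0] [1.0, 3.0, 2.0]
print(observed_rank([nan, 1, 2]), observed_rank([nan, 2, 1]))
# [3.0, 1.0, 2.0] [3.0, 2.0, 1.0]

# Pearson correlation of these ranks, for comparison with the report
print(pearson(observed_rank([None, 1, 2]), observed_rank([None, 2, 1])))
print(pearson(observed_rank([nan, 1, 2]), observed_rank([nan, 2, 1])))
```

These ranks match the two tables printed above. Taking the Pearson correlation of them gives approximately 0.5 in both cases, which would be consistent with spearman_rank_corr ranking the Null and NaN rows instead of dropping or propagating them.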