
The handling of Null and NaN in `polars.spearman_rank_corr` is wrong in both cases.

Open taozuoqiao opened this issue 3 years ago • 5 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of polars.

Issue Description

`polars.pearson_corr` drops Null values and propagates NaN, but `polars.spearman_rank_corr` currently gives wrong results in both cases.

Reproducible Example

import polars as pl
import numpy as np

df1 = pl.DataFrame({'a': [None,1,2],'b':[None,2,1]})
df2 = pl.DataFrame({'a': [np.nan,1,2],'b':[np.nan,2,1]})

print('spearman_rank_corr:')
print(df1.select(pl.spearman_rank_corr('a','b')))
print(df2.select(pl.spearman_rank_corr('a','b')))

print('\n')

print('pearson_corr:')
print(df1.select(pl.pearson_corr('a','b')))
print(df2.select(pl.pearson_corr('a','b')))

""" 
The output is:

spearman_rank_corr:
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.5 │
└─────┘
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.5 │
└─────┘


pearson_corr:
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ -1.0 │
└──────┘
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ NaN │
└─────┘
"""

Expected Behavior

The code snippet

print('spearman_rank_corr:')
print(df1.select(pl.spearman_rank_corr('a','b')))
print(df2.select(pl.spearman_rank_corr('a','b')))

should print

spearman_rank_corr:
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ -1.0 │
└──────┘
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ NaN │
└─────┘

Installed Versions

```
---Version info---
Polars: 0.14.12
Index type: UInt32
Platform: Linux-2.6.32-220.el6.x86_64-x86_64-with-glibc2.10
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.0
numpy: 1.22.4
fsspec:
connectorx:
xlsx2csv:
```

taozuoqiao avatar Sep 22 '22 08:09 taozuoqiao

Could you explain the logic that should lead to those results? Why should one return NaN and the other -1?

ritchie46 avatar Sep 22 '22 10:09 ritchie46

> Could you explain the logic that should lead to those results? Why should one return NaN and the other -1?

The pearson correlation and spearman rank correlation of [1,2] and [2,1] are both -1:

  1. pandas.DataFrame.corr excludes NA/null values when computing the Pearson/Spearman correlation. The descriptive statistics methods in Pandas are all written to account for missing data; see https://pandas.pydata.org/docs/user_guide/missing_data.html#calculations-with-missing-data
  2. I am new to Polars, but it seems that polars.pearson_corr excludes null values from the calculation, whereas with NaN values it just returns NaN. This is also the default behavior for many statistics methods (sum, mean) in Polars; see https://pola-rs.github.io/polars-book/user-guide/howcani/missing_data.html.

taozuoqiao avatar Sep 22 '22 11:09 taozuoqiao

> I am new to Polars, but it seems that polars.pearson_corr excludes null values from the calculation, whereas with NaN values it just returns NaN. This is also the default behavior for many statistics methods (sum, mean) in Polars; see https://pola-rs.github.io/polars-book/user-guide/howcani/missing_data.html.

> The pearson correlation and spearman rank correlation of [1,2] and [2,1] are both -1:

Right, I understand now. It seems we must drop the nulls before computing the rank values. For NaN, I agree that we should poison/propagate.

Yes, that is because NaN is not missing data, and we try to follow the floating point spec.
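
For readers following along, here is a minimal pure-Python sketch of the semantics discussed above: drop nulls before ranking, and let NaN poison the result. It is not the Polars implementation. The helper names (`average_ranks`, `pearson`, `spearman`) are made up for illustration, and dropping nulls pairwise (a row is removed when either column is null) is an assumption, since the thread does not spell that out.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation; NaN when it is undefined."""
    n = len(x)
    if n < 2:
        return math.nan
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return math.nan
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: nulls (None) dropped, NaN propagated."""
    # Assumption: a row is dropped when either side is null.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    # NaN is a value, not missing data, so it poisons the result.
    if any(math.isnan(a) or math.isnan(b) for a, b in pairs):
        return math.nan
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return pearson(average_ranks(xs), average_ranks(ys))


print(spearman([None, 1, 2], [None, 2, 1]))                  # -1.0
print(spearman([float("nan"), 1, 2], [float("nan"), 2, 1]))  # nan
```

On the two frames from the reproducible example this gives -1.0 and NaN, matching the expected behavior stated in the issue.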

ritchie46 avatar Sep 22 '22 11:09 ritchie46

Thanks for the explanations. The handling of Null and NaN in polars.Expr.rank is also confusing:

print(df1.select(pl.all().rank()))
print(df2.select(pl.all().rank()))

"""
the output is:

shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f32 ┆ f32 │
╞═════╪═════╡
│ 1.0 ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.0 ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.0 ┆ 2.0 │
└─────┴─────┘
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f32 ┆ f32 │
╞═════╪═════╡
│ 3.0 ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1.0 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.0 ┆ 1.0 │
└─────┴─────┘
"""

The rank method in Pandas has an na_option parameter to handle NA/null values; Polars may need a similar option.

taozuoqiao avatar Sep 22 '22 14:09 taozuoqiao

It seems that Null is treated as the smallest number and NaN as the biggest.
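
That ordering may also explain where the 0.5 comes from. The sketch below is pure Python, not Polars code: `rank_with_missing` is a hypothetical helper that mimics the observed ordering (null ranked lowest, NaN ranked highest), and feeding its ranks into a plain Pearson correlation reproduces the reported value. This is one plausible reading of the output and is not confirmed anywhere in this thread.

```python
import math


def rank_with_missing(values):
    """Ordinal 1-based ranks with None sorted first and NaN sorted last.

    Ties are not averaged; the examples in this thread contain none.
    """
    def sort_key(v):
        if v is None:
            return (0, 0.0)  # null behaves like the smallest value
        if isinstance(v, float) and math.isnan(v):
            return (2, 0.0)  # NaN behaves like the largest value
        return (1, v)

    order = sorted(range(len(values)), key=lambda i: sort_key(values[i]))
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


nan = float("nan")

# Same ranks as the rank() output shown earlier in this comment thread.
print(rank_with_missing([None, 1, 2]), rank_with_missing([None, 2, 1]))
print(rank_with_missing([nan, 1, 2]), rank_with_missing([nan, 2, 1]))

# Correlating those ranks gives 0.5 in both cases.
print(pearson(rank_with_missing([None, 1, 2]), rank_with_missing([None, 2, 1])))
print(pearson(rank_with_missing([nan, 1, 2]), rank_with_missing([nan, 2, 1])))
```

If that reading is right, the missing values are ranked as if they were ordinary values and take part in the correlation, rather than being dropped (null) or propagated (NaN).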

taozuoqiao avatar Sep 22 '22 14:09 taozuoqiao