
`corr` with an option to ignore null values

Open xuJ14 opened this issue 2 years ago • 3 comments

Description

The current `polars.DataFrame.corr` returns nulls when columns contain nulls. Sometimes ignoring nulls is the desired behavior. pandas offers this in `pandas.DataFrame.corr`, which excludes NA/null values.

xuJ14 avatar Sep 12 '23 03:09 xuJ14

Our (newly enforced) policy is that nulls should whenever possible be treated as completely absent by default. The problem with correlation is that you can have a scenario where only one of the two variables is missing.

I think it makes sense to change the default to calculate the correlation only using those rows where both columns have a value. Perhaps a statistician who has a stronger understanding of how the correlation coefficient is used could weigh in on that?

orlp avatar Sep 12 '23 08:09 orlp
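
The proposal above, using only rows where both columns have a value, can be sketched in plain Python. This is a minimal illustration and not the polars implementation: the function name `pairwise_complete_corr` is hypothetical, nulls are modeled as `None`, and returning `None` for fewer than two complete rows or for a constant column is an assumption made for this sketch.

```python
import math
from typing import Optional, Sequence


def pairwise_complete_corr(
    xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]
) -> Optional[float]:
    """Pearson correlation over rows where both values are present.

    Hypothetical helper: None stands in for a null value.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")

    # Keep a row only when neither column is missing in that row.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        # Assumption: too few complete rows means no defined correlation.
        return None

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column has no defined correlation.
        return None
    return cov / math.sqrt(var_x * var_y)


# Each column is missing a value in a different row; only rows 0, 3 and 4
# are complete, so only those three rows enter the calculation.
a = [1.0, 2.0, None, 4.0, 5.0]
b = [2.0, None, 6.0, 8.0, 10.0]
r = pairwise_complete_corr(a, b)
```

Here `b` is exactly twice `a` on the three complete rows, so `r` comes out close to 1.0. Note that the set of rows used depends on the pair of columns, which is the scenario raised above where only one of the two variables is missing.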

I agree with you. The current behavior of `corr` can still be reproduced by checking for nulls first and then calling your new `corr` function, which is simple to do. So a broader solution would always be welcome.

xuJ14 avatar Sep 12 '23 08:09 xuJ14

Hi, I agree with the above. It would be very nice to have the same implementation as pandas, so that NaN values are handled as well (https://github.com/pandas-dev/pandas/blob/d928a5cc222be5968b2f1f8a5f8d02977a8d6c2d/pandas/_libs/algos.pyx#L349 => nancorr).

AdrienDart avatar Jan 26 '24 16:01 AdrienDart