
BUG: Pearson correlation outside expected range -1 to 1

madrjor02-bh opened this issue 1 year ago

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

values = [{'col1': 30.0, 'col2': 116.80000305175781},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': 30.100000381469727, 'col2': 116.8000030517578},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None}]

data = pd.DataFrame(values)
data.corr(method='pearson')

Issue Description

In the code snippet above I'm trying to calculate the correlation between a pair of columns. However, when using the Pearson correlation method for this particular example, the reported correlation is outside the expected -1 to 1 range.

Expected Behavior

The output of the Pearson correlation method should be within the -1 to 1 range.

Installed Versions

INSTALLED VERSIONS

commit           : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python           : 3.9.19.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.10.223-211.872.amzn2.x86_64
Version          : #1 SMP Mon Jul 29 19:52:29 UTC 2024
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : C.UTF-8
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas                 : 2.2.2
numpy                  : 1.26.4
pytz                   : 2024.1
dateutil               : 2.9.0.post0
setuptools             : 69.5.1
pip                    : 24.0
Cython                 : None
pytest                 : None
hypothesis             : None
sphinx                 : None
blosc                  : None
feather                : None
xlsxwriter             : None
lxml.etree             : None
html5lib               : None
pymysql                : None
psycopg2               : None
jinja2                 : 3.1.4
IPython                : 8.18.1
pandas_datareader      : None
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : None
bottleneck             : None
dataframe-api-compat   : None
fastparquet            : None
fsspec                 : None
gcsfs                  : None
matplotlib             : 3.9.2
numba                  : None
numexpr                : None
odfpy                  : None
openpyxl               : 3.1.5
pandas_gbq             : None
pyarrow                : 14.0.1
pyreadstat             : None
python-calamine        : None
pyxlsb                 : None
s3fs                   : None
scipy                  : 1.10.1
sqlalchemy             : 2.0.31
tables                 : None
tabulate               : None
xarray                 : None
xlrd                   : None
zstandard              : None
tzdata                 : 2024.1
qtpy                   : None
pyqt5                  : None

madrjor02-bh avatar Aug 29 '24 19:08 madrjor02-bh

The bug does seem to exist in df.corr

Reason for the Unexpected Behavior

The reason is that Welford's method, which is used for the calculation, doesn't perform well when the dataset has extremely small differences or precision issues.
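For illustration, here is a rough single-pass (Welford-style) sketch in plain Python of the kind of running update described above. It is not the actual pandas code; the ordering (col2 as x) is just chosen to show the effect. With only the two valid rows, the running mean of col2 rounds back to its first value, so the (x - mean_x) term comes out larger than it should, and the final ratio can land outside the -1 to 1 range (roughly -1.41 with ordinary float64).

import math

# Only the two non-NA rows survive; the col2 values differ by roughly one float64 ulp
col2 = [116.80000305175781, 116.8000030517578]
col1 = [30.0, 30.100000381469727]

# Welford-style single-pass running moments (a sketch, not the pandas source)
nobs = 0
mean_x = mean_y = 0.0
ssq_x = ssq_y = cov_xy = 0.0
for x, y in zip(col2, col1):
    nobs += 1
    dx = x - mean_x
    dy = y - mean_y
    mean_x += dx / nobs
    mean_y += dy / nobs
    ssq_x += (x - mean_x) * dx
    ssq_y += (y - mean_y) * dy
    cov_xy += (x - mean_x) * dy

print(cov_xy / math.sqrt(ssq_x * ssq_y))  # roughly -1.414: outside [-1, 1]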

Finding Correlation using the standard Pearson Product-Moment Correlation

import pandas as pd
import numpy as np
import math

values = [{'col1': 30.0, 'col2': 116.80000305175781},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None},
 {'col1': 30.100000381469727, 'col2': 116.8000030517578},
 {'col1': None, 'col2': None},
 {'col1': None, 'col2': None}]

data = pd.DataFrame(values)

def corr_coef(X, Y):
    # Standard two-pass Pearson product-moment correlation
    x1 = np.array(X)
    y1 = np.array(Y)
    x_m = x1.mean()
    y_m = y1.mean()
    numer = 0
    v1sq = 0
    v2sq = 0
    for i in range(len(x1)):
        xx = x1[i] - x_m
        yy = y1[i] - y_m
        numer += xx * yy
        v1sq += xx * xx
        v2sq += yy * yy
    return numer / math.sqrt(v1sq * v2sq)

data = data.dropna()
corr_coef(data.iloc[:,0],data.iloc[:,1])

Result:

-0.7071067811865475

We get the same -0.707... result if np.corrcoef() is used.
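For completeness, the equivalent numpy check on the same dropna'd data (the output matches the value above):

np.corrcoef(data.iloc[:, 0], data.iloc[:, 1])
# array([[ 1.        , -0.70710678],
#        [-0.70710678,  1.        ]])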

Though this is a rare case to hit in practice, I believe it should be fixed. I can work on fixing the issue by changing Welford's method to the standard method for df.corr. It might slightly impact the running time. Not sure if that will be acceptable to the team.

RaghavKhemka avatar Sep 04 '24 13:09 RaghavKhemka

Thanks for the report!

It might slightly impact the time for running.

If it is slight, I think the numerical stability would be valued.

rhshadrach avatar Sep 04 '24 21:09 rhshadrach

take

KevsterAmp avatar Sep 26 '24 10:09 KevsterAmp

take

ashlab11 avatar Nov 12 '24 22:11 ashlab11

I'm not sure if a fix has been implemented yet; df.corr returns 1 and -1. But I can take a look, as this thread is almost a year old and has an obvious fix.

eicchen avatar Oct 13 '25 20:10 eicchen

Looks like a band-aid fix was put in to keep values within -1~1, but the underlying formula used is still Welford's method.
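Conceptually, the clamp just bounds the returned ratio, something like this illustration (not the actual pandas code):

def clamp_corr(val):
    # Illustration only: bound a computed coefficient to the valid range
    return max(-1.0, min(1.0, val))

print(clamp_corr(-1.4142135623730951))  # -1.0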

eicchen avatar Oct 13 '25 20:10 eicchen

The sample only contains 2 values, as it's ignoring NAs. If the sample only contains 2 values, the correlation is only $-1$ or $1$, assuming all values are different.

Proof

$$
\begin{align}
\bar X &= \frac{1}{2} [X_1 + X_2] \\
X_1 - \bar X &= \frac{X_1 - X_2}{2} \\
\text{Cov}(X, Y) &= \frac{1}{2} \left[ (X_1 - \bar X)(Y_1 - \bar Y) + (X_2 - \bar X)(Y_2 - \bar Y) \right] \\
&= \frac{1}{4}(X_1 - X_2)(Y_1 - Y_2) \\
\sqrt{\text{Var}(X)\,\text{Var}(Y)} &= \frac{1}{4} |X_1 - X_2|\,|Y_1 - Y_2| \\
\text{Cor}(X, Y) &= \frac{(X_1 - X_2)(Y_1 - Y_2)}{|X_1 - X_2|\,|Y_1 - Y_2|} \\
&= \text{sign}(X_1 - X_2)\,\text{sign}(Y_1 - Y_2)
\end{align}
$$


For me, the result in Pandas is correct.
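As a quick sanity check of the algebra above, one can redo the computation with exact rational arithmetic; the squared correlation comes out as exactly 1, so the correlation itself can only be 1 or -1 (illustrative check using the two valid rows from the example):

from fractions import Fraction

# Exact-arithmetic check of the two valid rows (illustrative)
x = [Fraction("30.0"), Fraction("30.100000381469727")]
y = [Fraction("116.80000305175781"), Fraction("116.8000030517578")]

mx = sum(x) / len(x)
my = sum(y) / len(y)

cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
var_x = sum((xi - mx) ** 2 for xi in x)
var_y = sum((yi - my) ** 2 for yi in y)

print(cov * cov / (var_x * var_y))  # 1, exactly
print(cov < 0)                      # True, so the correlation is -1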

Alvaro-Kothe avatar Oct 21 '25 23:10 Alvaro-Kothe

If you want to be sure, check with a datatype that supports more precision.

from decimal import Decimal, getcontext

def compute_correlation(a, b, precision=50):
    getcontext().prec = precision
    n = Decimal(len(a))
    a_dec = [Decimal(x) for x in a]
    b_dec = [Decimal(x) for x in b]

    sum_a = sum(a_dec)
    sum_b = sum(b_dec)
    sum_ab = sum(ai * bi for ai, bi in zip(a_dec, b_dec))
    sum_a2 = sum(ai * ai for ai in a_dec)
    sum_b2 = sum(bi * bi for bi in b_dec)


    cov_ab = sum_ab - (sum_a * sum_b / n)
    var_a = sum_a2 - (sum_a * sum_a / n)
    var_b = sum_b2 - (sum_b * sum_b / n)
    sd_product = var_a.sqrt() * var_b.sqrt()

    return cov_ab / sd_product

a = ["30.0", "30.100000381469727"]
b = ["116.80000305175781", "116.8000030517578"]
result = compute_correlation(a, b, precision=200)
print(result)
# output: -1

Alvaro-Kothe avatar Oct 21 '25 23:10 Alvaro-Kothe

If you run it through numpy, the corr value is 0.7. It is only 1 or -1 because we are artificially clamping the outputs; they are, in fact, incorrect. Removing the clamp, the value would be -1.4.

eicchen avatar Oct 21 '25 23:10 eicchen

numpy the corr value is 0.7.

If the result in numpy is 0.7, it doesn't mean it's correct...

In [6]: a = np.array([30.0, 30.100000381469727])

In [7]: b = np.array([116.80000305175781, 116.8000030517578])

In [8]: np.corrcoef(a, b)
Out[8]:
array([[ 1.        , -0.70710678],
       [-0.70710678,  1.        ]])

In [9]: a = np.array([1, 2])

In [10]: b = np.array([116.801, 116.8])

In [11]: np.corrcoef(a, b)
Out[11]:
array([[ 1., -1.],
       [-1.,  1.]])

Removing the clamp, the value would be -1.4

Nice, this is something we can improve, but it doesn't mean the result -1 is wrong.

Alvaro-Kothe avatar Oct 21 '25 23:10 Alvaro-Kothe

I understand that, but for this case we are looking at the unrounded values

eicchen avatar Oct 21 '25 23:10 eicchen

but for this case we are looking at the unrounded values

Can you clarify what you mean by unrounded values?

Alvaro-Kothe avatar Oct 21 '25 23:10 Alvaro-Kothe

It was in reference to the original comment you left tagging this as a Closing Candidate and justifying closing it. I was basically expressing that this is an issue, as the expected value given the unrounded numbers should be 0.7 rather than 1 or -1. I have a fix in mind that shouldn't affect performance as much as just implementing two-pass, so I'll work on getting that committed.

eicchen avatar Oct 23 '25 18:10 eicchen

I was basically expressing that this is an issue as the expected value given the unrounded numbers should be 0.7 rather than 1 or -1.

I still don't understand why it should be 0.7.

Unless there is something wrong with my math in https://github.com/pandas-dev/pandas/issues/59652#issuecomment-3429875638, if $X_1 \neq X_2$ and $Y_1 \neq Y_2$ the pearson correlation should be $-1$ or $1$.

The numpy example in https://github.com/pandas-dev/pandas/issues/59652#issuecomment-3429902473 is incorrect due to limitations in floating point arithmetic. If you use a more precise floating type, you will get the correct result.

import numpy as np

a = np.array([30.0, 30.100000381469727], dtype=np.float128)
b = np.array([116.80000305175781, 116.8000030517578], dtype=np.float128)

print(np.corrcoef(a, b))
# [[ 1. -1.]
#  [-1.  1.]]

I have a fix in mind that shouldn't affect performance as much as just implementing two-pass, so I'll work on getting that committed

Considering my arguments above, do you still think the result shouldn't be $-1$? If so, why?

Alvaro-Kothe avatar Oct 23 '25 19:10 Alvaro-Kothe

Ah, I see what you meant. Yeah I agree. However, I still think we should change the backend corr calculation. My thought was to run Welford until log(vx*vy) exceeds 14 or -14, then rerun using two-pass.

This should have minimal effect on most corr runs, and only affect ones that would otherwise have failed. What do you think?
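Roughly, the idea would look something like this sketch (illustrative only; the trigger condition, threshold, and the corr_hybrid name are hypothetical, not pandas internals):

import math
import numpy as np

def corr_hybrid(x, y, log10_threshold=14):
    # Hypothetical sketch of the proposal above: Welford first, two-pass fallback
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    # Single pass, Welford-style running moments
    nobs = 0
    mean_x = mean_y = 0.0
    ssq_x = ssq_y = cov_xy = 0.0
    for vx, vy in zip(x, y):
        nobs += 1
        dx = vx - mean_x
        dy = vy - mean_y
        mean_x += dx / nobs
        mean_y += dy / nobs
        ssq_x += (vx - mean_x) * dx
        ssq_y += (vy - mean_y) * dy
        cov_xy += (vx - mean_x) * dy

    # Heuristic trigger: rerun with the standard two-pass formula on centered data
    prod = ssq_x * ssq_y
    if prod > 0 and abs(math.log10(prod)) > log10_threshold:
        xc = x - x.mean()
        yc = y - y.mean()
        denom = math.sqrt((xc * xc).sum() * (yc * yc).sum())
        return float((xc * yc).sum() / denom) if denom != 0 else float("nan")

    return cov_xy / math.sqrt(prod) if prod != 0 else float("nan")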

eicchen avatar Oct 23 '25 19:10 eicchen

My thought was to run Welford until log(vx*vy) exceeds 14 or -14, then rerun using two-pass.

As far as I know, Welford should be more stable than two-pass, because it mitigates the accumulation of round-off errors. Additionally, it should be more efficient, because it only iterates over the array once.

But this is my assumption. I can be wrong :)

log(vx*vy) exceeds 14 or -14

For me, this looks arbitrary. You need to look for cancellation and absorption problems during the addition/subtraction, and there is no guarantee that the same problem won't show up in two-pass.

Alvaro-Kothe avatar Oct 23 '25 20:10 Alvaro-Kothe

Well, it's worth a shot. From my understanding, two-pass is more stable than Welford for extremely large numbers.

I picked 14 sigfigs because that's when float64 starts losing precision, which is likely where the stability issues we are currently seeing come from.

(Mentioned in the PR, but I'll summarize it here for future reference: the co-moment equality from Welford doesn't hold in the current implementation despite being mathematically sound. I believe it is a data type issue.)

eicchen avatar Oct 23 '25 20:10 eicchen

Well, it's worth a shot. From my understanding, two-pass is more stable than Welford for extremely large numbers.

Nice that you see room for improvement.

I picked 14 sigfigs because that's when float64 starts losing precision

It loses precision when the scale between the numbers differs too much, and you will have absorption problems.

>>> 1e16 + 1 == 1e16
True
>>> 1e16 + 1e14 == 1e16
False

Alvaro-Kothe avatar Oct 23 '25 21:10 Alvaro-Kothe