
Why does autocorrelation_plot() generate plots where auto-correlation decreases proportionally with lag?

Open dokteurwho opened this issue 8 years ago • 4 comments

Problem description

I observe that autocorrelation_plot() generates plots where the auto-correlation values decrease proportionally with lag.

This can be verified visually in the documentation example (https://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-autocorrelation), where the auto-correlation decreases linearly with lag.

Looking into the autocorrelation_plot() implementation, the reason is the r(h) closure, where ((data[: n - h] - mean) * (data[h:] - mean)).sum() has only n - h terms but is divided by n, so the result shrinks proportionally as h grows.

I would like to know the reason for this implementation choice, which is not equivalent to calling autocorr() with different lag values.

I suspect the reason is to limit the weight of auto-correlation terms that are calculated from only a few values.
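To make the difference concrete, here is a minimal sketch, not the actual pandas source. The helper name `biased_acf` is hypothetical; it follows the r(h) expression quoted in this thread and compares it with `Series.autocorr()` on a pure sine wave whose period divides the lag exactly.

```python
import numpy as np
import pandas as pd


def biased_acf(values, h):
    """Lag-h autocorrelation divided by n, following the r(h) closure quoted in this thread.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(values, dtype=float)
    n = len(data)
    mean = data.mean()
    c0 = ((data - mean) ** 2).sum() / n  # lag-0 autocovariance
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0


# Sine wave with period 50 over 500 samples (10 full periods).
t = np.arange(500)
series = pd.Series(np.sin(2 * np.pi * t / 50))

lag = 250  # a whole number of periods, so the shifted signal overlaps itself exactly
biased = biased_acf(series.to_numpy(), lag)
pearson = series.autocorr(lag=lag)  # Pearson correlation of the overlapping parts

print(f"divide by n:      {biased:.3f}")
print(f"Series.autocorr:  {pearson:.3f}")
```

On this input the n-divided value comes out at about 0.5, which is (n - h) / n, while `autocorr()` gives about 1.0, because only the overlapping halves enter the Pearson correlation.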

Thanks for your feedback.

dokteurwho avatar Jul 27 '17 14:07 dokteurwho

cc @TomAugspurger

gfyoung avatar Jul 27 '17 16:07 gfyoung

Hi guys. I published this Jupyter notebook in which I suggest improvements to autocorrelation_plot along the lines of what is suggested in this issue. It would be great to hear your feedback before going through with a PR.

shadiakiki1986 avatar Mar 19 '19 01:03 shadiakiki1986

From my view the current implementation is

  1. Wrong, since it doesn't meet the definition of autocorrelation.
  2. Inconsistent with other implementations.

This may also cause a lot of confusion since it is totally unexpected. My plea is to calculate the "real" mean, dividing by the number of terms (n - h) instead of by n. Not sure if a ±1 has to be added.

def r(h):
    # current: return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / (n - h) / c0

Maybe it's a bug, or maybe it was implemented that way to dampen the noisy tail of the plot.
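A rough sketch of the two divisors side by side, for anyone who wants to see the tail effect. The function `acf` and its `divide_by_terms` flag are hypothetical names, and the seed and series length are arbitrary choices. Note that even with (n - h) this still uses the global mean and c0, so it is not identical to autocorr().

```python
import numpy as np


def acf(values, h, divide_by_terms=False):
    """Lag-h autocorrelation with either divisor (hypothetical helper).

    divide_by_terms=False divides by n, as in the commented-out line above;
    divide_by_terms=True divides by n - h, as in the proposed change.
    """
    data = np.asarray(values, dtype=float)
    n = len(data)
    mean = data.mean()
    c0 = ((data - mean) ** 2).sum() / n
    divisor = (n - h) if divide_by_terms else n
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / divisor / c0


rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
x = rng.normal(size=200)        # white noise: no real autocorrelation at any lag > 0
n = len(x)

tail_lags = np.arange(150, 200)  # lags where only 50 down to 1 terms remain
by_n = np.array([acf(x, h) for h in tail_lags])
by_terms = np.array([acf(x, h, divide_by_terms=True) for h in tail_lags])

print("largest |r(h)| in the tail, divide by n:    ", np.abs(by_n).max())
print("largest |r(h)| in the tail, divide by n - h:", np.abs(by_terms).max())
```

The two versions differ by exactly the factor n / (n - h), so dividing by (n - h) removes the damping but also amplifies whatever noise is left in the tail, where each estimate rests on very few terms.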

I was confused by this. ;) Cheers

ghost avatar May 04 '22 05:05 ghost

I too was confused by this. Specifically, why my random walks consistently showed strong negative correlation for about two thirds of the chart (from lag-length about 40% of n to the end).

I made a comparison chart against a manually calculated correlation, run over 100 random walks of length 1,000: (image)

Where the random walk is just x = rng.normal(size=n).cumsum()

I'm new to all this so I'm not sure if my version is right, but I'm pretty sure either the Pandas version is wrong, or the definition of 'autocorrelation' is a bit weird.
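One possible explanation for the negative stretch, sketched under the assumption that the plot uses the n-divided r(h) quoted earlier in the thread: because deviations from the sample mean sum to zero, the r(h) values over lags 1 to n - 1 always add up to -0.5. A random walk has large positive values at short lags, so the longer lags have to go negative to compensate. The helper name `acf_all_lags` and the seed are made up for this example.

```python
import numpy as np


def acf_all_lags(values):
    """n-divided autocorrelation for every lag 0..n-1 (hypothetical helper)."""
    data = np.asarray(values, dtype=float)
    d = data - data.mean()
    # Index len(d) - 1 of the "full" output is lag 0; later entries are lags 1..n-1.
    sums = np.correlate(d, d, mode="full")[len(d) - 1:]
    # Dividing numerator and c0 by the same n cancels, so normalise by the lag-0 sum.
    return sums / sums[0]


rng = np.random.default_rng(0)  # arbitrary seed
n = 1000
x = rng.normal(size=n).cumsum()  # random walk, as described above

r = acf_all_lags(x)
print("r(0):", r[0])
print("sum of r(h) for h >= 1:", r[1:].sum())
```

The sum is -0.5 for any input series, not just random walks, which follows from expanding the square of the (zero) sum of deviations. It only holds when every lag shares the same divisor, so it does not apply to the (n - h) variant or to autocorr().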

davidgilbertson avatar Aug 18 '22 01:08 davidgilbertson