pandas
Why does autocorrelation_plot() generate plots where auto-correlation decreases proportionally with lag?
Problem description
I observe that autocorrelation_plot() generates plots in which the auto-correlation decreases proportionally with lag.
This can be verified visually in the documentation example (https://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-autocorrelation), where the auto-correlation decreases linearly with lag.
Looking into the autocorrelation_plot() implementation, the reason is the r(h) closure: the sum ((data[: n - h] - mean) * (data[h:] - mean)).sum() runs over only n - h terms but is still divided by n, so it shrinks proportionally with h.
I would like to know the reason for this implementation choice, which is not equivalent to applying autocorr() with different lag values.
I suspect the reason is to limit the weight of auto-correlation terms calculated from few values.
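To make the difference concrete, here is a minimal sketch that compares an estimator written in the style of the r(h) closure with Series.autocorr() on the same data. The helper name plot_style_acf and the noisy sine are illustrative only, and c0 is assumed to be the lag-0 autocovariance (variance with divisor n); this is not a copy of the pandas source.

```python
import numpy as np
import pandas as pd


def plot_style_acf(x, h):
    """Lag-h estimate in the style of the r(h) closure: n - h products, divided by n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    # Assumption: c0 is the lag-0 autocovariance, i.e. the variance with divisor n.
    c0 = ((x - mean) ** 2).sum() / n
    return ((x[: n - h] - mean) * (x[h:] - mean)).sum() / n / c0


# Illustrative data: a noisy sine with a period of 50 samples.
rng = np.random.default_rng(0)
n = 500
t = np.arange(n)
s = pd.Series(np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=n))

# Lags that are whole multiples of the period, so the overlapping parts stay in phase.
for h in (0, 50, 200, 400):
    print(h, round(plot_style_acf(s, h), 3), round(s.autocorr(lag=h), 3))
```

On data like this, Series.autocorr() stays high at every in-phase lag because it correlates only the overlapping parts, while the divisor-n estimate shrinks roughly in proportion to (n - h) / n.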
Thanks for your feedback.
cc @TomAugspurger
Hi guys. I published this Jupyter notebook, in which I suggest improvements to autocorrelation_plot along the lines of what is suggested in this issue. It would be great to hear your feedback before I go ahead with a PR.
From my point of view, the current implementation is
- wrong, since it doesn't meet the definition of the autocorrelation, and
- inconsistent with other implementations.
This may also cause a lot of confusion since it is totally unexpected. My plea is to calculate the "real" mean by dividing by n - h instead of n. I'm not sure whether a ±1 has to be added.
def r(h):
    # return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / (n - h) / c0
Maybe it's a bug, or maybe it was implemented this way to dampen the noisy tail of the plot.
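The "noisy tail" idea can be checked with a small sketch that evaluates both divisors on white noise, where the true autocorrelation is 0 for every h > 0. The function name acf_both_divisors is hypothetical, and c0 is again assumed to be the lag-0 autocovariance with divisor n.

```python
import numpy as np


def acf_both_divisors(x):
    """Return (r_n, r_nh): lag-h estimates for h = 0..n-1 with divisors n and n - h."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    # Assumption: c0 is the lag-0 autocovariance with divisor n.
    c0 = (d ** 2).sum() / n
    # Sum of the n - h lagged products for every lag h.
    sums = np.array([(d[: n - h] * d[h:]).sum() for h in range(n)])
    r_n = sums / n / c0
    r_nh = sums / (n - np.arange(n)) / c0
    return r_n, r_nh


rng = np.random.default_rng(1)
noise = rng.normal(size=300)  # white noise: no real correlation at any lag > 0
r_n, r_nh = acf_both_divisors(noise)

tail = slice(250, 300)  # the last 50 lags, each estimated from at most 50 products
print("largest |r| in tail, divisor n:    ", np.abs(r_n[tail]).max())
print("largest |r| in tail, divisor n - h:", np.abs(r_nh[tail]).max())
```

The two curves differ by exactly the factor (n - h) / n, which is the linear taper seen in the plot. Dividing by n - h removes the taper, but the last lags are then averages over very few products and can even exceed 1 in magnitude.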
I was confused by this. ;) Cheers
I too was confused by this. Specifically, I wondered why my random walks consistently showed strong negative correlation for about two thirds of the chart (from a lag of about 40% of n to the end).
I made a chart comparing it with a manually calculated correlation, run over 100 random walks of length 1,000:

Where the random walk is just x = rng.normal(size=n).cumsum()
I'm new to all this so I'm not sure if my version is right, but I'm pretty sure either the Pandas version is wrong, or the definition of 'autocorrelation' is a bit weird.
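A comparison like the one described above could be reproduced roughly as follows. This is a sketch under assumptions: the "manual" correlation is taken to be the Pearson correlation of the overlapping segments (the original notebook code is not shown), the helper names are made up, and only every 25th lag is evaluated to keep the run short.

```python
import numpy as np


def plot_style_acf(x, h):
    """Divisor-n estimate: sum of n - h lagged products over the full sum of squares."""
    n = len(x)
    d = x - x.mean()
    # Same as (...).sum() / n / c0 when c0 = (d ** 2).sum() / n (assumed definition).
    return (d[: n - h] * d[h:]).sum() / (d ** 2).sum()


def overlap_pearson(x, h):
    """Pearson correlation between x[:n-h] and x[h:] (assumed 'manual' version)."""
    return np.corrcoef(x[: len(x) - h], x[h:])[0, 1]


n, n_walks = 1000, 100
lags = np.arange(0, n, 25)  # a subset of lags, to keep the run short
rng = np.random.default_rng(42)

plot_curves = np.empty((n_walks, len(lags)))
pearson_curves = np.empty((n_walks, len(lags)))
for i in range(n_walks):
    x = rng.normal(size=n).cumsum()  # random walk, as defined in the comment above
    plot_curves[i] = [plot_style_acf(x, h) for h in lags]
    pearson_curves[i] = [overlap_pearson(x, h) for h in lags]

# Average each estimator over all walks and print a few lags side by side.
mean_plot = plot_curves.mean(axis=0)
mean_pearson = pearson_curves.mean(axis=0)
for h, a, b in zip(lags[::8], mean_plot[::8], mean_pearson[::8]):
    print(f"lag {h:4d}: divisor-n {a:+.3f}   overlap Pearson {b:+.3f}")
```

One difference worth keeping in mind when reading the output: the divisor-n version subtracts the mean of the whole series from both segments, while the overlap version centers each segment on its own mean, so for a trending series such as a random walk the two can disagree in sign as well as in size.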