BUG: bar and line plots are not aligned on the x-axis/xticks

Open diegodebrito opened this issue 1 year ago • 11 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    {
        "bars": {
            1: 0.5,
            2: 1.0,
            3: 3.0,
            4: 3.5,
            5: 1.5,
        },
        "pct": {
            1: 4.0,
            2: 2.0,
            3: 2.0,
            4: 2.0,
            5: 8.0,
        },
    }
)

ax = df["bars"].plot(kind="bar")
df["pct"].plot(kind="line", ax=ax)

Issue Description

Bar and line plots are not aligned on the x-axis when plotting with pandas. I saw some related issues, but none of them covered exactly this type of plot.

This is the plot generated from the sample code above:

image

Expected Behavior

The line should also start at index=1, not at index=2 as in the plot above.
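Until this is resolved, one possible workaround is sketched below. It rests on an assumption drawn from the behavior reported here, namely that the bars are drawn at positions 0, 1, 2, ... regardless of the index values. Passing `use_index=False` to the line plot makes the line use positional x values as well:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame(
    {"bars": [0.5, 1.0, 3.0, 3.5, 1.5], "pct": [4.0, 2.0, 2.0, 2.0, 8.0]},
    index=[1, 2, 3, 4, 5],
)

ax = df["bars"].plot(kind="bar")
# use_index=False makes the line use positions 0..n-1 instead of the index values
df["pct"].plot(kind="line", ax=ax, use_index=False)

# x values of the most recently added line
line_x = list(ax.get_lines()[-1].get_xdata())
```

The tick labels still come from the bar plot, so the line is drawn over the bars labelled 1 to 5. This only sidesteps the problem when both Series share the same index in the same order.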

Installed Versions

INSTALLED VERSIONS

commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.11.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.1.4
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.17.2
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

diegodebrito avatar Dec 11 '23 19:12 diegodebrito

Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!

rhshadrach avatar Dec 19 '23 17:12 rhshadrach

I'm new to the codebase but this caught my interest because I've never encountered it, but it would be really annoying if I did. I'm still trying to understand the code but while reading it I tried to get a sense of where to look by plotting a few graphs myself.

  • In the above example, if it's only lines that are plotted, they are plotted on the correct axes until a bar graph gets added. Then the lines all get shifted by 1.

  • Also, if the index runs from 0 to 4, the graph is plotted normally, i.e. with the bars and lines at the correct x positions:

import pandas as pd

df = pd.DataFrame(
    {
        "bars": {
            0: 0.5,
            1: 1.0,
            2: 3.0,
            3: 3.5,
            4: 1.5,
        },
        "pct": {
            0: 4.0,
            1: 2.0,
            2: 2.0,
            3: 2.0,
            4: 8.0,
        },
        "test": {
            0: 8.0,
            1: 1.0,
            2: 2.0,
            3: 2.0,
            4: 10.0,
        },
    }
)

ax1 = df["pct"].plot(kind="line")
ax2 = df["bars"].plot(kind="bar")
ax3 = df["test"].plot(kind="line")

But if I change the index for all three to 3-7, the lines are shifted by 3 instead (starting at 6 instead of 3):

Figure_1

Code to reproduce this to save some editing:

import pandas as pd

df = pd.DataFrame(
    {
        "bars": {
            3: 0.5,
            4: 1.0,
            5: 3.0,
            6: 3.5,
            7: 1.5,
        },
        "pct": {
            3: 4.0,
            4: 2.0,
            5: 2.0,
            6: 2.0,
            7: 8.0,
        },
        "test": {
            3: 8.0,
            4: 1.0,
            5: 2.0,
            6: 2.0,
            7: 10.0,
        },
    }
)

ax1 = df["pct"].plot(kind="line")
ax2 = df["bars"].plot(kind="bar")
ax3 = df["test"].plot(kind="line")

And, consistent with expectations, the lines are shifted by -1 if the index starts at -1 for all three.

sharonwoo avatar Dec 23 '23 01:12 sharonwoo

Opened an incomplete PR (logic only, I need to fix CI) for further discussion - this would fix the issue, but should we fix it in the first place?

Tldr, BarPlot uses self.tick_pos = np.arange(len(data)) to set x, and I've put in some logic that lets it do something similar to LinePlot if the data is a series.
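To illustrate why the shift matches the first index value, here is a small sketch (not the pandas source; `line_shift` is a hypothetical helper). It assumes, as described above, that bars sit at `np.arange(len(data))` while the line is drawn at the index values themselves:

```python
import numpy as np


def line_shift(index_values):
    """Offset between where the line is drawn (the index values) and
    where the bars are drawn (positions 0..n-1)."""
    index_values = np.asarray(index_values)
    bar_positions = np.arange(len(index_values))  # mirrors tick_pos
    return index_values - bar_positions


shift_0 = line_shift([0, 1, 2, 3, 4])    # index 0-4: no shift
shift_1 = line_shift([1, 2, 3, 4, 5])    # original report: shift of 1
shift_3 = line_shift([3, 4, 5, 6, 7])    # index 3-7: shift of 3
shift_neg = line_shift([-1, 0, 1, 2, 3])  # index starting at -1: shift of -1
```

For a consecutive integer index the offset is constant and equals the first index value, which is consistent with the experiments earlier in this thread.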

sharonwoo avatar Dec 24 '23 09:12 sharonwoo

take

sharonwoo avatar Dec 26 '23 00:12 sharonwoo

Thanks for the investigations here. We also get incorrect results if all the Series used for the plots do not have the same index, e.g.

bars = pd.Series({"a": 0.5, "b": 1.0})
pct = pd.Series({"b": 4.0, "a": 2.0})
ax = bars.plot(kind="bar")
pct.plot(kind="line", ax=ax)

It does seem reasonable to me for users to expect the order of the bars is preserved, even for numeric indexes:

pct = pd.Series({1: 4.0, 10: 2.0, 3: 3.0})
pct.plot(kind="bar")

The bars currently appear with x-ticks 1, 10, 3.

It seems difficult to determine appropriate results in general when the data have different indexes. I wonder whether, when plotting with ax provided, we can just detect when the xticks do not agree.
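A rough sketch of what such a check could look like follows. The helper name `xticks_agree` and the comparison by string labels are assumptions for illustration only, not an existing pandas or matplotlib API:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def xticks_agree(ax, labels):
    """Hypothetical check: do the tick labels already on `ax` match `labels`?"""
    existing = [tick.get_text() for tick in ax.get_xticklabels()]
    return existing == [str(label) for label in labels]


fig, ax = plt.subplots()
ax.bar([0, 1], [0.5, 1.0])
# Explicit tick labels, similar in spirit to what a bar plot of "a", "b" shows
ax.set_xticks([0, 1])
ax.set_xticklabels(["a", "b"])

agree_same = xticks_agree(ax, ["a", "b"])     # same labels, same order
agree_swapped = xticks_agree(ax, ["b", "a"])  # same labels, different order
```

A real implementation would have to decide what to do on a mismatch (warn, raise, or realign), which this sketch leaves open.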

rhshadrach avatar Jan 04 '24 04:01 rhshadrach

Thanks for the thoughtful comment, let me investigate further.

I found some related issues raised previously: https://github.com/pandas-dev/pandas/issues/55508, https://github.com/pandas-dev/pandas/issues/50508, https://github.com/pandas-dev/pandas/issues/48806

sharonwoo avatar Jan 04 '24 07:01 sharonwoo

I tried the mismatched-index example above (bars indexed a, b and pct indexed b, a), and Matplotlib draws this chart correctly:

import matplotlib.pyplot as plt
bars = {"a": 1.5, "b": 3.5}
pct = {"b": 4.0, "a": 2.0}

fig, ax = plt.subplots()

ax.bar(bars.keys(), bars.values(), color='blue')
ax.plot(pct.keys(), pct.values(), color='red')

plt.show()

Unknown

However, for the out-of-order series, the PR implementation is similar to what Matplotlib currently does (i.e. bars are out of order):

import matplotlib.pyplot as plt
import pandas as pd

pct = pd.Series({1: 4.0, 10: 2.0, 3: 3.0})
plt.figure()
plt.bar(pct.index, pct.values, color='blue')
plt.show()

Unknown-2

sharonwoo avatar Jan 04 '24 10:01 sharonwoo

I haven't been able to find documentation on this, but as far as I can tell matplotlib's behavior is:

  • If a series of x values contains both numeric and non-numeric, raise.
  • If a series of x values is numeric, use the value to determine location
  • If a series of x values is non-numeric, use the position (0, 1, 2, ...) to determine the location

The third bullet point above extends to when a series of non-numeric values is added to the plot: if there is a symbol not yet seen, it is added as the next bar. If it's already been seen, it is stacked.
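The numeric and non-numeric cases can be checked with a short script like the one below. This is only a sketch; `bar_centers` is a helper defined here for illustration, and the mixed-type case is left out:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def bar_centers(ax):
    """Return the x-coordinate of the center of every bar on the axes."""
    return [round(p.get_x() + p.get_width() / 2, 6) for p in ax.patches]


# Numeric x: the values themselves determine the location.
fig_num, ax_num = plt.subplots()
ax_num.bar([1, 10, 3], [4.0, 2.0, 3.0])
numeric_centers = bar_centers(ax_num)

# Non-numeric x: each unseen label gets the next position,
# and a label that was already seen reuses its position.
fig_cat, ax_cat = plt.subplots()
ax_cat.bar(["a", "b"], [1.5, 3.5])
ax_cat.bar(["b", "c"], [1.0, 2.0])
categorical_centers = bar_centers(ax_cat)
```

Here the numeric bars are centered at 1, 10 and 3, while the labels a, b, c map to positions 0, 1, 2, with the second "b" bar drawn at the same position as the first.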

From the matplotlib docs, I think we may also need to consider "if x has units (e.g. datetime)" as a separate case. I've yet to see how that interacts with the logic above.

It seems to me we can replicate this behavior (MultiIndex entries treated as tuples and hence non-numeric). One thing I'm still wondering is why we don't just offload this logic to matplotlib's default behavior.
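As a sketch of how that rule might be expressed on the pandas side, consider the following. `x_positions` is a hypothetical helper, not part of pandas, and it simplifies the mixed-type case by treating any non-numeric index as positional rather than raising:

```python
import numpy as np
import pandas as pd


def x_positions(index):
    """Hypothetical helper: a numeric index is placed at its values,
    anything else (strings, MultiIndex tuples) at positions 0..n-1."""
    is_multi = isinstance(index, pd.MultiIndex)
    if not is_multi and pd.api.types.is_numeric_dtype(index):
        return np.asarray(index, dtype=float)
    return np.arange(len(index), dtype=float)


numeric_x = x_positions(pd.Index([1, 10, 3]))
string_x = x_positions(pd.Index(["a", "b"]))
multi_x = x_positions(pd.MultiIndex.from_tuples([("a", 1), ("b", 2)]))
```

Datetime-like indexes are not handled here and would need their own branch, as noted above.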

rhshadrach avatar Jan 04 '24 22:01 rhshadrach

Sorry, been a bit busy and under the weather lately (both concurrently sometimes with an ill toddler).

If I understand correctly, the scope of this bug fix has been expanded in order to align the pandas wrapper with what matplotlib does, except where bars are out of order (preserve the out-of-order placement, which would differ from the matplotlib implementation)?

sharonwoo avatar Jan 23 '24 06:01 sharonwoo

> to align the pandas wrapper with what matplotlib does

The matplotlib behavior certainly seems very reasonable, better than the status quo for pandas, and currently has my support. That said, if there are issues with it, we do not necessarily have to align with it.

> except where bars are out of order (preserve the out-of-order placement, which would differ from the matplotlib implementation)?

I don't think so. While a user might want to present 1, 10, 3 in that order for bar charts, it seems more valuable to prefer using numeric dtypes as indications of where on the x-axis to place the bar. This can support e.g. multiple bar charts stacked when they don't have the same exact index. In order for the user to then get 1, 10, 3 in that order, they just need to convert the integers into strings and that will give the desired behavior.
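The conversion mentioned here could look like the following minimal sketch (the variable names are illustrative only):

```python
import pandas as pd

pct = pd.Series({1: 4.0, 10: 2.0, 3: 3.0})

# Convert the integer index to strings so the labels are treated as
# categories and keep their original order
pct_as_labels = pct.copy()
pct_as_labels.index = pct_as_labels.index.astype(str)

labels = list(pct_as_labels.index)
values = list(pct_as_labels.values)
```

Under the proposed behavior, plotting `pct_as_labels` as a bar chart would then show the bars in the order 1, 10, 3, while the integer-indexed `pct` would be placed by numeric value.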

rhshadrach avatar Jan 23 '24 21:01 rhshadrach

Looks like I'll have some time starting from next Monday to take a serious crack at this.

sharonwoo avatar Feb 14 '24 14:02 sharonwoo

As long as the x-axis is not numeric, it works fine.

import pandas as pd

df = pd.DataFrame(
    {
        "bars": [0.5, 1.0, 3.0, 3.5, 1.5],
        "pct": [4.0, 2.0, 2.0, 2.0, 8.0],
        "index": [1, 2, 3, 4, 5],
    }
)

df["index"] = df["index"].astype("object")

ax = df.plot(kind="bar", x="index", y="bars")
df.plot(kind="line", ax=ax, x="index", y="pct")

Screenshot 2024-08-03 at 5 18 29 PM

maddytae avatar Aug 03 '24 21:08 maddytae

take

sdalmia11 avatar Aug 06 '24 21:08 sdalmia11