pandas
BUG: bar and line plots are not aligned on the x-axis/xticks
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(
    {
        "bars": {1: 0.5, 2: 1.0, 3: 3.0, 4: 3.5, 5: 1.5},
        "pct": {1: 4.0, 2: 2.0, 3: 2.0, 4: 2.0, 5: 8.0},
    }
)
ax = df["bars"].plot(kind="bar")
df["pct"].plot(kind="line", ax=ax)
Issue Description
Bar and line plots are not aligned on the x-axis when plotting with pandas. I saw some somewhat related issues, but none covered exactly this type of plot.
This is the plot generated from the sample code above:
Expected Behavior
The line should also start at index=1, not at index=2 as in the plot above.
Installed Versions
INSTALLED VERSIONS
commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.11.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.1.4
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.17.2
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!
I'm new to the codebase, but this caught my interest: I've never run into it, and it would be really annoying if I did. I'm still working through the code, but while reading it I plotted a few graphs myself to get a sense of where to look.
- In the above example, if only lines are plotted, they land on the correct x positions. Once a bar graph is added, the lines all get shifted by 1.
- Also, if the index runs from 0 to 4, the graph is plotted normally, i.e. with the bars and lines aligned:
import pandas as pd
df = pd.DataFrame(
    {
        "bars": {0: 0.5, 1: 1.0, 2: 3.0, 3: 3.5, 4: 1.5},
        "pct": {0: 4.0, 1: 2.0, 2: 2.0, 3: 2.0, 4: 8.0},
        "test": {0: 8.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 10.0},
    }
)
ax1 = df["pct"].plot(kind="line")
ax2 = df["bars"].plot(kind="bar")
ax3 = df["test"].plot(kind="line")
But if I change the index for all 3 to 3-7, the lines are shifted by 3 instead (starting at 6 instead of 3):
Code to reproduce this to save some editing:
import pandas as pd
df = pd.DataFrame(
    {
        "bars": {3: 0.5, 4: 1.0, 5: 3.0, 6: 3.5, 7: 1.5},
        "pct": {3: 4.0, 4: 2.0, 5: 2.0, 6: 2.0, 7: 8.0},
        "test": {3: 8.0, 4: 1.0, 5: 2.0, 6: 2.0, 7: 10.0},
    }
)
ax1 = df["pct"].plot(kind="line")
ax2 = df["bars"].plot(kind="bar")
ax3 = df["test"].plot(kind="line")
And, consistent with expectations, we get a shift of -1 for the lines if the index starts at -1.
Opened an incomplete PR (logic only, I need to fix CI) for further discussion - this would fix the issue, but should we fix it in the first place?
TL;DR: BarPlot uses self.tick_pos = np.arange(len(data)) to set x, and I've put in some logic that lets it do something similar to LinePlot if the data is a Series.
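To make the mechanism concrete, here is a minimal pure-Python sketch of why the shift equals the first index value. It assumes a unit-spaced integer index, models the bar positions with the same 0..n-1 rule as np.arange(len(data)), and assumes the line is drawn at the index values themselves. The helper names are hypothetical and are not part of pandas.

```python
def bar_positions(index):
    """x positions for bars under the np.arange(len(data)) rule: 0, 1, 2, ..."""
    return list(range(len(index)))


def line_positions(index):
    """x positions for a line over a numeric index (assumed): the index values."""
    return list(index)


def apparent_shift(index):
    """How many tick slots the line appears offset from the bars."""
    return line_positions(index)[0] - bar_positions(index)[0]


# Indexes starting at 0, 1, 3 and -1, each with five unit-spaced entries.
shifts = {start: apparent_shift(range(start, start + 5)) for start in (0, 1, 3, -1)}
print(shifts)  # {0: 0, 1: 1, 3: 3, -1: -1}
```

The result matches the observations in this thread: no shift for 0-4, a shift of 1 for 1-5, 3 for 3-7, and -1 when the index starts at -1.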
take
Thanks for the investigations here. We also get incorrect results if the Series used for the plots do not all share the same index, e.g.
bars = pd.Series({"a": 0.5, "b": 1.0})
pct = pd.Series({"b": 4.0, "a": 2.0})
ax = bars.plot(kind="bar")
pct.plot(kind="line", ax=ax)
It does seem reasonable to me for users to expect the order of the bars is preserved, even for numeric indexes:
pct = pd.Series({1: 4.0, 10: 2.0, 3: 3.0})
pct.plot(kind="bar")
The bars currently appear with x-ticks 1, 10, 3.
It seems difficult to determine appropriate results in general when the data have different indexes. I wonder if, when plotting with ax provided, we can just detect when the xticks do not agree.
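One way such a check could look, as a rough sketch only: compare where each label already sits on the axes with where the new plot would draw it. The function name and the dict-based inputs are hypothetical (this is not pandas API), and the last example assumes a line over a string index is drawn at positions 0, 1, 2, ...

```python
def misaligned_labels(existing_ticks, new_points):
    """Return labels that the new plot would draw at a different x position.

    existing_ticks: dict of tick label -> x position already on the axes.
    new_points: dict of label -> x position the new plot would use.
    Labels missing from the existing ticks are reported as well.
    """
    return [
        label
        for label, x in new_points.items()
        if existing_ticks.get(label) != x
    ]


# Bars over index 1..5 sit at positions 0..4; the line would use x = 1..5.
bar_ticks = {str(label): pos for pos, label in enumerate(range(1, 6))}
line_points = {str(label): label for label in range(1, 6)}
shifted = misaligned_labels(bar_ticks, line_points)

# With a 0..4 index the positions coincide, so nothing is reported.
zero_ticks = {str(label): pos for pos, label in enumerate(range(5))}
zero_points = {str(label): label for label in range(5)}
aligned = misaligned_labels(zero_ticks, zero_points)

# Same labels in a different order: bars a, b versus line b, a.
reordered = misaligned_labels({"a": 0, "b": 1}, {"b": 0, "a": 1})
```

A check along these lines could back a warning or an error; what to do once a mismatch is detected is a separate design question.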
Thanks for the thoughtful comment, let me investigate further.
I found some related issues raised previously: https://github.com/pandas-dev/pandas/issues/55508, https://github.com/pandas-dev/pandas/issues/50508, https://github.com/pandas-dev/pandas/issues/48806
I tried the mismatched-index example (labels "a" and "b" in different orders) directly in Matplotlib, and it draws the chart correctly:
import matplotlib.pyplot as plt
bars = {"a": 1.5, "b": 3.5}
pct = {"b": 4.0, "a": 2.0}
fig, ax = plt.subplots()
ax.bar(bars.keys(), bars.values(), color='blue')
ax.plot(pct.keys(), pct.values(), color='red')
plt.show()
However, for the out-of-order Series, the PR implementation is similar to what Matplotlib currently does (i.e. bars are out of order):
import matplotlib.pyplot as plt
import pandas as pd
pct = pd.Series({1: 4.0, 10: 2.0, 3: 3.0})
plt.figure()
plt.bar(pct.index, pct.values, color='blue')
plt.show()
I haven't been able to find documentation on this, but as far as I can tell matplotlib's behavior is:
- If a series of x values contains both numeric and non-numeric, raise.
- If a series of x values is numeric, use the value to determine location
- If a series of x values is non-numeric, use the position (0, 1, 2, ...) to determine the location
The third bullet point above extends to when a series of non-numeric values is added to the plot: if there is a symbol not yet seen, it is added as the next bar. If it's already been seen, it is stacked.
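The rules above can be sketched as a small toy model. This is only a model of the behavior as described in this comment, not matplotlib's actual implementation: the class name is hypothetical, and the exception type for mixed input is a choice made for this sketch.

```python
from numbers import Number


class XAxisModel:
    """Toy model of the placement rules described above (not matplotlib code)."""

    def __init__(self):
        # Non-numeric label -> position, assigned in first-seen order.
        self._seen = {}

    def positions(self, xs):
        xs = list(xs)
        numeric = [isinstance(x, Number) for x in xs]
        if any(numeric) and not all(numeric):
            # Rule 1: mixing numeric and non-numeric values is an error.
            raise TypeError("cannot mix numeric and non-numeric x values")
        if all(numeric):
            # Rule 2: numeric values are used directly as locations.
            return [float(x) for x in xs]
        # Rule 3: non-numeric values get the next free position when first
        # seen; a value seen before reuses its existing position.
        for x in xs:
            self._seen.setdefault(x, len(self._seen))
        return [self._seen[x] for x in xs]


axis = XAxisModel()
print(axis.positions([1, 10, 3]))   # [1.0, 10.0, 3.0]
print(axis.positions(["a", "b"]))   # [0, 1]
print(axis.positions(["b", "c"]))   # [1, 2]
print(XAxisModel().positions(["1", "10", "3"]))  # [0, 1, 2]
```

In this model, the integers 1, 10, 3 are placed by value, while the strings "1", "10", "3" keep the order in which they were given.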
From the matplotlib docs, I think we may also need to consider "if x has units (e.g. datetime)" as a separate case. I've yet to see how that interacts with the logic above.
It seems to me we can replicate this behavior (MultiIndex entries treated as tuples and hence non-numeric). One thing I'm still wondering is why we don't just offload this logic to matplotlib's default behavior.
Sorry, been a bit busy and under the weather lately (both concurrently sometimes with an ill toddler).
If I understand correctly, the scope of this bug fix has been expanded in order to align the pandas wrapper with what matplotlib does, except where bars are out of order (preserve out of order, which would differ from the matplotlib implementation)?
> to align the pandas wrapper with what matplotlib does
The matplotlib behavior certainly seems very reasonable, better than the status quo for pandas, and currently has my support. That said, if there are issues with it, we do not necessarily have to align with it.
> except where bars are out of order (preserve out of order, which would differ from the matplotlib implementation)?
I don't think so. While a user might want to present 1, 10, 3 in that order for bar charts, it seems more valuable to prefer using numeric dtypes as indications of where on the x-axis to place the bar. This can support e.g. multiple bar charts stacked when they don't have the same exact index. In order for the user to then get 1, 10, 3 in that order, they just need to convert the integers into strings and that will give the desired behavior.
Looks like I'll have some time starting from next Monday to take a serious crack at this.
As long as the x-axis is not numeric, it works fine.
import pandas as pd
df = pd.DataFrame(
    {"bars": [0.5, 1.0, 3.0, 3.5, 1.5], "pct": [4.0, 2.0, 2.0, 2.0, 8.0], "index": [1, 2, 3, 4, 5]}
)
df["index"] = df["index"].astype("object")
ax = df.plot(kind="bar", x="index", y="bars")
df.plot(kind="line", ax=ax, x="index", y="pct")
take