[BUG] px.scatter "ols" not producing linear trend line
Describe your context
Please provide us your environment, so we can easily reproduce the issue.
- replace the result of pip list | grep dash below
dash 2.3.1
dash-auth 1.4.1
dash-bootstrap-components 1.0.2
dash-core-components 2.0.0
dash-html-components 2.0.0
dash-table 5.0.0
- if frontend related, tell us your Browser, Version and OS
- OS: macOS Version 12.2.1 (21D62)
- Browser: Chrome
- Version: 100.0.4896.88 (Official Build) (x86_64)
Describe the bug
The "ols" (ordinary least squares) trendline option is not producing a linear regression line. Instead, the result looks closer to a polynomial fit.
The code I'm using to create the graph is as follows:
elif beh_gph == 'ols':
    dfg[date_frmt] = pd.to_datetime(dfg[date_frmt])
    print(dfg)
    fig = px.scatter(dfg, x=date_frmt, y="Episode_Count", color="Target",
                     labels={"Episode_Count": tally + " per Shift",
                             "Target": "Target",
                             "Yr_Mnth": "Date"},
                     trendline="ols", title="Aggregate Behavior Data: " + patient + " - " + today)
    fig.update_xaxes(tickangle=45)
    fig.update_layout(template='plotly_white', hovermode="x unified")
Instead of a linear regression line like the one in the example here - https://plotly.com/python/linear-fits/
I'm getting this:
The x and y values are just date values and floating point numbers, respectively.
The Plotly version is 5.7.0
Expected behavior
Linear regression line.
I updated to the latest dash (2.3.1) and the problem still persists...
@nicolaskruchten I can't quite tell what we're falling back on here but I'm guessing this just means ols trendlines don't support dates? Any hunch how hard this would be?
Possibly, it worked as expected when I ran the same code on a Mac. This is currently being run on Windows 10. I can provide my entire codebase if that might help?
A full reproducible example would be great, yes. Simplified to the minimal case if you can. My hunch about what's happening here: we're not able to use dates as the x data in the curve fitting algorithm, so it's using row indices as the x data during fitting, but somehow the indices used are out of order on Windows. If that's the case, then even on Mac where the indices are ordered correctly, the fit looks right only because your dates happen to be evenly spaced.
This is meant to work even on non-evenly-spaced dates: dates are converted to floats and the regression happens there, then the ~X values are converted back into dates~ the original X values are provided to Plotly.js. I'll take a look sometime this week. The relevant code starts here https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/express/_core.py#L331
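The approach described above (dates converted to floats, regression on the floats, results plotted against the original dates) can be sketched roughly as follows. This is only an illustration using `np.polyfit`, not the code Plotly Express actually runs; the sample dates, the synthetic y values, and the variable names are made up, and subtracting the first timestamp is my own choice to keep the numbers small.

```python
import numpy as np

# Hypothetical, deliberately unevenly spaced dates.
dates = np.array(
    ["2018-01-01", "2018-01-05", "2018-02-01", "2018-04-01"],
    dtype="datetime64[ns]",
)

# Nanoseconds since the epoch as 64-bit integers, then seconds as floats.
x = dates.astype(np.int64) / 10**9

# Shift so the first point sits at 0 (numerical convenience only).
x_rel = x - x[0]

# Synthetic y values that lie exactly on a line in time: 10 + 0.5 per day.
y = 10.0 + 0.5 * (x_rel / 86400.0)

# Ordinary least squares fit of a degree-1 polynomial.
slope, intercept = np.polyfit(x_rel, y, 1)
fitted = slope * x_rel + intercept

# The fitted values would then be plotted against the original `dates`.
for d, f in zip(dates, fitted):
    print(d, round(float(f), 3))
```

Because the fit uses real elapsed time rather than row indices, uneven spacing between dates should not bend the line.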
Hi @nicolaskruchten... was there any update on this?
No update yet, no. Can you provide a fully runnable example including data please?
The standard test case I use for OLS with dates on the X axis is this px.scatter(px.data.stocks(indexed=True, datetimes=True), trendline="ols") and it looks as expected to me.
Also can you confirm the version of Plotly you are using? The latest is 5.7.0
Hello,
I'm not sure if this has been resolved but I am seeing a similar issue when trying to plot data that only has date values on the first of each month (though this is in a Jupyter environment and not Dash so I'm not sure if it is exactly the same case).
I've included the code below. This code does work properly when I run it in Google Colab, and there is a specific difference in the data that I don't understand (more below).
OS: Windows 10 v 20H2 build 19042.1826
- plotly 5.9.0
- jupyterlab 3.3.2
import pandas as pd
import plotly.express as px
import datetime
df = pd.DataFrame( {'Date': ['2018-01-01','2018-02-01','2018-03-01','2018-04-01','2018-05-01','2018-06-01','2018-07-01','2018-08-01','2018-09-01','2018-10-01'],
'Units' : [36.044379,31.036306,34.354977,33.189577,32.906101,35.679296,48.577445,53.967781,51.684226,32.638374]})
df['Date'] = pd.to_datetime(df['Date'])
df['Date_serial'] = [(d - datetime.datetime(1970,1,1)).days for d in df['Date']]
df['Datevalue'] = df['Date'].values.astype(int)
fig = px.scatter(df, x = 'Date', y = 'Units', trendline = 'ols', trendline_color_override = 'red')
fig2 = px.scatter(df, x='Date_serial', y = 'Units', trendline = 'ols', trendline_color_override = 'red')
fig.show()
fig2.show()
This produces two plots. The first uses the datetime column and has a non-linear trendline. For the second, I converted the dates into a serial day count, and the trendline is now linear.

But as I noted, the plots render as expected when I run them in Google Colab. The major difference between the results I get in my environment and what I get in Google Colab is the values in the Datevalue column.
Colab results:
Date | Units | Date_serial | Datevalue
-- | -- | -- | --
2018-01-01 | 36.044379 | 17532 | 1514764800000000000
2018-02-01 | 31.036306 | 17563 | 1517443200000000000
2018-03-01 | 34.354977 | 17591 | 1519862400000000000
2018-04-01 | 33.189577 | 17622 | 1522540800000000000
2018-05-01 | 32.906101 | 17652 | 1525132800000000000
My results:
Date | Units | Date_serial | Datevalue
-- | -- | -- | --
2018-01-01 | 36.044379 | 17532 | 1581514752
2018-02-01 | 31.036306 | 17563 | -153812992
2018-03-01 | 34.354977 | 17591 | -612827136
2018-04-01 | 33.189577 | 17622 | 1946812416
2018-05-01 | 32.906101 | 17652 | 2068578304
I have no idea why the Datevalue numbers are so different, but I imagine the values being out of order is part of (or the entire) issue.
EDIT -- If I convert to int64 instead of int I get the same values as I see in Colab. It looks like this line in the Plotly code linked above converts to int, which I suspect produces the negative values for me:
if x.dtype.type == np.datetime64:
x = x.astype(int) / 10**9 # convert to unix epoch seconds
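The overflow suspected here can be reproduced on any platform by forcing the 32-bit cast explicitly. A minimal sketch, assuming that `int` resolves to a 32-bit integer in the affected Windows setups (which I believe is the case for NumPy versions before 2.0, where the default integer follows C `long`); the variable names are illustrative:

```python
import numpy as np

dates = np.array(
    ["2018-01-01", "2018-02-01", "2018-03-01"],
    dtype="datetime64[ns]",
)

# Nanoseconds since the epoch need 64 bits.
ns64 = dates.astype(np.int64)

# Forcing them into 32 bits silently wraps around instead of raising.
ns32 = ns64.astype(np.int32)

print(ns64)
print(ns32)

# The 64-bit values are strictly increasing; the wrapped ones are not.
print(np.all(np.diff(ns64) > 0))
print(np.all(np.diff(ns32) > 0))
```

The first two wrapped values come out as 1581514752 and -153812992, matching the "My results" table, so the x values handed to the regression are no longer in date order.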
Colab environment: Python 3.7.13 Plotly 5.5.0 Pandas 1.3.5
My current environment: Python 3.10.0 Plotly 5.9.0 Pandas 1.4.2
I also replicated my error on an older environment: Python 3.8.3 Plotly 5.6.0 Pandas 1.0.5
Let me know if any other details would be helpful.
I can replicate this with current plotly 5.11.0, pandas 1.4.1, statsmodels 0.13.5 and Python 3.8.15 on Win10 Enterprise 64bit 22H2.
I ran the example code from @sdelu and got the same plots.
Likewise, the example px.scatter(px.data.stocks(indexed=True, datetimes=True), trendline="ols") looks like this:

By the way, this issue is not limited to "ols", so maybe the issue can be renamed to something along the lines of "broken trendlines with datetime x-axis". Here is how the second "lowess" example from the documentation looks on my system:
import plotly.express as px
df = px.data.stocks(datetimes=True)
fig = px.scatter(df, x="date", y="GOOG", trendline="lowess", trendline_options=dict(frac=0.1))
fig.show()

Hm... it seems that @sdelu was on the right track. I changed this line from
x = x.astype(int) / 10**9 # convert to unix epoch seconds
to
x = x.astype(np.int64) / 10**9 # convert to unix epoch seconds
Now the examples all work just fine.
Fixed in 5.12!
