convert plotly-express to use narwhals
Description
This PR migrates plotly-express module logic from pandas to narwhals. In this way, pandas is not a required dependency for plotly-express (or at least for its entirety - e.g. trendlines will still require pandas for now) and users coming with polars, pyarrow or other eager dataframes supported in narwhals do not need to depend on pandas in the first place.
Related issue #4749
Code PR
- [x] I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of
plotly.graph_objects, my modifications concern thecodegenfiles and not generated files. - [x] I have added tests (if submitting a new feature or correcting a bug) or modified existing tests.
- [ ] For a new feature, I have added documentation examples in an existing or new tutorial notebook (please see the doc checklist as well).
- [ ] I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
- [ ] For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).
Out of scope for the PR
- No documentation change has been done so far
- Adapt plotly data accordingly
- timeseries/trendlines
cc: @MarcoGorelli @LiamConnors
@FBruzzesi / @MarcoGorelli, have y'all done any performance profiling to see which traces see a noticable performance bump using large e.g. Polars data frames when using the native APIs as compared to pre-native APIs (pre-Narwhals) in Plotly?
If so, would be curious the results -- if not, it's something I'd like to have our team dig into.
Similarly, do you have expectations for which trace types or trace operations will see the biggest differences from native support for high-performance dataframe APIs?
Hey @ndrezn, I expect the best improvement in performances to show whenever a path enters get_groups_and_orders and process_dataframe_hierarchy functionalities. However I didn't do any benchmarking, just focused on converting the codebase and make sure that all tests are passing for pandas (with different backends), polars and pyarrow.
I would be curious to see the results as well. Do you have any benchmarking routine or code to share? Or is this something ad-hoc?
Hey @LiamConnors, just as a heads up, all the unit tests are now passing. I will start focusing on making sure that the documentation can build as well.
The remaining CI that fail are:
- dev_build -> plotlyjs_dev_build (which seems some naming matching raises an error?)
- build -> python_38_orca (I am a bit clueless on this one, I can use some help)
thanks @FBruzzesi - I'll let @LiamConnors speak to the dev_build issue, but we're retiring support for orca entirely, so that shouldn't be a blocker. cheers - @gvwilson
Thanks @FBruzzesi! I have removed the orca tests here https://github.com/plotly/plotly.py/pull/4803 The dev_build issue is also unrelated to this PR; we are removing a couple of traces in Plotly.js and those changes are on the Plotly.js master branch now, but we haven't done all the required updates in the Plotly.py repo. https://github.com/plotly/plotly.py/issues/4792
Thanks @FBruzzesi! I have removed the orca tests here #4803 The
dev_buildissue is also unrelated to this PR; we are removing a couple of traces in Plotly.js and those changes are on the Plotly.js master branch now, but we haven't done all the required updates in the Plotly.py repo. #4792
Thanks! That's great to know! A few less things to worry about.
Following up on the docs, I was not able to reproduce the error in CI, i.e. the following runs without issues on a python 3.9 environment :
cd doc
uv pip install -r requirements.txt
make
🤔
Thanks @FBruzzesi! I have removed the orca tests here #4803 The
dev_buildissue is also unrelated to this PR; we are removing a couple of traces in Plotly.js and those changes are on the Plotly.js master branch now, but we haven't done all the required updates in the Plotly.py repo. #4792Thanks! That's great to know! A few less things to worry about.
Following up on the docs, I was not able to reproduce the error in CI, i.e. the following runs without issues on a python 3.9 environment :
cd doc uv pip install -r requirements.txt make🤔
Ah interesting. If you go to doc/build/ipynb does it look like the files generated and is there a hexbin-mapbox file there?
Looks like the failure relates to generating from that file: hexbin-mapbox.md. I managed to replicate it locally. This is what the traceback looks like
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/Desktop/plotly.py/packages/python/plotly/plotly/express/_core.py in ?(args, constructor)
1491 except Exception:
1492 msg = f"Unsupported type: {type(args['data_frame'])}"
-> 1493 raise NotImplementedError(msg)
1494 else:
/opt/anaconda3/envs/plotly-docs/lib/python3.9/site-packages/pandas/core/frame.py in ?(self, data, index, columns, dtype, copy)
755 else:
756 if index is None or columns is None:
--> 757 raise ValueError("DataFrame constructor not properly called!")
758
ValueError: DataFrame constructor not properly called!
During handling of the above exception, another exception occurred:
NotImplementedError Traceback (most recent call last)
Cell In[3], line 7
4 px.set_mapbox_access_token(open(".mapbox_token").read())
5 df = px.data.carshare()
----> 7 fig = ff.create_hexbin_mapbox(
8 data_frame=df, lat="centroid_lat", lon="centroid_lon",
9 nx_hexagon=10, opacity=0.5, labels={"color": "Point Count"},
10 min_count=1, color_continuous_scale="Viridis",
11 show_original_data=True,
12 original_data_marker=dict(size=4, opacity=0.6, color="deeppink")
13 )
14 fig.show()
File ~/Desktop/plotly.py/packages/python/plotly/plotly/figure_factory/_hexbin_mapbox.py:469, in create_hexbin_mapbox(data_frame, lat, lon, color, nx_hexagon, agg_func, animation_frame, color_discrete_sequence, color_discrete_map, labels, color_continuous_scale, range_color, color_continuous_midpoint, opacity, zoom, center, mapbox_style, title, template, width, height, min_count, show_original_data, original_data_marker)
445 fig = choropleth_mapbox(
446 data_frame=agg_data_frame.to_native(),
447 geojson=geojson,
(...)
465 height=height,
466 )
468 if show_original_data:
--> 469 original_fig = scatter_mapbox(
470 data_frame=(
471 args["data_frame"].sort(
472 by=args["animation_frame"], descending=False, nulls_last=True
473 )
474 if args["animation_frame"] is not None
475 else args["data_frame"]
476 ),
477 lat=args["lat"],
478 lon=args["lon"],
479 animation_frame=args["animation_frame"],
480 )
481 original_fig.data[0].hoverinfo = "skip"
482 original_fig.data[0].hovertemplate = None
File ~/Desktop/plotly.py/packages/python/plotly/plotly/express/_chart_types.py:1394, in scatter_mapbox(data_frame, lat, lon, color, text, hover_name, hover_data, custom_data, size, animation_frame, animation_group, category_orders, labels, color_discrete_sequence, color_discrete_map, color_continuous_scale, range_color, color_continuous_midpoint, opacity, size_max, zoom, center, mapbox_style, title, template, width, height)
1361 def scatter_mapbox(
1362 data_frame=None,
1363 lat=None,
(...)
1388 height=None,
1389 ) -> go.Figure:
1390 """
1391 In a Mapbox scatter plot, each row of `data_frame` is represented by a
1392 symbol mark on a Mapbox map.
1393 """
-> 1394 return make_figure(args=locals(), constructor=go.Scattermapbox)
File ~/Desktop/plotly.py/packages/python/plotly/plotly/express/_core.py:2347, in make_figure(args, constructor, trace_patch, layout_patch)
2344 layout_patch = layout_patch or {}
2345 apply_default_cascade(args)
-> 2347 args = build_dataframe(args, constructor)
2348 if constructor in [go.Treemap, go.Sunburst, go.Icicle] and args["path"] is not None:
2349 args = process_dataframe_hierarchy(args)
File ~/Desktop/plotly.py/packages/python/plotly/plotly/express/_core.py:1493, in build_dataframe(args, constructor)
1491 except Exception:
1492 msg = f"Unsupported type: {type(args['data_frame'])}"
-> 1493 raise NotImplementedError(msg)
1494 else:
1495 columns = None # no data_frame
NotImplementedError: Unsupported type: <class 'narwhals.stable.v1.DataFrame'>
These are the two examples that cause the error https://plotly.com/python/hexbin-mapbox/#display-the-underlying-data https://plotly.com/python/hexbin-mapbox/#hexbin-with-animation
CI is green 🚀 I am tagging as ready for review 🤞🏼
@FBruzzesi Please note that I merged master into this branch including this PR: https://github.com/plotly/plotly.py/pull/4470 for working with typed arrays. You may consider adding tests using narwhals. Thank you!
Hey @FBruzzesi 👋 I've started running some initial performance tests on these changes, and had some interesting early findings. Here's the performance I've written for reference: https://gist.github.com/ndrezn/3162c1f01b72b5d168cff30eabfdb692
First, there's a massive performance improvement when using Polars dataframes compared to previously 🥳. Amazing!
But, I'm noticing two other things:
- Pandas is still always faster than Polars in Plotly.py, which is surprising to me, but plausibly not related to this PR -- i.e. it's still better to just use Pandas. I would expect near parity or slightly faster Polars now that we've eliminated the type conversion to Pandas.
- Following the changes in this PR, Pandas performance is actually somewhat slower than before. It's not major, but I was expecting that Pandas performance would be constant.
Do you know what's going on with those results or have any initial guesses as to why Pandas performance may be degraded?
Thanks @ndrezn for doing this analysis!
I agree that the overhead is higher than desirable - we'll do some profiling, find the source, and address it, I'm confident that this is solvable
Update: I was including the dataset generating in my timings 🤦 . Thanks for the catch @emilykl. Here's the updated metrics:
- As originally expected, Polars is faster than Pandas. Wayyyyyy faster. Holy crap!
- Pandas is still slower after the addition of Narwhals, still a surprise.
Hey @ndrezn thanks for taking the time to dive into this.
Some overhead for pandas is to be expected as we are adding a layer on top of it, however this seems larger than expected. I will do some profiling in the rest of the week (starting from tomorrow) and trying to understand what is the main cause 👀
@ndrezn with the latest commit, I am able to get pandas performances on the same ballpark of master on my local machine. Could you check if you are able to replicate that when you have the time?
Update on performance from some work on some branches:
timed on a kaggle notebook:
master: scatter_polars,0.586423233000005 bar_polars,1.5854483000000528 scatter_pandas,0.5920868089999658 bar_pandas,1.620950485000094
plotly-with-narwhals (with Narwhals 1.11.0, and https://github.com/FBruzzesi/plotly.py/pull/1, right at the bottom of the notebook): scatter_polars,0.20539162399995803 bar_polars,1.159081075999893 scatter_pandas,0.5411779630001092 bar_pandas,1.5852239850000842
I'm going to work on cleaning this all up now, but in summary, the main sources of overhead were:
- pandas making unnecessary copies for some operations (rename, reset_index), which we're now being more careful about. thanks for having highlighted this area of improvement which helped us find these!
- some pandas-specific paths being missed in
BaseFigure._perform_update, meaning that we were repeatedly converting to numpy (and sometimes then making additional copies) unnecessarily
@FBruzzesi @MarcoGorelli Sharing the performance / profiling scripts I have been using --
-
Here is a script for timing plot generation for a 1 million row dataset for various chart types (with thanks to @ndrezn for authoring the original script)
-
Here is a script for getting detailed profiler information using
cProfilefor individual chart types. I have been using Snakeviz to visualize the results.
thanks! we've made some improvements to to_py_scalar in the latest release
and, as a bonus check this out: (note: it currently selects the necessary columns, then converts to PyArrow - doing everything duckdb native would require a major refactor of plotly, even if full support were available in Narwhals, but it's something we can discuss doing in 2025)
Improved error messages and expanded on in-line comments in the last two commits. I hope it is more clear now, let me know if something needs to be expanded further
@MarcoGorelli @FBruzzesi Just a heads-up, I ran another performance benchmark today and I'm seeing a significant performance drop for scatter charts compared to Wednesday -- on Wed, Narwhals was ~13% faster than master and now it's showing ~20% slower -- any idea what's up there? Maybe some of the syntax changes for readability have come with a performance hit? I'll dig into it myself but wanted to give you a heads up to see if you can reproduce.
A few updates:
- time zone issues resolved (it now displays local time, as plotly ordinarily does for pandas dataframes)
- we've simplified some syntax
- regarding timings, I noticed that in https://gist.github.com/emilykl/b4c6528a8653964b9c755c87ea3b2ec5 there's
NUM_ROWS=100. At that size I'm seeing a lot of variability between runs, and that it's pretty fast overall anyway. If I try it again withNUM_ROWS=1_000_000(as per @ndrezn 's original gist), then here's what I see (from having timed it on 4-core Kaggle notebooks):
(this shows a "best of 3", with NUM_ROWS=1_000_000)
timing and visualisation notebooks for reference
- pandas: roughly the same, overhead is always less than 5% except for ecdf plot
- Polars: noticeable difference, it even gets 2x faster at times
- PyArrow: usually no big difference, except for ecdf, funnel, and pie where there's a noticeable drop
I'll investigate ecdf, funnel, and pie, I think they should be resolvable
This is an example of a graph that renders differently each time in polars. Looks like the order for the values of x, color and facet_col change.
import plotly.express as px
df = px.data.tips(return_type="polars")
fig = px.bar(df, x="day", y="total_bill", color="smoker", barmode="group", facet_col="sex",
)
fig.show()
But for pandas the order is consistent and seems to depend on the order of the data. Would it be possible to make it consistent for polars too. cc @emilykl
But for pandas the order is consistent and seems to depend on the order of the data. Would it be possible to make it consistent for polars too. cc @emilykl
yup, addressed in the latest commit 👌
latest timings: https://www.kaggle.com/code/marcogorelli/visualise-timings?scriptVersionId=205986199
Hey @emilykl thanks a ton! We really appreciated the collaborative effort! As mention on a few meetings, it led us to investigate deeper and improve narwhals codebase significantly as well, it has been such a win-win!
Regarding merging, the branch keeps going out of sync with master, and I will need another approval to merge it myself (and hopefully I need to time it correctly 😇) or you can approve and merge whenever you want. There are no other changes on this PR from our side 🚀
@FBruzzesi @MarcoGorelli Merged! 🚀 🎉 💥