plotly.py icon indicating copy to clipboard operation
plotly.py copied to clipboard

convert plotly-express to use narwhals

Open FBruzzesi opened this issue 1 year ago • 8 comments

Description

This PR migrates plotly-express module logic from pandas to narwhals. In this way, pandas is not a required dependency for plotly-express (or at least for its entirety - e.g. trendlines will still require pandas for now) and users coming with polars, pyarrow or other eager dataframes supported in narwhals do not need to depend on pandas in the first place.

Related issue #4749

Code PR

  • [x] I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
  • [x] I have added tests (if submitting a new feature or correcting a bug) or modified existing tests.
  • [ ] For a new feature, I have added documentation examples in an existing or new tutorial notebook (please see the doc checklist as well).
  • [ ] I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
  • [ ] For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

Out of scope for the PR

  • No documentation change has been done so far
  • Adapt plotly data accordingly
  • timeseries/trendlines

cc: @MarcoGorelli @LiamConnors

FBruzzesi avatar Oct 09 '24 11:10 FBruzzesi

@FBruzzesi / @MarcoGorelli, have y'all done any performance profiling to see which traces see a noticable performance bump using large e.g. Polars data frames when using the native APIs as compared to pre-native APIs (pre-Narwhals) in Plotly?

If so, would be curious the results -- if not, it's something I'd like to have our team dig into.

Similarly, do you have expectations for which trace types or trace operations will see the biggest differences from native support for high-performance dataframe APIs?

ndrezn avatar Oct 09 '24 15:10 ndrezn

Hey @ndrezn, I expect the best improvement in performances to show whenever a path enters get_groups_and_orders and process_dataframe_hierarchy functionalities. However I didn't do any benchmarking, just focused on converting the codebase and make sure that all tests are passing for pandas (with different backends), polars and pyarrow.

I would be curious to see the results as well. Do you have any benchmarking routine or code to share? Or is this something ad-hoc?

FBruzzesi avatar Oct 09 '24 16:10 FBruzzesi

Hey @LiamConnors, just as a heads up, all the unit tests are now passing. I will start focusing on making sure that the documentation can build as well.

The remaining CI that fail are:

  • dev_build -> plotlyjs_dev_build (which seems some naming matching raises an error?)
  • build -> python_38_orca (I am a bit clueless on this one, I can use some help)

FBruzzesi avatar Oct 16 '24 12:10 FBruzzesi

thanks @FBruzzesi - I'll let @LiamConnors speak to the dev_build issue, but we're retiring support for orca entirely, so that shouldn't be a blocker. cheers - @gvwilson

gvwilson avatar Oct 16 '24 12:10 gvwilson

Thanks @FBruzzesi! I have removed the orca tests here https://github.com/plotly/plotly.py/pull/4803 The dev_build issue is also unrelated to this PR; we are removing a couple of traces in Plotly.js and those changes are on the Plotly.js master branch now, but we haven't done all the required updates in the Plotly.py repo. https://github.com/plotly/plotly.py/issues/4792

LiamConnors avatar Oct 16 '24 13:10 LiamConnors

Thanks @FBruzzesi! I have removed the orca tests here #4803 The dev_build issue is also unrelated to this PR; we are removing a couple of traces in Plotly.js and those changes are on the Plotly.js master branch now, but we haven't done all the required updates in the Plotly.py repo. #4792

Thanks! That's great to know! A few less things to worry about.

Following up on the docs, I was not able to reproduce the error in CI, i.e. the following runs without issues on a python 3.9 environment :

cd doc
uv pip install -r requirements.txt
make

🤔

FBruzzesi avatar Oct 16 '24 13:10 FBruzzesi

Thanks @FBruzzesi! I have removed the orca tests here #4803 The dev_build issue is also unrelated to this PR; we are removing a couple of traces in Plotly.js and those changes are on the Plotly.js master branch now, but we haven't done all the required updates in the Plotly.py repo. #4792

Thanks! That's great to know! A few less things to worry about.

Following up on the docs, I was not able to reproduce the error in CI, i.e. the following runs without issues on a python 3.9 environment :

cd doc
uv pip install -r requirements.txt
make

🤔

Ah interesting. If you go to doc/build/ipynb does it look like the files generated and is there a hexbin-mapbox file there? Looks like the failure relates to generating from that file: hexbin-mapbox.md. I managed to replicate it locally. This is what the traceback looks like

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Desktop/plotly.py/packages/python/plotly/plotly/express/_core.py in ?(args, constructor)
   1491             except Exception:
   1492                 msg = f"Unsupported type: {type(args['data_frame'])}"
-> 1493                 raise NotImplementedError(msg)
   1494     else:

/opt/anaconda3/envs/plotly-docs/lib/python3.9/site-packages/pandas/core/frame.py in ?(self, data, index, columns, dtype, copy)
    755         else:
    756             if index is None or columns is None:
--> 757                 raise ValueError("DataFrame constructor not properly called!")
    758 

ValueError: DataFrame constructor not properly called!

During handling of the above exception, another exception occurred:

NotImplementedError                       Traceback (most recent call last)
Cell In[3], line 7
      4 px.set_mapbox_access_token(open(".mapbox_token").read())
      5 df = px.data.carshare()
----> 7 fig = ff.create_hexbin_mapbox(
      8     data_frame=df, lat="centroid_lat", lon="centroid_lon",
      9     nx_hexagon=10, opacity=0.5, labels={"color": "Point Count"},
     10     min_count=1, color_continuous_scale="Viridis",
     11     show_original_data=True,
     12     original_data_marker=dict(size=4, opacity=0.6, color="deeppink")
     13 )
     14 fig.show()

File ~/Desktop/plotly.py/packages/python/plotly/plotly/figure_factory/_hexbin_mapbox.py:469, in create_hexbin_mapbox(data_frame, lat, lon, color, nx_hexagon, agg_func, animation_frame, color_discrete_sequence, color_discrete_map, labels, color_continuous_scale, range_color, color_continuous_midpoint, opacity, zoom, center, mapbox_style, title, template, width, height, min_count, show_original_data, original_data_marker)
    445 fig = choropleth_mapbox(
    446     data_frame=agg_data_frame.to_native(),
    447     geojson=geojson,
   (...)
    465     height=height,
    466 )
    468 if show_original_data:
--> 469     original_fig = scatter_mapbox(
    470         data_frame=(
    471             args["data_frame"].sort(
    472                 by=args["animation_frame"], descending=False, nulls_last=True
    473             )
    474             if args["animation_frame"] is not None
    475             else args["data_frame"]
    476         ),
    477         lat=args["lat"],
    478         lon=args["lon"],
    479         animation_frame=args["animation_frame"],
    480     )
    481     original_fig.data[0].hoverinfo = "skip"
    482     original_fig.data[0].hovertemplate = None

File ~/Desktop/plotly.py/packages/python/plotly/plotly/express/_chart_types.py:1394, in scatter_mapbox(data_frame, lat, lon, color, text, hover_name, hover_data, custom_data, size, animation_frame, animation_group, category_orders, labels, color_discrete_sequence, color_discrete_map, color_continuous_scale, range_color, color_continuous_midpoint, opacity, size_max, zoom, center, mapbox_style, title, template, width, height)
   1361 def scatter_mapbox(
   1362     data_frame=None,
   1363     lat=None,
   (...)
   1388     height=None,
   1389 ) -> go.Figure:
   1390     """
   1391     In a Mapbox scatter plot, each row of `data_frame` is represented by a
   1392     symbol mark on a Mapbox map.
   1393     """
-> 1394     return make_figure(args=locals(), constructor=go.Scattermapbox)

File ~/Desktop/plotly.py/packages/python/plotly/plotly/express/_core.py:2347, in make_figure(args, constructor, trace_patch, layout_patch)
   2344 layout_patch = layout_patch or {}
   2345 apply_default_cascade(args)
-> 2347 args = build_dataframe(args, constructor)
   2348 if constructor in [go.Treemap, go.Sunburst, go.Icicle] and args["path"] is not None:
   2349     args = process_dataframe_hierarchy(args)

File ~/Desktop/plotly.py/packages/python/plotly/plotly/express/_core.py:1493, in build_dataframe(args, constructor)
   1491         except Exception:
   1492             msg = f"Unsupported type: {type(args['data_frame'])}"
-> 1493             raise NotImplementedError(msg)
   1494 else:
   1495     columns = None  # no data_frame

NotImplementedError: Unsupported type: <class 'narwhals.stable.v1.DataFrame'>

These are the two examples that cause the error https://plotly.com/python/hexbin-mapbox/#display-the-underlying-data https://plotly.com/python/hexbin-mapbox/#hexbin-with-animation

LiamConnors avatar Oct 16 '24 14:10 LiamConnors

CI is green 🚀 I am tagging as ready for review 🤞🏼

FBruzzesi avatar Oct 16 '24 15:10 FBruzzesi

@FBruzzesi Please note that I merged master into this branch including this PR: https://github.com/plotly/plotly.py/pull/4470 for working with typed arrays. You may consider adding tests using narwhals. Thank you!

archmoj avatar Oct 21 '24 16:10 archmoj

Hey @FBruzzesi 👋 I've started running some initial performance tests on these changes, and had some interesting early findings. Here's the performance I've written for reference: https://gist.github.com/ndrezn/3162c1f01b72b5d168cff30eabfdb692

Screenshot 2024-10-21 at 3 54 40 PM

First, there's a massive performance improvement when using Polars dataframes compared to previously 🥳. Amazing!

But, I'm noticing two other things:

  1. Pandas is still always faster than Polars in Plotly.py, which is surprising to me, but plausibly not related to this PR -- i.e. it's still better to just use Pandas. I would expect near parity or slightly faster Polars now that we've eliminated the type conversion to Pandas.
  2. Following the changes in this PR, Pandas performance is actually somewhat slower than before. It's not major, but I was expecting that Pandas performance would be constant.

Do you know what's going on with those results or have any initial guesses as to why Pandas performance may be degraded?

ndrezn avatar Oct 21 '24 19:10 ndrezn

Thanks @ndrezn for doing this analysis!

I agree that the overhead is higher than desirable - we'll do some profiling, find the source, and address it, I'm confident that this is solvable

MarcoGorelli avatar Oct 21 '24 21:10 MarcoGorelli

Update: I was including the dataset generating in my timings 🤦 . Thanks for the catch @emilykl. Here's the updated metrics: Screenshot 2024-10-21 at 6 09 44 PM

  1. As originally expected, Polars is faster than Pandas. Wayyyyyy faster. Holy crap!
  2. Pandas is still slower after the addition of Narwhals, still a surprise.

ndrezn avatar Oct 21 '24 21:10 ndrezn

Hey @ndrezn thanks for taking the time to dive into this.

Some overhead for pandas is to be expected as we are adding a layer on top of it, however this seems larger than expected. I will do some profiling in the rest of the week (starting from tomorrow) and trying to understand what is the main cause 👀

FBruzzesi avatar Oct 22 '24 09:10 FBruzzesi

@ndrezn with the latest commit, I am able to get pandas performances on the same ballpark of master on my local machine. Could you check if you are able to replicate that when you have the time?

FBruzzesi avatar Oct 22 '24 15:10 FBruzzesi

Update on performance from some work on some branches:

timed on a kaggle notebook:

master: scatter_polars,0.586423233000005 bar_polars,1.5854483000000528 scatter_pandas,0.5920868089999658 bar_pandas,1.620950485000094

plotly-with-narwhals (with Narwhals 1.11.0, and https://github.com/FBruzzesi/plotly.py/pull/1, right at the bottom of the notebook): scatter_polars,0.20539162399995803 bar_polars,1.159081075999893 scatter_pandas,0.5411779630001092 bar_pandas,1.5852239850000842

I'm going to work on cleaning this all up now, but in summary, the main sources of overhead were:

  • pandas making unnecessary copies for some operations (rename, reset_index), which we're now being more careful about. thanks for having highlighted this area of improvement which helped us find these!
  • some pandas-specific paths being missed in BaseFigure._perform_update, meaning that we were repeatedly converting to numpy (and sometimes then making additional copies) unnecessarily

MarcoGorelli avatar Oct 26 '24 19:10 MarcoGorelli

@FBruzzesi @MarcoGorelli Sharing the performance / profiling scripts I have been using --

  • Here is a script for timing plot generation for a 1 million row dataset for various chart types (with thanks to @ndrezn for authoring the original script)

  • Here is a script for getting detailed profiler information using cProfile for individual chart types. I have been using Snakeviz to visualize the results.

emilykl avatar Oct 29 '24 16:10 emilykl

thanks! we've made some improvements to to_py_scalar in the latest release

and, as a bonus check this out: (note: it currently selects the necessary columns, then converts to PyArrow - doing everything duckdb native would require a major refactor of plotly, even if full support were available in Narwhals, but it's something we can discuss doing in 2025)

image

MarcoGorelli avatar Oct 29 '24 18:10 MarcoGorelli

Improved error messages and expanded on in-line comments in the last two commits. I hope it is more clear now, let me know if something needs to be expanded further

FBruzzesi avatar Oct 30 '24 11:10 FBruzzesi

@MarcoGorelli @FBruzzesi Just a heads-up, I ran another performance benchmark today and I'm seeing a significant performance drop for scatter charts compared to Wednesday -- on Wed, Narwhals was ~13% faster than master and now it's showing ~20% slower -- any idea what's up there? Maybe some of the syntax changes for readability have come with a performance hit? I'll dig into it myself but wanted to give you a heads up to see if you can reproduce.

emilykl avatar Nov 01 '24 17:11 emilykl

A few updates:

  • time zone issues resolved (it now displays local time, as plotly ordinarily does for pandas dataframes)
  • we've simplified some syntax
  • regarding timings, I noticed that in https://gist.github.com/emilykl/b4c6528a8653964b9c755c87ea3b2ec5 there's NUM_ROWS=100. At that size I'm seeing a lot of variability between runs, and that it's pretty fast overall anyway. If I try it again with NUM_ROWS=1_000_000 (as per @ndrezn 's original gist), then here's what I see (from having timed it on 4-core Kaggle notebooks):

(this shows a "best of 3", with NUM_ROWS=1_000_000)

image

timing and visualisation notebooks for reference

  • pandas: roughly the same, overhead is always less than 5% except for ecdf plot
  • Polars: noticeable difference, it even gets 2x faster at times
  • PyArrow: usually no big difference, except for ecdf, funnel, and pie where there's a noticeable drop

I'll investigate ecdf, funnel, and pie, I think they should be resolvable

MarcoGorelli avatar Nov 05 '24 13:11 MarcoGorelli

This is an example of a graph that renders differently each time in polars. Looks like the order for the values of x, color and facet_col change.

import plotly.express as px
df = px.data.tips(return_type="polars")
fig = px.bar(df, x="day", y="total_bill", color="smoker", barmode="group", facet_col="sex",
            )
fig.show()

But for pandas the order is consistent and seems to depend on the order of the data. Would it be possible to make it consistent for polars too. cc @emilykl

LiamConnors avatar Nov 06 '24 14:11 LiamConnors

But for pandas the order is consistent and seems to depend on the order of the data. Would it be possible to make it consistent for polars too. cc @emilykl

yup, addressed in the latest commit 👌

latest timings: https://www.kaggle.com/code/marcogorelli/visualise-timings?scriptVersionId=205986199

MarcoGorelli avatar Nov 08 '24 15:11 MarcoGorelli

Hey @emilykl thanks a ton! We really appreciated the collaborative effort! As mention on a few meetings, it led us to investigate deeper and improve narwhals codebase significantly as well, it has been such a win-win!

Regarding merging, the branch keeps going out of sync with master, and I will need another approval to merge it myself (and hopefully I need to time it correctly 😇) or you can approve and merge whenever you want. There are no other changes on this PR from our side 🚀

FBruzzesi avatar Nov 13 '24 08:11 FBruzzesi

@FBruzzesi @MarcoGorelli Merged! 🚀 🎉 💥

emilykl avatar Nov 13 '24 16:11 emilykl