[Enh]: Better benchmarking routine
We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?
No response
Please describe the purpose of the new feature or describe the problem to solve.
There are some features that require extra attention or are worth benchmarking to understand whether they are worth implementing. For example, I am thinking of #500 and #743.
Suggest a solution if possible.
I checked how other libraries do this, specifically pydantic. They use codspeed, which seems to have a free tier for public repos.
Question is: what to benchmark?! Would TPCH queries in main vs branch be a reasonable test?
If you have tried alternatives, please describe them below.
Currently this is a very manual effort on Kaggle.
Additional information that may help us understand your needs.
No response
Hey, I installed Codspeed for benchmarking on pydata/sparse. I'm transforming benchmarks from asv to codspeed "as we speak".
Would TPCH queries in main vs branch be a reasonable test?
Codspeed tests the opened PR against main, if that was the question. 🤔 It can be set up to block the merge if there is a regression. It sends the report as a comment on the PR. If you need details (or help) let me know. Those are my 15 cents. 😇 edit: Benchmarks for Codspeed run in the CI. TPCH tests with a lot of data, doesn't it?
Hey Dea, thanks for the input. That's what happens when I open issues in a rush. Let me try to clarify some points and ideas.
My understanding is that one can mark some tests for benchmarking, and I am wondering what these tests could be.
TPCH tests with a lot of data, doesn't it?
One option is to run TPCH queries on the subset of the data we have in the tests/data/ folder. That should not take as long as the actual TPCH benchmark.
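To make the "mark some tests for benchmarking" idea concrete, here is a minimal sketch of what such a test could look like with pytest-codspeed (as far as I understand, its `benchmark` fixture just calls the function when running pytest normally and records timings when the suite runs with `--codspeed` in CI). The data and the query are illustrative stand-ins, not actual Narwhals test code:

```python
# Illustrative sketch, not actual Narwhals test code: a TPCH-style
# aggregation on a tiny frame, wired up as a codspeed benchmark.
import pandas as pd

import narwhals as nw


def test_tpch_q1_like(benchmark) -> None:
    # Stand-in for a small table under tests/data/ (e.g. lineitem).
    native = pd.DataFrame(
        {"l_returnflag": ["A", "N", "A", "N"], "l_quantity": [17.0, 36.0, 8.0, 28.0]}
    )

    def run() -> None:
        (
            nw.from_native(native)
            .group_by("l_returnflag")
            .agg(nw.col("l_quantity").sum())
            .to_native()
        )

    # pytest-codspeed's `benchmark` fixture: calls `run` and, under
    # `--codspeed`, records its timing for the PR-vs-main comparison.
    benchmark(run)
```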
Codspeed tests the opened PR against main, if that was the question. 🤔
Yes, that is exactly my point: PR (branch) vs main, so I am getting the process right 👌 I wonder though if it could be trigger-only, as it is definitely overkill for most PRs.
If you need details (or help) let me know.
I have never used it so far; I am happy to give it a spin, but expect to be pinged for help 🙈
I wonder though if it could be trigger-only, as it is definitely overkill for most PRs.
It doesn't say in the documentation. I guess one could do it "somehow" with the CI, but I don't think it's an out-of-the-box option. You can choose whether the report is sent to the PR every time, or only if there is a failure/improvement... but that's all they mention.
but expect to be pinged for help
I truly doubt that you'll ever need help from me 😁 .. but sure!
Commenting to discuss the idea: as plotly is understandably concerned about performance, maybe we could use the script they shared to assess whether we have a performance drop:
- https://github.com/narwhals-dev/narwhals/pull/2363#issuecomment-2789956235
- https://github.com/narwhals-dev/narwhals/pull/2363#issuecomment-2793919575
@FBruzzesi, @MarcoGorelli
Maybe this issue is worth revisiting?
Thanks for reviving this @dangotbanned
Last time we tried, we noticed that running the TPCH queries was taking way too long for how we want to iterate. At the same time, the project is becoming more relevant, so we should think of something to benchmark that is also reasonable to run.
Similarly to what we are going to do for pyspark, we can trigger such a CI job only via dedicated label(s) and/or workflow dispatch (manual trigger).
Are we able to control the size of the datasets used in the TPCH queries? At a glance, it looks like this could influence it:
https://github.com/narwhals-dev/narwhals/blob/52355ea5224bb30521f5935a54d80c43608ecefe/tpch/generate_data.py#L14
My thinking is that any overhead narwhals adds likely wouldn't grow at the same rate (if at all) were we to go from, say, 1_000 -> 1_000_000 rows.
I don't know what the actual sizes would be - but if we could limit rows - then I think we could get shorter tests without any downsides 🤞.
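If it helps, duckdb's tpch extension takes the scale factor as a parameter to `dbgen`, so shrinking the generated data should just be a matter of passing a smaller value. A rough sketch (I haven't checked what generate_data.py actually does, so treat the paths and the chosen scale factor as assumptions):

```python
# Rough sketch of generating down-scaled TPC-H data with duckdb's tpch
# extension; not taken from tpch/generate_data.py.
from pathlib import Path

import duckdb

SCALE_FACTOR = 0.1  # arbitrary choice: small enough to keep CI runs short

Path("data").mkdir(exist_ok=True)
con = duckdb.connect()
con.execute("INSTALL tpch; LOAD tpch;")
con.execute(f"CALL dbgen(sf = {SCALE_FACTOR});")

# Export every generated table (lineitem, orders, ...) to parquet.
for (table,) in con.execute("SHOW TABLES;").fetchall():
    con.execute(f"COPY {table} TO 'data/{table}.parquet' (FORMAT parquet);")
```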
I do think the overhead will grow in relation to the number of columns though. That could be something worth looking into?
I'd be interested in anything that could spend a non-trivial length of time in narwhals, so things like:
- `group_by().over()`
- Long expression chains for `pandas`, `pyarrow`
- Anything to do with broadcasting
- Complex `Selector`s on large `Schema`s
- Heavy usage of `CompliantExprNameNamespace`
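A couple of these could be exercised with very small, targeted benchmarks so that most of the measured time is narwhals overhead rather than backend compute. A sketch (data sizes and expressions are arbitrary, and I'm assuming the same pytest-codspeed `benchmark` fixture as above):

```python
# Illustrative micro-benchmark ideas for the hot spots listed above;
# data sizes and expressions are arbitrary, not from an existing suite.
import pandas as pd

import narwhals as nw

_NATIVE = pd.DataFrame({"key": list(range(100)) * 100, "value": range(10_000)})


def test_over_broadcasting(benchmark) -> None:
    df = nw.from_native(_NATIVE)

    def run() -> None:
        # `.over()` exercises the window/broadcasting machinery.
        df.with_columns(nw.col("value").sum().over("key").alias("total")).to_native()

    benchmark(run)


def test_long_expression_chain(benchmark) -> None:
    df = nw.from_native(_NATIVE)

    def run() -> None:
        # A long chain of cheap expressions, so most of the measured time
        # is narwhals overhead rather than the backend's compute.
        expr = nw.col("value")
        for _ in range(50):
            expr = (expr + 1).alias("value")
        df.with_columns(expr).to_native()

    benchmark(run)
```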
@MarcoGorelli I hope to see something like the profiling in #2479 used here.
Maybe we'd want to consider a tool that's more actively maintained than tuna (https://github.com/nschloe/tuna)?
- [ ] https://github.com/nschloe/tuna/pull/166
Maybe we'd want to consider a tool that's more actively maintained than tuna (https://github.com/nschloe/tuna)?
I think codspeed is a very good option for public projects (it is used by pydantic and astral as well).
In my opinion, everything just comes down to the following:
- What should we test?
  - Are TPCH queries enough, or at least a good starting point?
  - If not, what else should we consider? (see @dangotbanned's comment)
- Would that run in a reasonable amount of time?
  - We could use a special label to trigger the run (e.g. performance and release)
  - In #972 we were using TPCH with 0.25 ratio and it was taking ~40mins to run IIRC. That's a bit much for what I would consider fast iteration - maybe a ratio of 0.1 is more reasonable to start with
In #972 we were using TPCH with 0.25 ratio and it was taking ~40mins to run IIRC. That's a bit much for what I would consider fast iteration - maybe a ratio of 0.1 is more reasonable to start with
IIRC the docs for the duckdb TPCH tests used 0.01 - so we can go lower