perf: Ibis slows down considerably for wider tables

Open binste opened this issue 1 year ago • 8 comments

What happened?

In a web application, I'm creating rather complicated Ibis expressions in the backend. The execution of the final expression takes < 1 second, but creating the expression itself took up to 30 seconds. It took me a while to figure out what was going on, but I think it's because Ibis gets considerably slower for wider tables. I don't need all columns in my app, so I was able to improve performance by pruning away what I don't need right from the beginning.

However, in case you see any improvement potential, below is some example code to demonstrate it. If this is just inherent to Ibis, what do you think about adding a note to the documentation? I only found #6832, which is somewhat related.

Setup

import ibis
import ibis.selectors as s
from ibis import deferred as d_

t = ibis.table(name="base1", schema={f"col{i}": "int64" for i in range(200)})

Drop and relocate

In more complex expressions, having multiple of these drop statements can quickly add up:

t1 = t.drop("col0", "col1", "col2")  # 500ms

Same for relocate:

t1 = t.relocate("col100", before="col2")  # 1,700ms -> 1.7 seconds

select and mutate do not have this issue:

t1 = t.select("col0", "col1", "col2")  # 3ms
t1 = t.mutate(colnew=ibis.literal(1))  # 20ms

Selectors

I've also noticed that Ibis selectors can be much slower than using a pure Python implementation:

t.mutate(s.across(s.c(*[f"col{i}" for i in range(20)]), d_.cast("str")))  # 580ms
t.mutate(**{f"col{i}": d_[f"col{i}"] for i in range(20)})  # 70ms

What version of ibis are you using?

9.0.0

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

binste avatar May 03 '24 08:05 binste

I've benchmarked Ibis 8 vs. 9 on 55 Ibis expressions which are part of a Data Warehouse built using https://github.com/binste/dbt-ibis. I've measured:

  • Execution of Ibis code: The time it takes to execute the Ibis code to get to the final expression
  • Convert Ibis expression to SQL: The time it takes to compile the final Ibis expression to SQL

Here some pseudo code to illustrate

import ibis

t = ibis.table(...)

# --------------------------
# START: Execution of Ibis code
# --------------------------
t = t.group_by(...).select(...)

# --------------------------
# END: Execution of Ibis code
# --------------------------

# --------------------------
# START: Convert Ibis expression to SQL
# --------------------------
ibis.to_sql(t)
# --------------------------
# END: Convert Ibis expression to SQL
# --------------------------

[benchmark charts comparing Ibis 8 vs. 9: execution of Ibis code, and conversion of Ibis expressions to SQL]

The great news is that the compilation to SQL got significantly faster with the move to SQLGlot, which is super nice! :) The execution of Ibis code, on the other hand, got a bit slower, with one expression taking significantly longer at 11 seconds. I've profiled that expression and most time is spent in the following statements:

  • Table.relocate ibis/expr/types/relations.py:4279
    • Code here looks like this: .relocate("valid_to", after="valid_from")
  • Join.wrapper ibis/expr/types/joins.py:188
  • Table.drop ibis/expr/types/relations.py:2329
  • Table.rename ibis/expr/types/relations.py:2140

Hope this helps!

binste avatar May 03 '24 13:05 binste

@binste Thanks for digging into this. Interesting results!

Can you share the query that's now taking 11 seconds with Ibis 9.0?

cpcloud avatar May 03 '24 13:05 cpcloud

Regarding drop and relocate, they're both implemented using a somewhat naive approach:

  • drop: project every column except the input columns
  • relocate: basically a reprojection that loops through every column

I suspect that for drop we can turn it into a special operation so that it scales with the number of dropped columns instead of the total number of columns. A place to start might be:

  1. create a special Drop relational operation
  2. implement the column drop in select merging by looping over the drop list and removing each element of the drop list from the Select fields. Select fields are dicts, so that should get us to O(droplist) instead of O(allcolumns).
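A minimal sketch of step 2 (the function name and dict shapes are illustrative, not actual Ibis internals): because Select fields are dicts, merging a drop into the projection is just a deletion per dropped name, so the cost scales with the drop list rather than the table width.

```python
# Illustrative sketch only -- not Ibis internals. Select fields are modeled
# as a dict mapping column name -> expression, so merging a drop into the
# projection is O(len(dropped)) dict deletions.
def merge_drop_into_select(fields: dict, dropped: list) -> dict:
    merged = dict(fields)  # copy so the original projection is untouched
    for name in dropped:
        del merged[name]   # O(1) per dropped column
    return merged


fields = {f"col{i}": f"<expr {i}>" for i in range(200)}
merged = merge_drop_into_select(fields, ["col0", "col1", "col2"])
# merged has 197 entries; only the three-element drop list was iterated
```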

We may be able to take a similar approach for relocate by using a data structure more optimized for the operation. I think something that has fast ordered inserts (perhaps sortedcontainers has a sorted map container?) might be a good place to start.
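To make the naive relocate behavior concrete, here is a minimal pure-Python model of the column reordering (the helper is hypothetical and only models the `before=` case): the full-list scan is why the cost grows with table width rather than with the number of columns being moved.

```python
# Minimal model of relocate's column reordering -- illustrative only.
def relocate(columns: list, name: str, before: str) -> list:
    # The naive approach scans the full column list, so cost grows with
    # table width rather than with the columns being relocated.
    reordered = [c for c in columns if c != name]
    reordered.insert(reordered.index(before), name)
    return reordered


cols = [f"col{i}" for i in range(5)]
relocate(cols, "col4", before="col2")
# -> ['col0', 'col1', 'col4', 'col2', 'col3']
```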

cpcloud avatar May 03 '24 13:05 cpcloud

Unfortunately, I can't, as I'd need to mask column names and code logic for IP reasons. It's two input tables with around 50 columns each, one table with ~10 columns, and then various operations on top of it. But happy to test out any PRs if there is a wheel file available!

binste avatar May 03 '24 13:05 binste

Naive benchmark here, but for a quick test:

# drop_test.py
import ibis
import time
from contextlib import contextmanager


@contextmanager
def tictoc(num_cols):
    tic = time.time()
    yield
    toc = time.time()
    print(f"| {num_cols} | {toc - tic} seconds")


print(f"{ibis.__version__=}")


for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
    t = ibis.table(name="t", schema=[(f"a{i}", "int") for i in range(num_cols)])

    with tictoc(num_cols):
        t.drop("a8")
🐚 python drop_test.py
ibis.__version__='8.0.0'
| 10 | 0.18016910552978516 seconds
| 20 | 0.0016529560089111328 seconds
| 50 | 0.0037081241607666016 seconds
| 100 | 0.007429361343383789 seconds
| 200 | 0.013902902603149414 seconds
| 500 | 0.03390693664550781 seconds
| 1000 | 0.06650948524475098 seconds
🐚 python drop_test.py
ibis.__version__='9.0.0'
| 10 | 0.002298593521118164 seconds
| 20 | 0.005956888198852539 seconds
| 50 | 0.027918338775634766 seconds
| 100 | 0.09690213203430176 seconds
| 200 | 0.36721324920654297 seconds
| 500 | 2.301510810852051 seconds
| 1000 | 9.317416191101074 seconds

gforsyth avatar May 03 '24 18:05 gforsyth

@binste Any chance you can try with a pre-release wheel?

Benchmarking locally (I'll PR this code in a bit) I do still see linear scaling, but the constant factor isn't nearly as bad as @gforsyth's benchmark shows

ibis/tests/benchmarks/test_benchmarks.py .......                                                                                                                                                        [100%]


----------------------------------------------------------------------------- benchmark 'test_wide_drop': 7 tests -----------------------------------------------------------------------------
Name (time in ms)             Min                 Max                Mean             StdDev              Median                IQR            Outliers       OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_wide_drop[10]         1.5730 (1.0)        2.1650 (1.0)        1.6427 (1.0)       0.0519 (1.0)        1.6336 (1.0)       0.0436 (1.22)        71;24  608.7550 (1.0)         404           1
test_wide_drop[20]         2.8863 (1.83)       3.5940 (1.66)       2.9740 (1.81)      0.0554 (1.07)       2.9666 (1.82)      0.0357 (1.0)         19;14  336.2452 (0.55)        317           1
test_wide_drop[50]         6.8432 (4.35)       7.4251 (3.43)       7.0100 (4.27)      0.0750 (1.45)       7.0007 (4.29)      0.0611 (1.71)         28;6  142.6526 (0.23)        138           1
test_wide_drop[100]       13.6984 (8.71)      14.2588 (6.59)      13.9413 (8.49)      0.1101 (2.12)      13.9258 (8.52)      0.1516 (4.25)         20;1   71.7293 (0.12)         70           1
test_wide_drop[200]       26.8306 (17.06)     28.1245 (12.99)     27.5633 (16.78)     0.4098 (7.90)      27.6801 (16.94)     0.7497 (21.03)        14;0   36.2801 (0.06)         36           1
test_wide_drop[500]       67.1504 (42.69)    112.0973 (51.78)     70.7434 (43.07)    11.4473 (220.75)    67.6779 (41.43)     0.7385 (20.72)         1;1   14.1356 (0.02)         15           1
test_wide_drop[1000]     135.7117 (86.27)    185.6611 (85.76)    147.9776 (90.08)    20.9796 (404.56)   137.2259 (84.00)    21.4762 (602.41)        2;0    6.7578 (0.01)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
============================================================================================== 7 passed in 8.82s ==============================================================================================

cpcloud avatar Jun 25 '24 17:06 cpcloud

Thanks for working on this! Yes, I can test it with a pre-release wheel. Let me know when it's ready and I'll give it a try.

binste avatar Jun 26 '24 09:06 binste

@binste The pre-release wheels are published every Sunday, so give it a go as soon as you're able. You can use something like this to install a pre-release wheel of ibis-framework with the duckdb backend dependencies:

pip install --pre -U 'ibis-framework[duckdb]'

cpcloud avatar Jun 26 '24 10:06 cpcloud

Tested it with ibis-framework==10.0.0.dev120 and it's a big improvement! Executing all our internal dbt-ibis models now takes 8 seconds instead of 41. :) If the same could be done for relocate, rename, etc., I'd assume that number goes down further, but thanks already for this! Looking forward to version 10.

binste avatar Jul 01 '24 08:07 binste

@binste Following up here, would you be able to profile your code again and determine whether relocate and rename are still performing undesirably?

cpcloud avatar Jul 19 '24 13:07 cpcloud

I did a quick benchmark on rename(). I reused Gil's example but now doing:

for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
    t = ibis.table(name="t", schema=[(f"MyCol{i}", "int") for i in range(num_cols)])

    with tictoc(num_cols):
        t.rename("snake_case")

and here are the results, this looks ok to me:

 ➜ python rename_test.py
ibis.__version__='8.0.0'
| 10 | 0.24251699447631836 seconds
| 20 | 0.0025930404663085938 seconds
| 50 | 0.009926080703735352 seconds
| 100 | 0.007010936737060547 seconds
| 200 | 0.021643877029418945 seconds
| 500 | 0.04538917541503906 seconds
| 1000 | 0.06621170043945312 seconds
 ➜ python rename_test.py
ibis.__version__='9.1.0'
| 10 | 0.0009710788726806641 seconds
| 20 | 0.0013387203216552734 seconds
| 50 | 0.0033159255981445312 seconds
| 100 | 0.006360054016113281 seconds
| 200 | 0.012788057327270508 seconds
| 500 | 0.031948089599609375 seconds
| 1000 | 0.06496000289916992 seconds

ncclementi avatar Jul 19 '24 17:07 ncclementi

@ncclementi Thanks! Yeah, I see slightly different results for main:

| 10 | 0.0028498172760009766 seconds
| 20 | 0.0045757293701171875 seconds
| 50 | 0.011339426040649414 seconds
| 100 | 0.02110433578491211 seconds
| 200 | 0.04119515419006348 seconds
| 500 | 0.10248041152954102 seconds
| 1000 | 0.2085111141204834 seconds

cpcloud avatar Jul 19 '24 17:07 cpcloud

Using pyinstrument, I see about a 2x speedup in rename from 8.0.0 to 9.1.0:

script:

   from __future__ import annotations

   import ibis

   num_cols = 1000
   t = ibis.table(name="t", schema=[(f"MyCol{i}", "int") for i in range(num_cols)])

   method = "snake_case"

   from pyinstrument import Profiler

   profiler = Profiler()
   profiler.start()

   t.rename(method)

   profiler.stop()

   profiler.print()

cpcloud avatar Jul 19 '24 18:07 cpcloud

I think there are a couple things left to explore here.

rename and relocate in the worst case need to look at every column, and that's currently the bottleneck AFAICT. Still, exploring with pyinstrument to see what is so expensive per column is worth looking into; that way we wouldn't have to gut anything in the top-level method if we can figure out how to improve the per-column cost.

For example, in a past version of a performance problem related to wide tables, we were doing some unnecessary validation of columns that were already part of a table and thus had already been validated. There might be something like that here.

I would first start by running the smallest possible bit of code that is slow through pyinstrument and then understanding whether the per column cost can be reduced.

cpcloud avatar Jul 19 '24 19:07 cpcloud

For example, for rename, we probably don't need any of the DerefMap machinery that's happening in bind. DerefMap is for cases where there might otherwise be ambiguous parent tables, but in a rename we already know that everything is bound.

cpcloud avatar Jul 19 '24 19:07 cpcloud

Same with the rewrite_project_input calls, we don't need them.

cpcloud avatar Jul 19 '24 19:07 cpcloud

I'll throw up a PR to remove those and see what breaks.

cpcloud avatar Jul 19 '24 19:07 cpcloud

For relocate the problem is a bit trickier, but since it's currently using the full binding machinery as well, we can probably shave a decent constant factor off the top, and follow up with algorithmic improvements later.

cpcloud avatar Jul 19 '24 19:07 cpcloud

@binste I've put up ~two~ three PRs that should hopefully put this issue to bed for now 😄.

#9641: rename performance

Avoids some redundant validation overhead that becomes expensive once there are enough columns.

Total improvement: 6-12x according to benchmarks.

#9644: relocate performance

This PR includes the changes from #9641 and #9646 and fixes an issue with unnecessary repeated traversal of every column.

Total improvement: 7-12x according to benchmarks.

#9646: drop performance

We tackled a big chunk of this in a previous PR (#9440), but there was a bit left over after poking around using the fast(er) column binding I discovered in #9641.

Total improvement: 2-3x according to benchmarks.

If you can give #9644 a try and report back that'd be great!

cpcloud avatar Jul 20 '24 12:07 cpcloud

Great progress!! Happy to see these speedups; that's really going to help us with our dbt-ibis setup. Once there is a dev wheel file (I guess once #9644 is merged and a wheel is published on a Sunday), I'm happy to try it out and report back! Thanks again.

binste avatar Jul 20 '24 19:07 binste

@binste We'll cut 9.2.0 ~today, so you should be able to give this a go shortly 🚀

cpcloud avatar Jul 22 '24 17:07 cpcloud

Importing, executing, and compiling the 56 Ibis expressions in our data warehouse to SQL now takes 6-7 seconds 🚀🚀🚀 That's much better, thank you very much for digging into this and fixing it!!

Maybe I get to do a more detailed benchmarking at a later stage to see if I can identify anything else but this definitely puts the issue to rest regarding wide tables :)

binste avatar Jul 23 '24 09:07 binste

@binste Nice!

So, just to clarify, when you opened this issue you had a single expression that took ~30 seconds to compile and now every expression out of 56 compiles in 6-7 seconds?

Want to make sure I'm capturing the improvement accurately!

cpcloud avatar Jul 23 '24 13:07 cpcloud

I don't have the exact figures anymore but it's roughly this:

  • At one point, I think in Ibis 8 or 9.0, a single expression on a wide table took ~30 seconds to compile.
  • Compiling all 56 expressions took between 30 and 50 seconds in Ibis 8/9.0, with improvements between 8 and 9
  • Now, we're down to a total compilation time of ~6 seconds for all 56 expressions

binste avatar Jul 24 '24 06:07 binste