perf: Ibis slows down considerably for wider tables
What happened?
In a web application, I'm creating rather complicated Ibis expressions in the backend. Executing the final expression takes < 1 second, but creating the expression itself took up to 30 seconds. It took me a while to figure out what's going on, but I think it's because Ibis gets considerably slower for wider tables. I don't need all columns in my app, so I was able to improve performance by pruning away what I don't need right from the beginning.
However, in case you see any potential for improvement, here is some example code to demonstrate it. If this is just inherent to Ibis, what do you think about adding a note to the documentation? I only found #6832, which is somewhat related.
Setup
import ibis
import ibis.selectors as s
from ibis import deferred as d_
t = ibis.table(name="base1", schema={f"col{i}": "int64" for i in range(200)})
Drop and relocate
In more complex expressions, having multiple of these drop statements can quickly add up:
t1 = t.drop("col0", "col1", "col2") # 500ms
Same for relocate:
t1 = t.relocate("col100", before="col2") # 1,700ms -> 1.7 seconds
select and mutate do not have this issue:
t1 = t.select("col0", "col1", "col2") # 3ms
t1 = t.mutate(colnew=ibis.literal(1)) # 20ms
Selectors
I've also noticed that Ibis selectors can be much slower than using a pure Python implementation:
t.mutate(s.across(s.c(*[f"col{i}" for i in range(20)]), d_.cast("str"))) # 580ms
t.mutate(**{f"col{i}": d_[f"col{i}"] for i in range(20)}) # 70ms
What version of ibis are you using?
9.0.0
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I've benchmarked Ibis 8 vs. 9 on 55 Ibis expressions which are part of a Data Warehouse built using https://github.com/binste/dbt-ibis. I've measured:
- Execution of Ibis code: The time it takes to execute the Ibis code to get to the final expression
- Convert Ibis expression to SQL: The time it takes to compile the final Ibis expression to SQL
Here is some pseudocode to illustrate:
import ibis
t = ibis.table(...)
# --------------------------
# START: Execution of Ibis code
# --------------------------
t = t.group_by(...).select(...)
# --------------------------
# END: Execution of Ibis code
# --------------------------
# --------------------------
# START: Convert Ibis expression to SQL
# --------------------------
ibis.to_sql(t)
# --------------------------
# END: Convert Ibis expression to SQL
# --------------------------
The great news is that the compilation to SQL got significantly faster with the move to SQLGlot, which is super nice! :) The execution of Ibis code, on the other hand, got a bit slower, with one expression taking significantly longer at 11 seconds. I've profiled that expression and most of the time is spent in the following statements:
- Table.relocate ibis/expr/types/relations.py:4279
  - The code here looks like this: .relocate("valid_to", after="valid_from")
- Join.wrapper ibis/expr/types/joins.py:188
- Table.drop ibis/expr/types/relations.py:2329
- Table.rename ibis/expr/types/relations.py:2140
Hope this helps!
@binste Thanks for digging into this. Interesting results!
Can you share the query that's now taking 11 seconds with Ibis 9.0?
Regarding drop and relocate, they're both implemented using a somewhat naive approach:
- drop: project every column except the input columns
- relocate: basically a reprojection that loops through every column

I suspect that for drop we can turn it into a special operation so that it scales with the number of dropped columns instead of the total number of columns. A place to start might be:
- create a special Drop relational operation
- implement the column drop in select merging by looping over the drop list and removing each element of it from the Select fields. Select fields are dicts, so that should get us to O(drop list) instead of O(all columns).
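To illustrate the dict-based merging idea, here is a minimal sketch (not Ibis's actual internals; `merge_drop` and the plain dict standing in for the Select fields are hypothetical):

```python
# Hypothetical sketch of drop merging over Select fields modeled as a
# plain dict: delete each dropped name in place, so the work scales
# with the number of dropped columns rather than the total column count.
def merge_drop(fields: dict, drop: list) -> None:
    for name in drop:
        del fields[name]  # dict deletion is O(1) per dropped column


fields = {f"col{i}": i for i in range(200)}
merge_drop(fields, ["col0", "col1", "col2"])
print(len(fields))  # 197
```

Since Python dicts preserve insertion order, the remaining columns keep their original order after the deletions.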
We may be able to take a similar approach for relocate by using a data structure more optimized for the operation. I think something that has fast ordered inserts (perhaps sortedcontainers has a sorted map container?) might be a good place to start.
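For relocate, a plain-list sketch of the reordering (hypothetical, and deliberately simpler than the sorted-map approach suggested above) shows the basic operation:

```python
# Hypothetical sketch of relocate: move `col` so it appears directly
# before `anchor` in the column ordering. This is still O(n) per call,
# which is why a structure with fast ordered inserts could help.
def relocate_before(columns: list, col: str, anchor: str) -> list:
    remaining = [c for c in columns if c != col]
    idx = remaining.index(anchor)
    return remaining[:idx] + [col] + remaining[idx:]


cols = [f"col{i}" for i in range(200)]
new_order = relocate_before(cols, "col100", "col2")
print(new_order[:4])  # ['col0', 'col1', 'col100', 'col2']
```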
Unfortunately, I can't as I'd need to mask column names and code logic for IP reasons. It's 2 input tables with each around 50 columns and 1 table with ~10 columns and then various operations on top of it. But happy to test out any PRs if there is a wheel file available!
Naive benchmark here, but for a quick test:
# drop_test.py
import ibis
import time
from contextlib import contextmanager


@contextmanager
def tictoc(num_cols):
    tic = time.time()
    yield
    toc = time.time()
    print(f"| {num_cols} | {toc - tic} seconds")


print(f"{ibis.__version__=}")

for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
    t = ibis.table(name="t", schema=[(f"a{i}", "int") for i in range(num_cols)])
    with tictoc(num_cols):
        t.drop("a8")
🐚 python drop_test.py
ibis.__version__='8.0.0'
| 10 | 0.18016910552978516 seconds
| 20 | 0.0016529560089111328 seconds
| 50 | 0.0037081241607666016 seconds
| 100 | 0.007429361343383789 seconds
| 200 | 0.013902902603149414 seconds
| 500 | 0.03390693664550781 seconds
| 1000 | 0.06650948524475098 seconds
🐚 python drop_test.py
ibis.__version__='9.0.0'
| 10 | 0.002298593521118164 seconds
| 20 | 0.005956888198852539 seconds
| 50 | 0.027918338775634766 seconds
| 100 | 0.09690213203430176 seconds
| 200 | 0.36721324920654297 seconds
| 500 | 2.301510810852051 seconds
| 1000 | 9.317416191101074 seconds
@binste Any chance you can try with a pre-release wheel?
Benchmarking locally (I'll PR this code in a bit) I do still see linear scaling, but the constant factor isn't nearly as bad as @gforsyth's benchmark shows
ibis/tests/benchmarks/test_benchmarks.py ....... [100%]
----------------------------------------------------------------------------- benchmark 'test_wide_drop': 7 tests -----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_wide_drop[10] 1.5730 (1.0) 2.1650 (1.0) 1.6427 (1.0) 0.0519 (1.0) 1.6336 (1.0) 0.0436 (1.22) 71;24 608.7550 (1.0) 404 1
test_wide_drop[20] 2.8863 (1.83) 3.5940 (1.66) 2.9740 (1.81) 0.0554 (1.07) 2.9666 (1.82) 0.0357 (1.0) 19;14 336.2452 (0.55) 317 1
test_wide_drop[50] 6.8432 (4.35) 7.4251 (3.43) 7.0100 (4.27) 0.0750 (1.45) 7.0007 (4.29) 0.0611 (1.71) 28;6 142.6526 (0.23) 138 1
test_wide_drop[100] 13.6984 (8.71) 14.2588 (6.59) 13.9413 (8.49) 0.1101 (2.12) 13.9258 (8.52) 0.1516 (4.25) 20;1 71.7293 (0.12) 70 1
test_wide_drop[200] 26.8306 (17.06) 28.1245 (12.99) 27.5633 (16.78) 0.4098 (7.90) 27.6801 (16.94) 0.7497 (21.03) 14;0 36.2801 (0.06) 36 1
test_wide_drop[500] 67.1504 (42.69) 112.0973 (51.78) 70.7434 (43.07) 11.4473 (220.75) 67.6779 (41.43) 0.7385 (20.72) 1;1 14.1356 (0.02) 15 1
test_wide_drop[1000] 135.7117 (86.27) 185.6611 (85.76) 147.9776 (90.08) 20.9796 (404.56) 137.2259 (84.00) 21.4762 (602.41) 2;0 6.7578 (0.01) 8 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
============================================================================================== 7 passed in 8.82s ==============================================================================================
Thanks for working on this! Yes, I can test it with a pre-release wheel. Let me know when it's ready and I'll give it a try.
@binste The pre-release wheels are published every Sunday, so give it a go as soon as you're able to, using something like this to install a pre-release wheel with ibis-framework + the duckdb backend dependencies:
pip install --pre -U 'ibis-framework[duckdb]'
Tested it with ibis-framework==10.0.0.dev120 and it's a big improvement! Executing all our internal dbt-ibis models now takes 8 seconds instead of 41. :) If the same could be done for relocate, rename, etc., I'd assume that number goes down further, but thanks already for this! Looking forward to version 10.
@binste Following up here, would you be able to profile your code again and determine whether relocate and rename are still performing undesirably?
I did a quick benchmark on rename(). I reused Gil's example, but now doing:
for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
    t = ibis.table(name="t", schema=[(f"MyCol{i}", "int") for i in range(num_cols)])
    with tictoc(num_cols):
        t.rename("snake_case")
and here are the results; this looks OK to me:
➜ python rename_test.py
ibis.__version__='8.0.0'
| 10 | 0.24251699447631836 seconds
| 20 | 0.0025930404663085938 seconds
| 50 | 0.009926080703735352 seconds
| 100 | 0.007010936737060547 seconds
| 200 | 0.021643877029418945 seconds
| 500 | 0.04538917541503906 seconds
| 1000 | 0.06621170043945312 seconds
➜ python rename_test.py
ibis.__version__='9.1.0'
| 10 | 0.0009710788726806641 seconds
| 20 | 0.0013387203216552734 seconds
| 50 | 0.0033159255981445312 seconds
| 100 | 0.006360054016113281 seconds
| 200 | 0.012788057327270508 seconds
| 500 | 0.031948089599609375 seconds
| 1000 | 0.06496000289916992 seconds
@ncclementi Thanks! Yeah, I see slightly different results for main:
| 10 | 0.0028498172760009766 seconds
| 20 | 0.0045757293701171875 seconds
| 50 | 0.011339426040649414 seconds
| 100 | 0.02110433578491211 seconds
| 200 | 0.04119515419006348 seconds
| 500 | 0.10248041152954102 seconds
| 1000 | 0.2085111141204834 seconds
Using pyinstrument, I see about a 2x speedup in rename from 8.0.0 to 9.1.0:
script:
from __future__ import annotations
import ibis
num_cols = 1000
t = ibis.table(name="t", schema=[(f"MyCol{i}", "int") for i in range(num_cols)])
method = "snake_case"
from pyinstrument import Profiler
profiler = Profiler()
profiler.start()
t.rename(method)
profiler.stop()
profiler.print()
I think there are a couple things left to explore here.
rename and relocate in the worst case need to look at every column, and that's currently the bottleneck AFAICT. Exploring with pyinstrument to see what is so expensive per column is worth looking into; that way we wouldn't have to gut anything in the top-level method if we can figure out how to improve the per-column cost.
For example, in a past version of a performance problem related to wide tables, we were doing some unnecessary validation of columns that were already part of a table and thus had already been validated. There might be something like that here.
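As a generic illustration of that pattern (hypothetical code, not Ibis's internals), caching which names have already been validated lets repeated operations skip the check:

```python
# Hypothetical illustration: remember which column names have already
# passed validation so later operations on the same table skip the
# expensive schema check.
class ColumnValidator:
    def __init__(self, schema):
        self.schema = set(schema)
        self.validated = set()

    def validate(self, name):
        if name in self.validated:
            return  # already checked once; skip the expensive path
        if name not in self.schema:
            raise KeyError(f"unknown column: {name}")
        self.validated.add(name)


v = ColumnValidator([f"col{i}" for i in range(200)])
v.validate("col5")
v.validate("col5")  # second call is just a cheap set lookup
print(len(v.validated))  # 1
```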
I would first start by running the smallest possible bit of code that is slow through pyinstrument and then understanding whether the per column cost can be reduced.
For example, for rename, we probably don't need any of the DerefMap machinery that's happening in bind. DerefMap is for cases where there might otherwise be ambiguous parent tables, but in a rename we already know that everything is bound.
Same with the rewrite_project_input calls, we don't need them.
I'll throw up a PR to remove those and see what breaks.
For relocate the problem is a bit trickier, but since it's currently using the full binding machinery as well, we can probably shave a decent constant factor off the top, and follow up with algorithmic improvements later.
@binste I've put up ~two~ three PRs that should hopefully put this issue to bed for now 😄.
#9641: rename performance
Avoids some redundant validation overhead that becomes expensive once there are enough columns.
Total improvement: 6-12x according to benchmarks.
#9644: relocate performance
This PR includes the changes from #9641 and #9646 and fixes an issue with unnecessary repeated traversal of every column.
Total improvement: 7-12x according to benchmarks.
#9646: drop performance
We tackled a big chunk of this in a previous PR (#9440), but there was a bit left after poking around using the fast(er) column binding I discovered in #9641.
Total improvement: 2-3x according to benchmarks.
If you can give #9644 a try and report back that'd be great!
Great progress!! Happy to see these speedups; that's really going to help us with our dbt-ibis setup. Once there is a dev wheel file (I guess once #9644 is merged and a wheel is published on a Sunday), I'm happy to try it out and report back! Thanks again.
@binste We'll cut 9.2.0 ~today, so you should be able to give this a go shortly 🚀
Importing, executing, and compiling the 56 Ibis expressions in our data warehouse to SQL now takes 6-7 seconds 🚀🚀🚀 That's much better, thank you very much for digging into this and fixing it!!
Maybe I get to do a more detailed benchmarking at a later stage to see if I can identify anything else but this definitely puts the issue to rest regarding wide tables :)
@binste Nice!
So, just to clarify, when you opened this issue you had a single expression that took ~30 seconds to compile and now every expression out of 56 compiles in 6-7 seconds?
Want to make sure I'm capturing the improvement accurately!
I don't have the exact figures anymore but it's roughly this:
- At one point, I think in Ibis 8 or 9.0, a single expression on a wide table took ~30 seconds to compile.
- Compiling all 56 expressions took between 30-50 seconds in Ibis 8/9.0 with improvements between 8 and 9
- Now, we're down to a total compilation time of ~6 seconds for all 56 expressions