ibis
ibis copied to clipboard
fix: rewrite the union to enable the union of large number of tables
Description of changes
When I am working on the https://github.com/ibis-project/ibis/pull/9139#issuecomment-2101711159, I had a RecursionError: maximum recursion depth exceeded while calling a Python object
when union large number of tables, I found we had this similar issues before, https://github.com/ibis-project/ibis/issues/7124 and https://github.com/tobymao/sqlglot/issues/2961.
I am trying to rewrite the union
in our codebase, It recursively union the tables using a queue, the number of unions is still similar to the original implementation, but it allow us to union more tables than before (It could unblock #9139 and added a test for larger number of tables where the original implementation will fail). Not sure if there is other side effect or not. If it is not a good solution, I could close this PR. Thanks for your review.
I was thinking to open an issue for discussion first, since I had some code ready, just open this for discussion, if it is not the best practice to file a PR, please let me know.
Update from the below comment
The updated implementation generates SQL in a manner that reduces the depth of nesting, thereby delaying the hitting of the recursion error in the SQLglot implementation
sql query: Union 7 tables in the new implementation
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
*
FROM "ibis_pandas_memtable_3dqxlaaxrjbrhn6sezkq3jfey4" AS "t0"
UNION ALL
SELECT
*
FROM (
SELECT
*
FROM "ibis_pandas_memtable_gjf7ximrfnat7ocq7u3snufdwy" AS "t1"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_khxo6jdnpnbc3lz4kdpkoll6ny" AS "t2"
) AS "t7"
) AS "t10"
UNION ALL
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
*
FROM "ibis_pandas_memtable_4wg3k2vsvzf4tophotkj6n6ope" AS "t3"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_zpp6dkndfnhjla473tjklyp3ku" AS "t4"
) AS "t8"
UNION ALL
SELECT
*
FROM (
SELECT
*
FROM "ibis_pandas_memtable_y5w7wpkhkbgr3agcuc3ftnujai" AS "t5"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_ebrddatk6bfgti4x7gwrkgi3gm" AS "t6"
) AS "t9"
) AS "t11"
) AS "t12"
SQL query for the same union from the original implementation
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
*
FROM (
SELECT
*
FROM "ibis_pandas_memtable_n4qzohbx3nfjdi6crezpa3zkfu" AS "t5"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_3dhomfxvdfb2tezjssmbdkvvam" AS "t6"
) AS "t7"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_mqknkr6dr5gv7hpcrri34hxydq" AS "t4"
) AS "t8"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_v5wru54hhfabxd7khavq27zruq" AS "t3"
) AS "t9"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_m4hgmj33nreopoxvxwhkycicfy" AS "t2"
) AS "t10"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_lilxpxko6jh4lplqw2kp6jny3y" AS "t1"
) AS "t11"
UNION ALL
SELECT
*
FROM "ibis_pandas_memtable_mxiz4rs5z5dohdjxror5hm6p5a" AS "t0"
) AS "t12"
Issues closed
So currently both the ibis and sqlglot representations are deeply nested using pairwise union operations. We could change it on the ibis side to be variadic but we would still hit the recursion error in the sqlglot implementation, but at least the error wouldn't come from our side.
we would still hit the recursion error in the sqlglot implementation
The updated implementation generates SQL in a manner that reduces the depth of nesting, thereby delaying the hitting of the recursion error in the SQLglot implementation.
See the generated sql query fabove.
my current unit test uses ibis.to_sql()
or execute()
, it does not work under the current path, I am looking for some suggestions too, where or how to do the unit test.
Ok, so the new compile benchmark simply cannot be run on main
, because it hits a recursion error.
After this PR, the benchmarks look like this:
-------------------------------------- benchmark 'test_large_union_compile': 1 tests --------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------
test_large_union_compile 807.3274 871.2054 825.3349 26.0861 817.4346 21.9607 1;1 1.2116 5 1
---------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_large_union_construct': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_union_construct (0001_4c0ba2c) 79.3308 (1.00) 83.1229 (1.00) 80.6946 (1.00) 1.1855 (1.0) 80.0964 (1.0) 1.7544 (1.0) 4;0 12.3924 (1.00) 13 1
test_large_union_construct (NOW) 79.0677 (1.0) 82.8776 (1.0) 80.6321 (1.0) 1.2461 (1.05) 80.2037 (1.00) 1.8576 (1.06) 4;0 12.4020 (1.0) 12 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
Merging!