datafusion Better CSE identifier

This is a draft PR that implements the ideas from https://github.com/apache/datafusion/issues/10426#issuecomment-2105664520.

Which issue does this PR close?

Closes https://github.com/apache/datafusion/issues/10426.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

May 12 '24 18:05 peter-toth

I think @peter-toth plans to break this PR up into smaller ones, so marking it as a draft to make it clear it isn't waiting on more feedback. If I am mistaken, please let me know

May 15 '24 19:05 alamb

I think @peter-toth plans to break this PR up into smaller ones, so marking it as a draft to make it clear it isn't waiting on more feedback. If I am mistaken, please let me know

Yes, here is the first part that adds the new TreeNode APIs: https://github.com/apache/datafusion/pull/10543

May 16 '24 10:05 peter-toth

@alamb, can you please help me with that MSRV failure? I don't know what could be the source of the failure and if I try running cargo msrv verify locally it fails with:

% cargo msrv verify
Fetching index
Unable to find key 'package.rust-version' (or 'package.metadata.msrv') in '/Users/ptoth/git/apache/arrow-datafusion/Cargo.toml'

Jun 21 '24 09:06 peter-toth

This PR is more or less ready for review, tests are passing except for the MSRV. I focused only on the 3 performance improvements and deliberately kept the code as close to the original as possible to ease review. I plan to open a follow-up PR to tackle remaining TODOs and refactor CSE code a bit for better readability.

Local benchmarks show good perfroamance improvements, but @alamb please confirm it with your standard setup.

% critcmp main better-cse-identifier
group                                         better-cse-identifier                   main
-----                                         ---------------------                   ----
logical_aggregate_with_join                   1.00    529.3±8.70µs        ? ?/sec     1.00    528.4±6.17µs        ? ?/sec
logical_plan_tpcds_all                        1.00     71.3±2.74ms        ? ?/sec     1.06    75.4±16.47ms        ? ?/sec
logical_plan_tpch_all                         1.00      7.3±0.19ms        ? ?/sec     1.09      7.9±2.34ms        ? ?/sec
logical_select_all_from_1000                  1.02     15.9±2.00ms        ? ?/sec     1.00     15.6±1.27ms        ? ?/sec
logical_select_one_from_700                   1.00    399.9±3.73µs        ? ?/sec     1.02    408.3±4.29µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   422.7±92.70µs        ? ?/sec     1.00   423.7±88.41µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    380.5±5.11µs        ? ?/sec     1.01    385.3±4.22µs        ? ?/sec
physical_plan_tpcds_all                       1.00   536.8±11.44ms        ? ?/sec     1.09   586.2±20.88ms        ? ?/sec
physical_plan_tpch_all                        1.01    38.5±12.88ms        ? ?/sec     1.00     38.1±0.59ms        ? ?/sec
physical_plan_tpch_q1                         1.00  1108.3±12.50µs        ? ?/sec     1.43  1585.0±12.90µs        ? ?/sec
physical_plan_tpch_q10                        1.00  1598.6±219.13µs        ? ?/sec    1.15  1842.3±280.13µs        ? ?/sec
physical_plan_tpch_q11                        1.00  1339.7±14.58µs        ? ?/sec     1.15  1540.2±247.46µs        ? ?/sec
physical_plan_tpch_q12                        1.00  1117.3±63.22µs        ? ?/sec     1.12  1252.1±17.87µs        ? ?/sec
physical_plan_tpch_q13                        1.00   769.6±19.54µs        ? ?/sec     1.12   864.0±12.96µs        ? ?/sec
physical_plan_tpch_q14                        1.00  930.6±216.63µs        ? ?/sec     1.16  1081.1±125.78µs        ? ?/sec
physical_plan_tpch_q16                        1.00  1331.4±38.36µs        ? ?/sec     1.15  1534.8±261.21µs        ? ?/sec
physical_plan_tpch_q17                        1.00  1209.8±29.88µs        ? ?/sec     1.09  1322.0±17.01µs        ? ?/sec
physical_plan_tpch_q18                        1.00  1532.7±240.18µs        ? ?/sec    1.04  1593.0±20.68µs        ? ?/sec
physical_plan_tpch_q19                        1.00      2.5±0.08ms        ? ?/sec     1.06      2.7±0.04ms        ? ?/sec
physical_plan_tpch_q2                         1.00      3.2±0.70ms        ? ?/sec     1.06      3.4±0.72ms        ? ?/sec
physical_plan_tpch_q20                        1.00  1642.3±16.61µs        ? ?/sec     1.12  1845.0±302.12µs        ? ?/sec
physical_plan_tpch_q21                        1.00      2.6±0.65ms        ? ?/sec     1.07      2.7±0.64ms        ? ?/sec
physical_plan_tpch_q22                        1.00  1196.6±101.79µs        ? ?/sec    1.08  1292.8±15.01µs        ? ?/sec
physical_plan_tpch_q3                         1.00  1105.8±28.61µs        ? ?/sec     1.23  1358.9±236.19µs        ? ?/sec
physical_plan_tpch_q4                         1.00   852.5±13.28µs        ? ?/sec     1.12  957.4±161.05µs        ? ?/sec
physical_plan_tpch_q5                         1.00  1685.6±23.24µs        ? ?/sec     1.10  1857.1±21.46µs        ? ?/sec
physical_plan_tpch_q6                         1.00  642.0±188.39µs        ? ?/sec     1.01   647.4±12.04µs        ? ?/sec
physical_plan_tpch_q7                         1.00      2.2±0.45ms        ? ?/sec     1.08      2.4±0.54ms        ? ?/sec
physical_plan_tpch_q8                         1.00      2.7±0.05ms        ? ?/sec     1.08      2.9±0.06ms        ? ?/sec
physical_plan_tpch_q9                         1.00  1964.3±55.93µs        ? ?/sec     1.08      2.1±0.04ms        ? ?/sec

Jun 21 '24 15:06 peter-toth

I am starting the benchmark run now -- I'll report back here and give this PR a look as soon as possible (but maybe not until tomorrow)

Thank you so much @peter-toth

Jun 21 '24 17:06 alamb

My results appear consistent with yours @peter-toth -- looks like maybe the high column count case is slightly worse -- I can maybe profile it to see if I can find any reason that might be

++ critcmp main better-cse-identifier
group                                         better-cse-identifier                  main
-----                                         ---------------------                  ----
logical_aggregate_with_join                   1.01  1023.5±41.82µs        ? ?/sec    1.00  1008.6±10.43µs        ? ?/sec
logical_plan_tpcds_all                        1.00    152.4±1.06ms        ? ?/sec    1.00    152.2±1.00ms        ? ?/sec
logical_plan_tpch_all                         1.00     16.9±0.19ms        ? ?/sec    1.01     17.1±0.26ms        ? ?/sec
logical_select_all_from_1000                  1.02     18.9±0.12ms        ? ?/sec    1.00     18.6±1.74ms        ? ?/sec
logical_select_one_from_700                   1.01    829.8±9.09µs        ? ?/sec    1.00   823.9±10.68µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   774.1±10.87µs        ? ?/sec    1.00   770.5±27.71µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00   765.0±32.06µs        ? ?/sec    1.00    763.9±9.85µs        ? ?/sec
physical_plan_tpcds_all                       1.00  1073.0±12.72ms        ? ?/sec    1.08   1159.6±5.63ms        ? ?/sec
physical_plan_tpch_all                        1.00     71.7±0.67ms        ? ?/sec    1.10     79.0±0.88ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.5±0.04ms        ? ?/sec    1.36      3.5±0.02ms        ? ?/sec
physical_plan_tpch_q10                        1.00      3.6±0.03ms        ? ?/sec    1.12      4.0±0.03ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.1±0.03ms        ? ?/sec    1.09      3.4±0.03ms        ? ?/sec
physical_plan_tpch_q12                        1.00      2.4±0.01ms        ? ?/sec    1.12      2.7±0.02ms        ? ?/sec
physical_plan_tpch_q13                        1.00  1805.2±95.39µs        ? ?/sec    1.11      2.0±0.01ms        ? ?/sec
physical_plan_tpch_q14                        1.00      2.0±0.03ms        ? ?/sec    1.18      2.4±0.02ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.0±0.03ms        ? ?/sec    1.11      3.3±0.03ms        ? ?/sec
physical_plan_tpch_q17                        1.00      2.8±0.02ms        ? ?/sec    1.09      3.0±0.03ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.2±0.03ms        ? ?/sec    1.09      3.5±0.02ms        ? ?/sec
physical_plan_tpch_q19                        1.00      4.9±0.06ms        ? ?/sec    1.05      5.2±0.05ms        ? ?/sec
physical_plan_tpch_q2                         1.00      6.3±0.04ms        ? ?/sec    1.05      6.6±0.05ms        ? ?/sec
physical_plan_tpch_q20                        1.00      3.7±0.04ms        ? ?/sec    1.07      3.9±0.04ms        ? ?/sec
physical_plan_tpch_q21                        1.00      5.1±0.06ms        ? ?/sec    1.06      5.4±0.03ms        ? ?/sec
physical_plan_tpch_q22                        1.00      2.7±0.02ms        ? ?/sec    1.10      3.0±0.03ms        ? ?/sec
physical_plan_tpch_q3                         1.00      2.6±0.03ms        ? ?/sec    1.16      2.9±0.02ms        ? ?/sec
physical_plan_tpch_q4                         1.00  1956.8±10.71µs        ? ?/sec    1.09      2.1±0.02ms        ? ?/sec
physical_plan_tpch_q5                         1.00      3.7±0.04ms        ? ?/sec    1.09      4.1±0.02ms        ? ?/sec
physical_plan_tpch_q6                         1.00   1316.0±9.67µs        ? ?/sec    1.10  1452.2±16.94µs        ? ?/sec
physical_plan_tpch_q7                         1.00      4.7±0.03ms        ? ?/sec    1.04      4.8±0.03ms        ? ?/sec
physical_plan_tpch_q8                         1.00      5.8±0.04ms        ? ?/sec    1.07      6.2±0.04ms        ? ?/sec
physical_plan_tpch_q9                         1.00      4.4±0.03ms        ? ?/sec    1.07      4.7±0.03ms        ? ?/sec
physical_select_all_from_1000                 1.04     46.8±0.32ms        ? ?/sec    1.00     45.1±0.20ms        ? ?/sec
physical_select_one_from_700                  1.03      3.5±0.02ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec

Jun 21 '24 18:06 alamb

@alamb, can you please help me with that MSRV failure? I don't know what could be the source of the failure and if I try running cargo msrv verify locally it fails with:
% cargo msrv verify
Fetching index
Unable to find key 'package.rust-version' (or 'package.metadata.msrv') in '/Users/ptoth/git/apache/arrow-datafusion/Cargo.toml'

I've bumped MSRV to 1.76 in https://github.com/apache/datafusion/pull/10473/commits/ccc92b9ab00813a69583abe2ec4ab5517107be35. It we want to avoid that then let me know and I will try to find out what is not available in 1.75.

Jun 22 '24 09:06 peter-toth

I ran the benchmarks again:

Looks to me like this PR makes planning 10% faster

physical_plan_tpcds_all                       1.00   1074.8±5.94ms        ? ?/sec    1.10   1181.1±6.54ms        ? ?/sec
physical_plan_tpch_all                        1.00     72.2±0.77ms        ? ?/sec    1.11     79.8±0.67ms        ? ?/sec

🚀

++ critcmp main better-cse-identifier
group                                         better-cse-identifier                  main
-----                                         ---------------------                  ----
logical_aggregate_with_join                   1.00  1038.1±44.37µs        ? ?/sec    1.00  1040.8±49.37µs        ? ?/sec
logical_plan_tpcds_all                        1.00    153.0±1.07ms        ? ?/sec    1.01    154.3±0.98ms        ? ?/sec
logical_plan_tpch_all                         1.00     17.1±0.20ms        ? ?/sec    1.01     17.2±0.20ms        ? ?/sec
logical_select_all_from_1000                  1.00     18.9±0.09ms        ? ?/sec    1.01     19.0±0.11ms        ? ?/sec
logical_select_one_from_700                   1.01   840.6±10.15µs        ? ?/sec    1.00    836.0±9.66µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   789.5±33.87µs        ? ?/sec    1.00   791.5±15.53µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00   771.3±21.54µs        ? ?/sec    1.02   785.4±21.89µs        ? ?/sec
physical_plan_tpcds_all                       1.00   1074.8±5.94ms        ? ?/sec    1.10   1181.1±6.54ms        ? ?/sec
physical_plan_tpch_all                        1.00     72.2±0.77ms        ? ?/sec    1.11     79.8±0.67ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.6±0.02ms        ? ?/sec    1.38      3.5±0.03ms        ? ?/sec
physical_plan_tpch_q10                        1.00      3.6±0.03ms        ? ?/sec    1.14      4.1±0.06ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.1±0.03ms        ? ?/sec    1.11      3.4±0.04ms        ? ?/sec
physical_plan_tpch_q12                        1.00      2.4±0.02ms        ? ?/sec    1.13      2.8±0.02ms        ? ?/sec
physical_plan_tpch_q13                        1.00  1814.3±19.22µs        ? ?/sec    1.12      2.0±0.02ms        ? ?/sec
physical_plan_tpch_q14                        1.00      2.0±0.02ms        ? ?/sec    1.18      2.4±0.02ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.0±0.03ms        ? ?/sec    1.12      3.4±0.03ms        ? ?/sec
physical_plan_tpch_q17                        1.00      2.8±0.03ms        ? ?/sec    1.10      3.1±0.03ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.3±0.03ms        ? ?/sec    1.11      3.6±0.04ms        ? ?/sec
physical_plan_tpch_q19                        1.00      4.8±0.05ms        ? ?/sec    1.07      5.2±0.05ms        ? ?/sec
physical_plan_tpch_q2                         1.00      6.4±0.15ms        ? ?/sec    1.05      6.7±0.05ms        ? ?/sec
physical_plan_tpch_q20                        1.00      3.6±0.03ms        ? ?/sec    1.08      3.9±0.03ms        ? ?/sec
physical_plan_tpch_q21                        1.00      5.1±0.07ms        ? ?/sec    1.08      5.4±0.04ms        ? ?/sec
physical_plan_tpch_q22                        1.00      2.7±0.03ms        ? ?/sec    1.10      3.0±0.02ms        ? ?/sec
physical_plan_tpch_q3                         1.00      2.6±0.02ms        ? ?/sec    1.16      3.0±0.02ms        ? ?/sec
physical_plan_tpch_q4                         1.00  1974.3±23.01µs        ? ?/sec    1.09      2.1±0.03ms        ? ?/sec
physical_plan_tpch_q5                         1.00      3.7±0.04ms        ? ?/sec    1.10      4.1±0.06ms        ? ?/sec
physical_plan_tpch_q6                         1.00  1321.5±35.02µs        ? ?/sec    1.11  1472.4±20.45µs        ? ?/sec
physical_plan_tpch_q7                         1.00      4.7±0.04ms        ? ?/sec    1.05      4.9±0.05ms        ? ?/sec
physical_plan_tpch_q8                         1.00      5.9±0.08ms        ? ?/sec    1.09      6.4±0.06ms        ? ?/sec
physical_plan_tpch_q9                         1.00      4.4±0.05ms        ? ?/sec    1.08      4.8±0.04ms        ? ?/sec
physical_select_all_from_1000                 1.00     46.7±0.40ms        ? ?/sec    1.02     47.7±0.29ms        ? ?/sec
physical_select_one_from_700                  1.00      3.5±0.02ms        ? ?/sec    1.01      3.5±0.03ms        ? ?/sec

Jun 23 '24 12:06 alamb

🚀 thanks again @peter-toth

Jun 24 '24 20:06 alamb

datafusion datafusion copied to clipboard

Better CSE identifier

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

datafusion
datafusion copied to clipboard