Use btree to search fields in DFSchema

Open oleggator opened this issue 1 year ago • 10 comments

Which issue does this PR close?

Part of #7698.

Rationale for this change

The current DFSchema implementation stores its fields in a vector, so looking up a column by name is algorithmically expensive (a scan over the fields).

What changes are included in this PR?

Use BTreeMap to index field qualifiers.
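To make the trade-off concrete, here is a minimal sketch of the two lookup strategies. It is not the PR's code: the `Field` struct and the function names are hypothetical stand-ins, and only `std::collections::BTreeMap` is a real API.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a schema field: an optional table qualifier plus a name.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

/// Lookup over a plain vector of fields: scans until the first match, O(n).
pub fn index_by_scan(fields: &[Field], name: &str) -> Option<usize> {
    fields.iter().position(|f| f.name == name)
}

/// Builds an index from field name to the positions of every field with that name.
/// Building costs O(n log n) and allocates, which is paid on every schema construction.
pub fn build_index(fields: &[Field]) -> BTreeMap<String, Vec<usize>> {
    let mut index: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (i, f) in fields.iter().enumerate() {
        index.entry(f.name.clone()).or_default().push(i);
    }
    index
}

/// Lookup through the index: O(log n), returns the first position with that name.
pub fn index_by_btree(index: &BTreeMap<String, Vec<usize>>, name: &str) -> Option<usize> {
    index.get(name).and_then(|positions| positions.first().copied())
}
```

The index makes each lookup cheaper but has to be built up front, so whether it pays off depends on how many lookups a schema sees compared with how often schemas are constructed.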

Are these changes tested?

Are there any user-facing changes?

No

oleggator avatar Oct 19 '23 13:10 oleggator

Is there a reason to use a b-tree ( $\mathrm{O}(\log{n})$ ) vs a hash map ( $\mathrm{O}(1)$ )?

crepererum avatar Oct 23 '23 15:10 crepererum

I plan to review this and related PRs tomorrow morning

alamb avatar Oct 25 '23 22:10 alamb

Related comment: https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1781787244

alamb avatar Oct 26 '23 19:10 alamb

Is there a reason to use a b-tree ( O(log n) ) vs a hash map ( O(1) )?

With a b-tree we can query all fields matching a "prefix" in one O(log n) hop (column.*.*.*, column.table.*.*, column.table.schema.*, column.table.schema.catalog). This is used in the fields_with_unqualified_name method to query all fields with a specific name.
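The prefix query described here can be sketched with `BTreeMap::range`. This is an illustration under assumptions, not the PR's implementation: the key is simplified to `(column name, optional table qualifier)`, and `QualifiedIndex`, `build_qualified_index` and the free function `fields_with_unqualified_name` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical index key: column name first, then an optional table qualifier.
/// Putting the name first makes all entries for one column name adjacent in key order.
pub type QualifiedIndex = BTreeMap<(String, Option<String>), usize>;

/// Builds the index from (qualifier, name) pairs; the value is the field position.
/// A repeated (name, qualifier) pair overwrites the earlier position in this sketch.
pub fn build_qualified_index(fields: &[(Option<&str>, &str)]) -> QualifiedIndex {
    fields
        .iter()
        .enumerate()
        .map(|(i, (qualifier, name))| ((name.to_string(), qualifier.map(str::to_string)), i))
        .collect()
}

/// Returns the positions of all fields called `name`, whatever their qualifier.
/// One O(log n) seek to the first candidate key, then a walk over the matches.
pub fn fields_with_unqualified_name(index: &QualifiedIndex, name: &str) -> Vec<usize> {
    // `None` sorts before every `Some(_)`, so this is the smallest key for `name`.
    let start: (String, Option<String>) = (name.to_string(), None);
    index
        .range(start..)
        .take_while(|(key, _)| key.0.as_str() == name)
        .map(|(_, &pos)| pos)
        .collect()
}
```

A hash map keyed on the full qualified name could answer the exact-match case, but it has no key ordering to exploit for the "any qualifier" case, which is the point made in the comment.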

oleggator avatar Oct 26 '23 21:10 oleggator

Is there a reason to use a b-tree ( O(log n) ) vs a hash map ( O(1) )?

With a b-tree we can query all fields matching a "prefix" in one O(log n) hop (column.*.*.*, column.table.*.*, column.table.schema.*, column.table.schema.catalog). This is used in the fields_with_unqualified_name method to query all fields with a specific name.

Is that such a common operation that it is worth keeping an expensive index on every single schema in the query graph? I think the planner that resolves these names can easily order the fields and build this index locally.
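One way to read this suggestion is an index that the caller builds only where it resolves names, instead of one stored in every schema. The sketch below assumes that reading; `LocalNameIndex` is a hypothetical type, not something in DataFusion.

```rust
/// Hypothetical planner-side index: field names are sorted once, then binary-searched.
/// It borrows the names, so it lives only as long as the caller needs it.
pub struct LocalNameIndex<'a> {
    sorted: Vec<(&'a str, usize)>,
}

impl<'a> LocalNameIndex<'a> {
    /// Sorts (name, position) pairs; O(n log n), paid only by callers that need lookups.
    pub fn new(names: &'a [String]) -> Self {
        let mut sorted: Vec<(&str, usize)> = names
            .iter()
            .enumerate()
            .map(|(i, n)| (n.as_str(), i))
            .collect();
        sorted.sort_unstable();
        Self { sorted }
    }

    /// Positions of all fields called `name`, in ascending order.
    pub fn positions(&self, name: &str) -> Vec<usize> {
        // Binary search for the first entry whose name is not less than `name`.
        let start = self.sorted.partition_point(|(n, _)| *n < name);
        self.sorted[start..]
            .iter()
            .take_while(|(n, _)| *n == name)
            .map(|&(_, i)| i)
            .collect()
    }
}
```

With this shape, schemas that are only passed along the plan never pay for an index, and the sorted vector is dropped as soon as name resolution is done.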

crepererum avatar Oct 27 '23 10:10 crepererum

Made a benchmark.

Baseline - DataFusion 32 (a0c5affca271d67980286cb2ae08ea8eec75a326)

index_of_column_by_name 10
                        time:   [11.323 ns 11.325 ns 11.328 ns]
                        change: [-0.0714% +0.3045% +0.6180%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

index_of_column_by_name 20
                        time:   [4.1947 ns 4.1963 ns 4.1981 ns]
                        change: [-2.1038% -1.5880% -1.2714%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

index_of_column_by_name 50
                        time:   [34.841 ns 34.851 ns 34.871 ns]
                        change: [-0.2590% -0.1783% -0.0774%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

index_of_column_by_name 100
                        time:   [88.736 ns 88.927 ns 89.119 ns]
                        change: [+4.6597% +5.0086% +5.3786%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

index_of_column_by_name 500
                        time:   [403.20 ns 403.70 ns 404.29 ns]
                        change: [+1.5771% +1.6483% +1.7326%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  4 (4.00%) high severe

index_of_column_by_name 1000
                        time:   [909.73 ns 910.11 ns 910.48 ns]
                        change: [-2.0626% -1.6648% -1.3588%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

DFSchema::new 10        time:   [328.91 ns 329.14 ns 329.38 ns]
                        change: [-0.8652% -0.8013% -0.7418%] (p = 0.00 < 0.05)
                        Change within noise threshold.

DFSchema::new 20        time:   [725.37 ns 725.93 ns 726.56 ns]
                        change: [+0.4542% +0.5177% +0.5841%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

DFSchema::new 50        time:   [1.6864 µs 1.6892 µs 1.6924 µs]
                        change: [+1.3382% +1.4765% +1.6362%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

DFSchema::new 100       time:   [3.4953 µs 3.4965 µs 3.4978 µs]
                        change: [-3.4655% -3.2889% -3.1317%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

DFSchema::new 500       time:   [23.470 µs 23.477 µs 23.485 µs]
                        change: [-1.8427% -1.7821% -1.7253%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

DFSchema::new 1000      time:   [45.504 µs 45.515 µs 45.528 µs]
                        change: [-2.8088% -2.6555% -2.4933%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

cargo bench  172.06s user 0.50s system 153% cpu 1:52.07 total

This PR

index_of_column_by_name 10
                        time:   [33.607 ns 33.663 ns 33.717 ns]
                        change: [+196.44% +196.92% +197.41%] (p = 0.00 < 0.05)
                        Performance has regressed.

index_of_column_by_name 20
                        time:   [21.509 ns 21.522 ns 21.535 ns]
                        change: [+412.46% +412.90% +413.42%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

index_of_column_by_name 50
                        time:   [43.590 ns 43.651 ns 43.713 ns]
                        change: [+24.956% +25.143% +25.325%] (p = 0.00 < 0.05)
                        Performance has regressed.

index_of_column_by_name 100
                        time:   [68.349 ns 68.373 ns 68.401 ns]
                        change: [-23.444% -23.221% -22.998%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

index_of_column_by_name 500
                        time:   [65.428 ns 65.444 ns 65.461 ns]
                        change: [-83.785% -83.768% -83.752%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

index_of_column_by_name 1000
                        time:   [74.167 ns 74.174 ns 74.183 ns]
                        change: [-91.855% -91.850% -91.844%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

DFSchema::new 10        time:   [956.63 ns 957.20 ns 957.81 ns]
                        change: [+190.77% +191.00% +191.28%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

DFSchema::new 20        time:   [2.4375 µs 2.4384 µs 2.4393 µs]
                        change: [+235.82% +236.06% +236.36%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

DFSchema::new 50        time:   [6.5247 µs 6.5275 µs 6.5303 µs]
                        change: [+287.52% +288.07% +288.63%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

DFSchema::new 100       time:   [15.298 µs 15.330 µs 15.368 µs]
                        change: [+337.14% +340.86% +347.06%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe

DFSchema::new 500       time:   [92.211 µs 92.284 µs 92.361 µs]
                        change: [+292.82% +293.14% +293.47%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

DFSchema::new 1000      time:   [204.70 µs 204.87 µs 205.05 µs]
                        change: [+349.22% +349.78% +350.32%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

cargo bench  252.05s user 1.60s system 150% cpu 2:48.82 total

oleggator avatar Oct 27 '23 12:10 oleggator

Could you please add a summary?

It seems that the b-tree provides an advantage with 100+ columns.
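As a rough summary, the middle (point) estimates from the two benchmark runs above can be turned into PR/baseline ratios. The numbers below are copied from that output and converted to nanoseconds; the constant and function names are made up for this sketch.

```rust
/// (columns, baseline ns, PR ns) for index_of_column_by_name, middle estimates from the thread.
pub const LOOKUP_NS: [(usize, f64, f64); 6] = [
    (10, 11.325, 33.663),
    (20, 4.1963, 21.522),
    (50, 34.851, 43.651),
    (100, 88.927, 68.373),
    (500, 403.70, 65.444),
    (1000, 910.11, 74.174),
];

/// (columns, baseline ns, PR ns) for DFSchema::new, with the µs values converted to ns.
pub const NEW_NS: [(usize, f64, f64); 6] = [
    (10, 329.14, 957.20),
    (20, 725.93, 2438.4),
    (50, 1689.2, 6527.5),
    (100, 3496.5, 15330.0),
    (500, 23477.0, 92284.0),
    (1000, 45515.0, 204870.0),
];

/// PR time divided by baseline time; values below 1.0 mean the PR is faster.
pub fn ratios(rows: &[(usize, f64, f64)]) -> Vec<(usize, f64)> {
    rows.iter().map(|&(cols, base, pr)| (cols, pr / base)).collect()
}

/// Smallest benchmarked column count at which the PR is faster, if any.
pub fn first_win(rows: &[(usize, f64, f64)]) -> Option<usize> {
    ratios(rows)
        .into_iter()
        .find(|&(_, r)| r < 1.0)
        .map(|(cols, _)| cols)
}
```

On these numbers, lookups are roughly 3x slower at 10 columns and roughly 12x faster at 1000 columns, with the crossover at the 100-column case. `DFSchema::new` is roughly 2.9x to 4.5x slower at every benchmarked size.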

karlovnv avatar Oct 31 '23 19:10 karlovnv

Thank you -- I plan to review this more carefully tomorrow

alamb avatar Oct 31 '23 22:10 alamb

Thank you -- I plan to review this more carefully tomorrow

@alamb I think it's a good idea to introduce a user-defined cache provider for both DFSchema and the Arrow Schema. It would let us benefit from the b-tree while avoiding building it when it is not necessary. My assumption is that the user knows when a schema becomes invalid and can manage its invalidation from the cache.
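A minimal sketch of what such a cache provider might look like, purely to illustrate the idea. Everything here is hypothetical: neither DataFusion nor Arrow exposes an `IndexCacheProvider` trait, and the `schema_id` key is an assumption about how the user would identify a schema.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, Mutex};

/// Field name to field position.
pub type NameIndex = BTreeMap<String, usize>;

/// Hypothetical user-supplied cache of per-schema indexes (not an existing API).
pub trait IndexCacheProvider {
    /// Returns the cached index for `schema_id`, building it from `names` on first use.
    fn get_or_build(&self, schema_id: u64, names: &[String]) -> Arc<NameIndex>;
    /// Drops the cached index; the user calls this when the schema becomes invalid.
    fn invalidate(&self, schema_id: u64);
}

/// Simple in-memory provider guarded by a mutex.
#[derive(Default)]
pub struct InMemoryIndexCache {
    entries: Mutex<HashMap<u64, Arc<NameIndex>>>,
}

impl IndexCacheProvider for InMemoryIndexCache {
    fn get_or_build(&self, schema_id: u64, names: &[String]) -> Arc<NameIndex> {
        let mut entries = self.entries.lock().unwrap();
        let entry = entries.entry(schema_id).or_insert_with(|| {
            // Built only on a cache miss; later calls reuse the shared index.
            let index: NameIndex = names
                .iter()
                .cloned()
                .enumerate()
                .map(|(i, n)| (n, i))
                .collect();
            Arc::new(index)
        });
        entry.clone()
    }

    fn invalidate(&self, schema_id: u64) {
        self.entries.lock().unwrap().remove(&schema_id);
    }
}
```

In this sketch the b-tree is built at most once per schema identifier and only when a lookup actually asks for it, which matches the stated assumption that the user controls invalidation.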

karlovnv avatar Nov 03 '23 13:11 karlovnv

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Apr 25 '24 01:04 github-actions[bot]