datafusion
Use btree to search fields in DFSchema
Which issue does this PR close?
Part of #7698.
Rationale for this change
The current DFSchema implementation stores its fields in a vector, so searching for a column by name is algorithmically expensive.
What changes are included in this PR?
Use BTreeMap to index field qualifiers.
Are these changes tested?
Are there any user-facing changes?
No
Is there a reason to use a b-tree ( $\mathrm{O}(\log{n})$ ) vs a hash map ( $\mathrm{O}(1)$ )?
I plan to review this and related PRs tomorrow morning
Related comment: https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1781787244
Is there a reason to use a b-tree ( O(logn) ) vs a hash map ( O(1) )?
Using a b-tree, we can query all fields matching a "prefix" in one O(log n) hop (column.*.*.*, column.table.*.*, column.table.schema.*, column.table.schema.catalog). This is used in the fields_with_unqualified_name method to query all fields with a specific name.
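To make the prefix argument concrete, here is a minimal sketch of an ordered index queried by unqualified name. It is not the code from this PR: the `FieldIndex` type, the two-level key (column name, then optional table qualifier), and both function names are assumptions chosen for illustration. The lookup seeks to the first matching key in O(log n) and then walks the k adjacent matches.

```rust
use std::collections::BTreeMap;

/// Hypothetical index: (column name, optional table qualifier) -> field position.
/// The column name comes first so that all qualifiers of one name are adjacent.
type FieldIndex = BTreeMap<(String, Option<String>), usize>;

/// Build the index from (qualifier, name) pairs, in schema order.
fn build_index(fields: &[(Option<&str>, &str)]) -> FieldIndex {
    fields
        .iter()
        .enumerate()
        .map(|(i, (qualifier, name))| ((name.to_string(), qualifier.map(str::to_string)), i))
        .collect()
}

/// Return the positions of every field called `name`, whatever its qualifier.
fn positions_with_unqualified_name(index: &FieldIndex, name: &str) -> Vec<usize> {
    // `None` sorts before any `Some(_)`, so this is the smallest key for `name`.
    let start: (String, Option<String>) = (name.to_string(), None);
    index
        .range(start..)
        .take_while(|((n, _), _)| n.as_str() == name)
        .map(|(_, &pos)| pos)
        .collect()
}

fn main() {
    let fields = [
        (Some("t1"), "id"),
        (Some("t2"), "id"),
        (Some("t1"), "name"),
        (None, "id"),
    ];
    let index = build_index(&fields);
    // Results come back in key order, not in schema order.
    println!("{:?}", positions_with_unqualified_name(&index, "id"));
}
```

A hash map keyed on the full qualified name could not answer this query without either scanning every entry or maintaining a second map keyed on the bare column name.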
Is that such a common operation that it is worth keeping an expensive index on every single schema in the query graph? I think the planner that resolves these names can easily order the fields and build this index locally.
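One way to read the "build this index locally" suggestion is a sorted copy of the field names that only the name-resolving code owns, searched by binary search. The sketch below is an assumption about what that could look like, not an existing planner API; `build_sorted` and `positions` are hypothetical names.

```rust
/// Build a local lookup table of (name, original position), sorted by name.
/// Costs O(n log n) once, paid only by the code that actually resolves names.
fn build_sorted(fields: &[&str]) -> Vec<(String, usize)> {
    let mut sorted: Vec<(String, usize)> = fields
        .iter()
        .enumerate()
        .map(|(i, name)| (name.to_string(), i))
        .collect();
    sorted.sort();
    sorted
}

/// Find every original position whose field is called `name`.
fn positions(sorted: &[(String, usize)], name: &str) -> Vec<usize> {
    // Binary search for the first entry that is not less than `name`.
    let start = sorted.partition_point(|(n, _)| n.as_str() < name);
    sorted[start..]
        .iter()
        .take_while(|(n, _)| n.as_str() == name)
        .map(|&(_, pos)| pos)
        .collect()
}

fn main() {
    let sorted = build_sorted(&["b", "a", "b"]);
    println!("{:?}", positions(&sorted, "b"));
}
```

With this arrangement, schemas that are never searched by name pay nothing, which is the trade-off the question above is pointing at.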
Made a benchmark.
Baseline - DataFusion 32 (a0c5affca271d67980286cb2ae08ea8eec75a326)
index_of_column_by_name 10
time: [11.323 ns 11.325 ns 11.328 ns]
change: [-0.0714% +0.3045% +0.6180%] (p = 0.09 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
index_of_column_by_name 20
time: [4.1947 ns 4.1963 ns 4.1981 ns]
change: [-2.1038% -1.5880% -1.2714%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
index_of_column_by_name 50
time: [34.841 ns 34.851 ns 34.871 ns]
change: [-0.2590% -0.1783% -0.0774%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) low severe
4 (4.00%) low mild
5 (5.00%) high mild
3 (3.00%) high severe
index_of_column_by_name 100
time: [88.736 ns 88.927 ns 89.119 ns]
change: [+4.6597% +5.0086% +5.3786%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
4 (4.00%) high mild
index_of_column_by_name 500
time: [403.20 ns 403.70 ns 404.29 ns]
change: [+1.5771% +1.6483% +1.7326%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low severe
3 (3.00%) low mild
4 (4.00%) high severe
index_of_column_by_name 1000
time: [909.73 ns 910.11 ns 910.48 ns]
change: [-2.0626% -1.6648% -1.3588%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
DFSchema::new 10 time: [328.91 ns 329.14 ns 329.38 ns]
change: [-0.8652% -0.8013% -0.7418%] (p = 0.00 < 0.05)
Change within noise threshold.
DFSchema::new 20 time: [725.37 ns 725.93 ns 726.56 ns]
change: [+0.4542% +0.5177% +0.5841%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
DFSchema::new 50 time: [1.6864 µs 1.6892 µs 1.6924 µs]
change: [+1.3382% +1.4765% +1.6362%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
DFSchema::new 100 time: [3.4953 µs 3.4965 µs 3.4978 µs]
change: [-3.4655% -3.2889% -3.1317%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low severe
1 (1.00%) high mild
2 (2.00%) high severe
DFSchema::new 500 time: [23.470 µs 23.477 µs 23.485 µs]
change: [-1.8427% -1.7821% -1.7253%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
DFSchema::new 1000 time: [45.504 µs 45.515 µs 45.528 µs]
change: [-2.8088% -2.6555% -2.4933%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
cargo bench 172.06s user 0.50s system 153% cpu 1:52.07 total
This PR
index_of_column_by_name 10
time: [33.607 ns 33.663 ns 33.717 ns]
change: [+196.44% +196.92% +197.41%] (p = 0.00 < 0.05)
Performance has regressed.
index_of_column_by_name 20
time: [21.509 ns 21.522 ns 21.535 ns]
change: [+412.46% +412.90% +413.42%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
index_of_column_by_name 50
time: [43.590 ns 43.651 ns 43.713 ns]
change: [+24.956% +25.143% +25.325%] (p = 0.00 < 0.05)
Performance has regressed.
index_of_column_by_name 100
time: [68.349 ns 68.373 ns 68.401 ns]
change: [-23.444% -23.221% -22.998%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low severe
2 (2.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
index_of_column_by_name 500
time: [65.428 ns 65.444 ns 65.461 ns]
change: [-83.785% -83.768% -83.752%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low severe
1 (1.00%) low mild
4 (4.00%) high mild
3 (3.00%) high severe
index_of_column_by_name 1000
time: [74.167 ns 74.174 ns 74.183 ns]
change: [-91.855% -91.850% -91.844%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low severe
1 (1.00%) low mild
3 (3.00%) high mild
3 (3.00%) high severe
DFSchema::new 10 time: [956.63 ns 957.20 ns 957.81 ns]
change: [+190.77% +191.00% +191.28%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
DFSchema::new 20 time: [2.4375 µs 2.4384 µs 2.4393 µs]
change: [+235.82% +236.06% +236.36%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
DFSchema::new 50 time: [6.5247 µs 6.5275 µs 6.5303 µs]
change: [+287.52% +288.07% +288.63%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
DFSchema::new 100 time: [15.298 µs 15.330 µs 15.368 µs]
change: [+337.14% +340.86% +347.06%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
4 (4.00%) low mild
6 (6.00%) high mild
5 (5.00%) high severe
DFSchema::new 500 time: [92.211 µs 92.284 µs 92.361 µs]
change: [+292.82% +293.14% +293.47%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) low mild
DFSchema::new 1000 time: [204.70 µs 204.87 µs 205.05 µs]
change: [+349.22% +349.78% +350.32%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
cargo bench 252.05s user 1.60s system 150% cpu 2:48.82 total
Could you please add a summary?
It seems that the b-tree provides an advantage with 100+ columns.
Thank you -- I plan to review this more carefully tomorrow
@alamb I think it's a good idea to introduce a user-defined cache provider for both DFSchema and the arrow Schema. It would let us benefit from the b-tree while avoiding building it when it is not necessary. My assumption is that the user knows when a schema becomes invalid and can manage its invalidation in the cache.
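A rough sketch of what such a cache provider might look like follows. Nothing here exists in DataFusion or Arrow: the `IndexCache` trait, the `InMemoryCache` type, and the use of a `u64` schema identifier are all assumptions made for illustration. The index is built only on first lookup, and the caller is responsible for invalidating it.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, Mutex};

/// Hypothetical name -> position index for one schema.
type NameIndex = BTreeMap<String, usize>;

/// Hypothetical user-supplied cache; not a DataFusion or Arrow API.
trait IndexCache {
    /// Return the cached index for `schema_id`, building it on first use.
    fn get_or_build(&self, schema_id: u64, fields: &[String]) -> Arc<NameIndex>;
    /// Drop the cached index once the caller knows the schema has changed.
    fn invalidate(&self, schema_id: u64);
}

/// Simple in-memory implementation guarded by a mutex.
#[derive(Default)]
struct InMemoryCache {
    entries: Mutex<HashMap<u64, Arc<NameIndex>>>,
}

impl IndexCache for InMemoryCache {
    fn get_or_build(&self, schema_id: u64, fields: &[String]) -> Arc<NameIndex> {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(schema_id)
            .or_insert_with(|| {
                // Built lazily: schemas that are never searched never pay this cost.
                Arc::new(fields.iter().cloned().enumerate().map(|(i, n)| (n, i)).collect())
            })
            .clone()
    }

    fn invalidate(&self, schema_id: u64) {
        self.entries.lock().unwrap().remove(&schema_id);
    }
}

fn main() {
    let cache = InMemoryCache::default();
    let fields = vec!["id".to_string(), "name".to_string()];
    let index = cache.get_or_build(1, &fields);
    println!("{:?}", index.get("name"));
    cache.invalidate(1);
}
```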
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.