datafusion
datafusion copied to clipboard
Introduce `binary_as_string` parquet option
Which issue does this PR close?
Closes #12788 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Submitted https://github.com/apache/arrow-rs/pull/6539 for it.
@goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for https://github.com/apache/arrow-rs/pull/6539 ?
Then we could get this PR ready to go (and I could use it to test with string view on by default)
@goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for apache/arrow-rs#6539 ?
Then we could get this PR ready to go (and I could use it to test with string view on by default)
Sure. feel free to do it. Thanks!
@goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for apache/arrow-rs#6539 ? Then we could get this PR ready to go (and I could use it to test with string view on by default)
Sure. feel free to do it. Thanks!
I pushed the change in https://github.com/apache/datafusion/pull/12816/commits/d7c35658d3f5cb9606e4952b79f13c12c766a41b
I am about out of time to work on this today, but if no one else gets a chance to do this I'll try and polish this PR up tomorrow
I am about out of time to work on this today, but if no one else gets a chance to do this I'll try and polish this PR up tomorrow
I would be able to help finish the remaining work before I sleep today. (Ensure this PR works well)
@alamb I have confirmed this feature works well and added some tests for it. Only some concerns about https://github.com/apache/datafusion/pull/12816#discussion_r1795799303. You can feel free to push any changes if needed.
I am starting to play around with this PR / write some tests. Will post my updates shortly
I am running some benchmarks on this PR
Here is the performance of this PR. Some queries are slower, some are faster.
I believe once we turn on string view everything will be faster.
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ feature_12788-binary-as-string-… ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2.26ms │ 2.22ms │ no change │
│ QQuery 1 │ 38.78ms │ 38.16ms │ no change │
│ QQuery 2 │ 92.96ms │ 94.66ms │ no change │
│ QQuery 3 │ 98.39ms │ 100.72ms │ no change │
│ QQuery 4 │ 928.83ms │ 929.22ms │ no change │
│ QQuery 5 │ 973.20ms │ 988.39ms │ no change │
│ QQuery 6 │ 33.75ms │ 36.60ms │ 1.08x slower │
│ QQuery 7 │ 42.31ms │ 42.33ms │ no change │
│ QQuery 8 │ 1383.69ms │ 1364.35ms │ no change │
│ QQuery 9 │ 1323.14ms │ 1361.55ms │ no change │
│ QQuery 10 │ 351.92ms │ 413.62ms │ 1.18x slower │
│ QQuery 11 │ 400.63ms │ 462.88ms │ 1.16x slower │
│ QQuery 12 │ 1095.05ms │ 1101.34ms │ no change │
│ QQuery 13 │ 1753.33ms │ 1676.37ms │ no change │
│ QQuery 14 │ 1220.66ms │ 1253.31ms │ no change │
│ QQuery 15 │ 1099.67ms │ 1091.86ms │ no change │
│ QQuery 16 │ 2530.60ms │ 2536.06ms │ no change │
│ QQuery 17 │ 2299.37ms │ 2317.53ms │ no change │
│ QQuery 18 │ 5042.06ms │ 4983.43ms │ no change │
│ QQuery 19 │ 94.19ms │ 95.74ms │ no change │
│ QQuery 20 │ 1720.68ms │ 1493.36ms │ +1.15x faster │
│ QQuery 21 │ 2074.60ms │ 1824.64ms │ +1.14x faster │
│ QQuery 22 │ 5257.20ms │ 3143.86ms │ +1.67x faster │
│ QQuery 23 │ 10530.69ms │ 10229.16ms │ no change │
│ QQuery 24 │ 590.29ms │ 654.78ms │ 1.11x slower │
│ QQuery 25 │ 489.54ms │ 523.81ms │ 1.07x slower │
│ QQuery 26 │ 653.48ms │ 714.03ms │ 1.09x slower │
│ QQuery 27 │ 2585.81ms │ 2285.59ms │ +1.13x faster │
│ QQuery 28 │ 15372.17ms │ 15562.00ms │ no change │
│ QQuery 29 │ 530.70ms │ 529.82ms │ no change │
│ QQuery 30 │ 1031.82ms │ 1094.30ms │ 1.06x slower │
│ QQuery 31 │ 1121.86ms │ 1139.77ms │ no change │
│ QQuery 32 │ 4358.25ms │ 4290.06ms │ no change │
│ QQuery 33 │ 5154.15ms │ 5209.55ms │ no change │
│ QQuery 34 │ 5133.94ms │ 5172.32ms │ no change │
│ QQuery 35 │ 1947.97ms │ 1895.07ms │ no change │
│ QQuery 36 │ 270.40ms │ 262.39ms │ no change │
│ QQuery 37 │ 121.52ms │ 126.73ms │ no change │
│ QQuery 38 │ 143.79ms │ 141.60ms │ no change │
│ QQuery 39 │ 758.06ms │ 767.78ms │ no change │
│ QQuery 40 │ 52.74ms │ 56.09ms │ 1.06x slower │
│ QQuery 41 │ 48.60ms │ 50.59ms │ no change │
│ QQuery 42 │ 64.42ms │ 63.09ms │ no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘
I reabased / squashed all the code in this branch so it would be easier to pull in to test in https://github.com/apache/datafusion/pull/12092
Here is the performance of this PR. Some queries are slower, some are faster.
I believe once we turn on string view everything will be faster.
Thanks @alamb It's interesting 🤔 Does this benchmark only include the change made by this PR, or does it include others? It seems there are many queries slowed down by this PR.
Before this PR, the casting flow is
Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow)
Now, it's
Binary(paruqet) -> StringView(arrow)
Theoretically, we save the two steps (including the most expensive ones) for it. I have no idea why they would be slower. I might try to do some profiling for the slower cases 🤔
Here is the performance of this PR. Some queries are slower, some are faster. I believe once we turn on string view everything will be faster.
Thanks @alamb It's interesting 🤔 Does this benchmark only include the change made by this PR, or does it include others? It seems there are many queries slowed down by this PR.
It only includes changes made by this PR
The results with several other changes are here: https://github.com/apache/datafusion/pull/12092#issuecomment-2410952786 (and they are all faster 🎉 )
Before this PR, the casting flow is
Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow)Now, it's
Binary(paruqet) -> StringView(arrow)Theoretically, we save the two steps (including the most expensive ones) for it. I have no idea why they would be slower. I might try to do some profiling for the slower cases 🤔
I think the reason it is slower is that there are some operations in the hash grouping code that have specializations for StringArray/BinaryArray but do not (yet) have specializations for StringView. Specifically
- https://github.com/apache/datafusion/pull/12792
- https://github.com/apache/datafusion/pull/12809 from @Rachelint
So while this PR makes the scan faster, the total time is slower as those paths dominated the query path.
When they are all put together we get the speedup we have been looking for
Since this PR requires a change to arrow-rs, I think there is no particular rush to merge it in -- I have a few thoughts about how to make the code a bit simpler and hope to propose some changes over the next few days
I am starting to get this PR ready
Update here is we are on track to release a version of arrow with the required fixes today and then I will merge this PR up and get it ready for review ⏲️
Thanks, @alamb. I have one question about the test. Others look good to me.
Thanks @goldmedal -- that was an excellent catch. I have fixed the issue