datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Introduce `binary_as_string` parquet option

Open goldmedal opened this issue 1 year ago • 13 comments

Which issue does this PR close?

Closes #12788 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

goldmedal avatar Oct 08 '24 15:10 goldmedal

Submitted https://github.com/apache/arrow-rs/pull/6539 for it.

@goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for https://github.com/apache/arrow-rs/pull/6539 ?

Then we could get this PR ready to go (and I could use it to test with string view on by default)

alamb avatar Oct 10 '24 12:10 alamb

@goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for apache/arrow-rs#6539 ?

Then we could get this PR ready to go (and I could use it to test with string view on by default)

Sure. feel free to do it. Thanks!

goldmedal avatar Oct 10 '24 12:10 goldmedal

@goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for apache/arrow-rs#6539 ? Then we could get this PR ready to go (and I could use it to test with string view on by default)

Sure. feel free to do it. Thanks!

I pushed the change in https://github.com/apache/datafusion/pull/12816/commits/d7c35658d3f5cb9606e4952b79f13c12c766a41b

I am about out of time to work on this today, but if no one else gets a chance to do this I'll try and polish this PR up tomorrow

alamb avatar Oct 10 '24 13:10 alamb

I am about out of time to work on this today, but if no one else gets a chance to do this I'll try and polish this PR up tomorrow

I would be able to help finish the remaining work before I sleep today. (Ensure this PR works well)

goldmedal avatar Oct 10 '24 14:10 goldmedal

@alamb I have confirmed this feature works well and added some tests for it. Only some concerns about https://github.com/apache/datafusion/pull/12816#discussion_r1795799303. You can feel free to push any changes if needed.

goldmedal avatar Oct 10 '24 17:10 goldmedal

I am starting to play around with this PR / write some tests. Will post my updates shortly

alamb avatar Oct 11 '24 14:10 alamb

I am running some benchmarks on this PR

alamb avatar Oct 11 '24 16:10 alamb

Here is the performance of this PR. Some queries are slower, some are faster.

I believe once we turn on string view everything will be faster.

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ feature_12788-binary-as-string-… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.26ms │                           2.22ms │     no change │
│ QQuery 1     │    38.78ms │                          38.16ms │     no change │
│ QQuery 2     │    92.96ms │                          94.66ms │     no change │
│ QQuery 3     │    98.39ms │                         100.72ms │     no change │
│ QQuery 4     │   928.83ms │                         929.22ms │     no change │
│ QQuery 5     │   973.20ms │                         988.39ms │     no change │
│ QQuery 6     │    33.75ms │                          36.60ms │  1.08x slower │
│ QQuery 7     │    42.31ms │                          42.33ms │     no change │
│ QQuery 8     │  1383.69ms │                        1364.35ms │     no change │
│ QQuery 9     │  1323.14ms │                        1361.55ms │     no change │
│ QQuery 10    │   351.92ms │                         413.62ms │  1.18x slower │
│ QQuery 11    │   400.63ms │                         462.88ms │  1.16x slower │
│ QQuery 12    │  1095.05ms │                        1101.34ms │     no change │
│ QQuery 13    │  1753.33ms │                        1676.37ms │     no change │
│ QQuery 14    │  1220.66ms │                        1253.31ms │     no change │
│ QQuery 15    │  1099.67ms │                        1091.86ms │     no change │
│ QQuery 16    │  2530.60ms │                        2536.06ms │     no change │
│ QQuery 17    │  2299.37ms │                        2317.53ms │     no change │
│ QQuery 18    │  5042.06ms │                        4983.43ms │     no change │
│ QQuery 19    │    94.19ms │                          95.74ms │     no change │
│ QQuery 20    │  1720.68ms │                        1493.36ms │ +1.15x faster │
│ QQuery 21    │  2074.60ms │                        1824.64ms │ +1.14x faster │
│ QQuery 22    │  5257.20ms │                        3143.86ms │ +1.67x faster │
│ QQuery 23    │ 10530.69ms │                       10229.16ms │     no change │
│ QQuery 24    │   590.29ms │                         654.78ms │  1.11x slower │
│ QQuery 25    │   489.54ms │                         523.81ms │  1.07x slower │
│ QQuery 26    │   653.48ms │                         714.03ms │  1.09x slower │
│ QQuery 27    │  2585.81ms │                        2285.59ms │ +1.13x faster │
│ QQuery 28    │ 15372.17ms │                       15562.00ms │     no change │
│ QQuery 29    │   530.70ms │                         529.82ms │     no change │
│ QQuery 30    │  1031.82ms │                        1094.30ms │  1.06x slower │
│ QQuery 31    │  1121.86ms │                        1139.77ms │     no change │
│ QQuery 32    │  4358.25ms │                        4290.06ms │     no change │
│ QQuery 33    │  5154.15ms │                        5209.55ms │     no change │
│ QQuery 34    │  5133.94ms │                        5172.32ms │     no change │
│ QQuery 35    │  1947.97ms │                        1895.07ms │     no change │
│ QQuery 36    │   270.40ms │                         262.39ms │     no change │
│ QQuery 37    │   121.52ms │                         126.73ms │     no change │
│ QQuery 38    │   143.79ms │                         141.60ms │     no change │
│ QQuery 39    │   758.06ms │                         767.78ms │     no change │
│ QQuery 40    │    52.74ms │                          56.09ms │  1.06x slower │
│ QQuery 41    │    48.60ms │                          50.59ms │     no change │
│ QQuery 42    │    64.42ms │                          63.09ms │     no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘

alamb avatar Oct 13 '24 11:10 alamb

I reabased / squashed all the code in this branch so it would be easier to pull in to test in https://github.com/apache/datafusion/pull/12092

alamb avatar Oct 13 '24 12:10 alamb

Here is the performance of this PR. Some queries are slower, some are faster.

I believe once we turn on string view everything will be faster.

Thanks @alamb It's interesting 🤔 Does this benchmark only include the change made by this PR, or does it include others? It seems there are many queries slowed down by this PR.

Before this PR, the casting flow is

Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow)

Now, it's

Binary(paruqet) -> StringView(arrow)

Theoretically, we save the two steps (including the most expensive ones) for it. I have no idea why they would be slower. I might try to do some profiling for the slower cases 🤔

goldmedal avatar Oct 14 '24 02:10 goldmedal

Here is the performance of this PR. Some queries are slower, some are faster. I believe once we turn on string view everything will be faster.

Thanks @alamb It's interesting 🤔 Does this benchmark only include the change made by this PR, or does it include others? It seems there are many queries slowed down by this PR.

It only includes changes made by this PR

The results with several other changes are here: https://github.com/apache/datafusion/pull/12092#issuecomment-2410952786 (and they are all faster 🎉 )

Before this PR, the casting flow is

Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow)

Now, it's

Binary(paruqet) -> StringView(arrow)

Theoretically, we save the two steps (including the most expensive ones) for it. I have no idea why they would be slower. I might try to do some profiling for the slower cases 🤔

I think the reason it is slower is that there are some operations in the hash grouping code that have specializations for StringArray/BinaryArray but do not (yet) have specializations for StringView. Specifically

  • https://github.com/apache/datafusion/pull/12792
  • https://github.com/apache/datafusion/pull/12809 from @Rachelint

So while this PR makes the scan faster, the total time is slower as those paths dominated the query path.

When they are all put together we get the speedup we have been looking for

alamb avatar Oct 14 '24 12:10 alamb

Since this PR requires a change to arrow-rs, I think there is no particular rush to merge it in -- I have a few thoughts about how to make the code a bit simpler and hope to propose some changes over the next few days

alamb avatar Oct 14 '24 12:10 alamb

I am starting to get this PR ready

alamb avatar Oct 16 '24 19:10 alamb

Update here is we are on track to release a version of arrow with the required fixes today and then I will merge this PR up and get it ready for review ⏲️

alamb avatar Oct 24 '24 11:10 alamb

Thanks, @alamb. I have one question about the test. Others look good to me.

Thanks @goldmedal -- that was an excellent catch. I have fixed the issue

alamb avatar Oct 25 '24 13:10 alamb