Enable reading StringView from Parquet by default (`schema_force_string_view`)
Part of https://github.com/apache/datafusion/issues/11752
Is your feature request related to a problem or challenge?
As part of https://github.com/apache/datafusion/issues/10918, @XiangpengHao has threaded the use of StringView through parquet, arrow-rs and then into DataFusion
When the datafusion.execution.parquet.schema_force_string_view option is enabled, the DataFusion Parquet reader will read all Utf8 columns as StringView instead, which results in significantly faster performance (details TBD but we will write it down in https://github.com/apache/datafusion/issues/11603 )
However, as initially merged in https://github.com/apache/datafusion/pull/11667, this setting is off by default
This ticket tracks what it would take to turn the setting on by default
Describe the solution you'd like
Change the default value of datafusion.execution.parquet.schema_force_string_view to true
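Until the default changes, the option can be turned on per session. The following is a minimal sketch rather than a definitive recipe: it assumes the `datafusion` crate is available as a dependency and that the installed version still recognizes this exact option key (setting an unknown key is expected to fail at runtime).

```rust
// Assumption: the `datafusion` crate is listed as a dependency of this program.
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // The option key is the one named in this issue; it is off by default
    // as initially merged, so it has to be enabled explicitly.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.schema_force_string_view", true);

    // Parquet tables read through this context should then expose
    // Utf8 columns as StringView instead.
    let _ctx = SessionContext::new_with_config(config);
}
```

Scoping the flag to one `SessionContext` makes it straightforward to compare the same queries with and without StringView before the default is flipped.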
Describe alternatives you've considered
Basically we should enable the flag by default and then run some benchmarks to ensure performance doesn't change by too much
Additional context
No response
Want to share my numbers here:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   Baseline ┃ StringView ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.40ms │     0.41ms │     no change │
│ QQuery 1     │    46.60ms │    42.69ms │ +1.09x faster │
│ QQuery 2     │    76.16ms │    78.07ms │     no change │
│ QQuery 3     │    87.25ms │    85.44ms │     no change │
│ QQuery 4     │   774.81ms │   770.28ms │     no change │
│ QQuery 5     │   888.38ms │   916.04ms │     no change │
│ QQuery 6     │    41.07ms │    40.60ms │     no change │
│ QQuery 7     │    44.55ms │    44.30ms │     no change │
│ QQuery 8     │  1229.92ms │  1220.85ms │     no change │
│ QQuery 9     │   891.96ms │   873.84ms │     no change │
│ QQuery 10    │   490.90ms │   220.19ms │ +2.23x faster │
│ QQuery 11    │   513.23ms │   241.88ms │ +2.12x faster │
│ QQuery 12    │  1130.10ms │   950.93ms │ +1.19x faster │
│ QQuery 13    │  2371.24ms │  2204.60ms │ +1.08x faster │
│ QQuery 14    │  1499.27ms │  1377.36ms │ +1.09x faster │
│ QQuery 15    │   888.89ms │   878.98ms │     no change │
│ QQuery 16    │  2602.96ms │  2638.78ms │     no change │
│ QQuery 17    │  2515.57ms │  2580.58ms │     no change │
│ QQuery 18    │  5577.86ms │  5814.67ms │     no change │
│ QQuery 19    │    76.79ms │    77.22ms │     no change │
│ QQuery 20    │  1133.65ms │   850.76ms │ +1.33x faster │
│ QQuery 21    │  1532.25ms │  1049.88ms │ +1.46x faster │
│ QQuery 22    │  3490.42ms │  2880.90ms │ +1.21x faster │
│ QQuery 23    │ 10056.49ms │  9152.26ms │ +1.10x faster │
│ QQuery 24    │   649.17ms │   494.05ms │ +1.31x faster │
│ QQuery 25    │   567.48ms │   449.79ms │ +1.26x faster │
│ QQuery 26    │   690.33ms │   555.21ms │ +1.24x faster │
│ QQuery 27    │  1771.53ms │  1526.66ms │ +1.16x faster │
│ QQuery 28    │  9406.74ms │  8802.03ms │ +1.07x faster │
│ QQuery 29    │   353.43ms │   362.44ms │     no change │
│ QQuery 30    │  1186.41ms │  1067.44ms │ +1.11x faster │
│ QQuery 31    │  1617.60ms │  1515.93ms │ +1.07x faster │
│ QQuery 32    │  7992.19ms │  7823.69ms │     no change │
│ QQuery 33    │  4809.44ms │  3374.01ms │ +1.43x faster │
│ QQuery 34    │  4779.28ms │  3405.84ms │ +1.40x faster │
│ QQuery 35    │  1504.81ms │  1505.37ms │     no change │
│ QQuery 36    │   150.45ms │   145.05ms │     no change │
│ QQuery 37    │   112.40ms │    97.54ms │ +1.15x faster │
│ QQuery 38    │   101.89ms │    95.53ms │ +1.07x faster │
│ QQuery 39    │   515.90ms │   455.99ms │ +1.13x faster │
│ QQuery 40    │    52.28ms │    49.23ms │ +1.06x faster │
│ QQuery 41    │    46.79ms │    46.00ms │     no change │
│ QQuery 42    │    52.62ms │    53.40ms │     no change │
└──────────────┴────────────┴────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (Baseline)     │ 74321.46ms │
│ Total Time (StringView)   │ 66816.69ms │
│ Average Time (Baseline)   │  1728.41ms │
│ Average Time (StringView) │  1553.88ms │
│ Queries Faster            │         23 │
│ Queries Slower            │          0 │
│ Queries with No Change    │         20 │
└───────────────────────────┴────────────┘
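For readers who want to check the Change column, here is a small sketch of how those labels can be reproduced from the two timing columns. The 5% tolerance and the wording of the "slower" label are assumptions inferred from the rows above (nothing in the thread states them), and `describe_change` is a hypothetical helper, not part of the benchmark tooling.

```rust
/// Hypothetical helper: label a (baseline, candidate) timing pair in the
/// style of the Change column above.
fn describe_change(baseline_ms: f64, candidate_ms: f64) -> String {
    // ASSUMPTION: differences within 5% count as "no change". This matches
    // every row in the table but is not stated anywhere in the thread.
    const TOLERANCE: f64 = 0.05;

    let speedup = baseline_ms / candidate_ms;
    let slowdown = candidate_ms / baseline_ms;

    if speedup > 1.0 + TOLERANCE {
        format!("+{:.2}x faster", speedup)
    } else if slowdown > 1.0 + TOLERANCE {
        // ASSUMPTION: label format for regressions; no row above is slower.
        format!("{:.2}x slower", slowdown)
    } else {
        "no change".to_string()
    }
}

fn main() {
    // Timings copied from the table: QQuery 10, QQuery 5, QQuery 40.
    println!("{}", describe_change(490.90, 220.19)); // +2.23x faster
    println!("{}", describe_change(888.38, 916.04)); // no change
    println!("{}", describe_change(52.28, 49.23)); // +1.06x faster
}
```

Applied to the two totals in the summary (74321.46ms and 66816.69ms), the same ratio comes out at roughly 1.11x overall.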
take
Update: here are the items I think are blocking us from enabling StringView by default:
- [x] https://github.com/apache/datafusion/issues/6906
- [ ] https://github.com/apache/datafusion/issues/12788
- [ ] https://github.com/apache/datafusion/issues/12771
I am going to try and make https://github.com/apache/datafusion/issues/6906 work now
Update: I have an implementation of https://github.com/apache/datafusion/issues/6906, and thanks to @goldmedal we have an implementation of https://github.com/apache/datafusion/issues/12788 almost ready to test
The final piece I know of is https://github.com/apache/datafusion/issues/12771 and @Rachelint has a good PR https://github.com/apache/datafusion/pull/12809 for that
Update: we have enough of the pieces implemented, thanks to @Rachelint, @goldmedal, and @jayzhan211, so I have hacked them together in a branch and am now running the performance tests to get final end-to-end performance numbers. I think we are (very) close
See https://github.com/apache/datafusion/pull/12092#issuecomment-2408950500 for details
This is basically blocked on the next arrow-rs release https://github.com/apache/arrow-rs/issues/6341 which is blocking https://github.com/apache/datafusion/issues/12788