feat[vortex-dict]: implement ToArrowKernel for DictVTable
cc @robert3005
Thanks for the reviews. I will address comments and CI on Monday.
I'm not sure how best to fix CI. The core issue is that the consistency tests call out to compute kernels that sometimes use ArrayRef::from_arrow (https://github.com/vortex-data/vortex/blob/d578fe02e147ebef2f6e49596a33ca168ec2425c/vortex-array/src/compute/mask.rs#L134), which isn't implemented for dictionary types. We implemented the from-arrow conversion as FromArrowArray on DictArray in vortex-dict and can't define it on ArrayRef due to the orphan rule. I could possibly fix CI by not implementing to_arrow conversions for preferred types, but I would like to keep this behavior so that ultimately the DataFusion VortexSource gives me back dictionary-encoded arrow arrays without requiring two conversions.
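For readers unfamiliar with the constraint: a minimal std-only sketch of the orphan rule. The trait and type names below are hypothetical stand-ins for `FromArrowArray` and `ArrayRef`, not the actual vortex API, and the "external crate" is faked with a module.

```rust
// Pretend `external_crate` is a separate crate (like vortex-array):
// the orphan rule forbids implementing a foreign trait for a foreign
// type in our crate. The usual workaround is a local newtype.
mod external_crate {
    pub trait FromSource {
        fn from_source(s: &str) -> Self;
    }

    #[derive(Debug, PartialEq)]
    pub struct ArrayRefLike(pub String);
}

use external_crate::{ArrayRefLike, FromSource};

// Not allowed if both sides were truly foreign:
// impl FromSource for ArrayRefLike { ... }  // error[E0117]: orphan rule

// Allowed: wrap the foreign type in a local newtype.
struct LocalDict(ArrayRefLike);

impl FromSource for LocalDict {
    fn from_source(s: &str) -> Self {
        LocalDict(ArrayRefLike(s.to_owned()))
    }
}

fn main() {
    let d = LocalDict::from_source("hello");
    assert_eq!(d.0, ArrayRefLike("hello".to_owned()));
}
```

The newtype workaround is why the conversion can live on `DictArray` (a local type in vortex-dict) but not on `ArrayRef` itself.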
coverage: 87.917% (+0.004%) from 87.913% when pulling ae70d827687c5f832afe1fcb80be6ff6c7c916de on asubiotto:asubiotto-dict into d578fe02e147ebef2f6e49596a33ca168ec2425c on vortex-data:develop.
In order for us to return dictionaries by default, we need to move code around so that the arrow conversion doesn't live in the main crate. I will do that today, which should let you fix the CI problems.
I attempted the move and it's a lot more intertwined and involved than I originally assumed. There's some more work required to decouple compute from arrays. In the meantime, to get the tests passing, you would need to default to not returning a dict array if the type is not specified.
Thanks! No problem, this behavior is needed for us so I don't mind waiting to merge this and running off a fork in the meantime.
@robert3005 with the recent IO changes, this change will soon be the only thing requiring us to use a fork of vortex. Is there anything I can do to help with the move to vortex-arrow?
I was halfway through doing this but realised there's a bit of API design change necessary. What we want to do is move compute functions out of the main crate so we don't always have to recompile them and can keep them optional. Arrow is just another compute function, and that crate can take a dependency on other vortex crates. However, the problem we have is that stats are part of the Array trait, and stats are defined using compute functions. Therefore we need to change how stats use compute functions. That likely means there's a trait you implement that can delegate to compute functions or not, but I haven't had time to think this through yet.
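One shape the stats change could take, purely illustrative and not the vortex design (all names below are made up), is a small trait the core crate defines, whose implementations are free to delegate to compute functions living in another crate:

```rust
// Hypothetical sketch: stats are requested through a trait instead of
// the core Array trait calling compute functions directly, so the
// compute crate can be split out of the main crate.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Stat {
    Min,
    Max,
}

trait StatsProvider {
    // Return the stat if this implementation can produce it; the core
    // crate no longer needs to know *how* it is computed.
    fn compute_stat(&self, stat: Stat, values: &[i64]) -> Option<i64>;
}

// An implementation that delegates to "compute" functions defined
// elsewhere (inlined min/max here to keep the sketch self-contained).
struct DelegatingStats;

impl StatsProvider for DelegatingStats {
    fn compute_stat(&self, stat: Stat, values: &[i64]) -> Option<i64> {
        match stat {
            Stat::Min => values.iter().copied().min(),
            Stat::Max => values.iter().copied().max(),
        }
    }
}

fn main() {
    let stats = DelegatingStats;
    let vals = [3, 1, 4, 1, 5];
    assert_eq!(stats.compute_stat(Stat::Min, &vals), Some(1));
    assert_eq!(stats.compute_stat(Stat::Max, &vals), Some(5));
}
```

An array that caches or stores stats could implement the same trait without touching compute at all, which is the "delegate or not" part.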
Actually the aim was to make vortex-arrow its own crate, at which point ToArrow shouldn't be a compute function, and it can just be a regular Rust trait.
Once you split compute from the main crate you can split arrow from compute
CodSpeed Performance Report
Merging #4254 will degrade performance by 44.99%
Comparing asubiotto:asubiotto-dict (d97ff96) with develop (89b236b)
Summary
⚡ 32 improvements
❌ 83 regressions
✅ 1043 untouched
⏩ 617 skipped[^skipped]
:warning: Please fix the performance issues or acknowledge them on CodSpeed.
Benchmarks breakdown
:information_source: Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

[^skipped]: 617 benchmarks were skipped, so the baseline results were used instead.
Making some progress thanks to @onursatici's recent changes 🥳 Seems like there's now a failure on tpch queries probably due to the preferred encoding of string dictionaries resolving to dict<_, utf8view>. I'll take a look when I have a moment:
thread 'main' panicked at /home/runner/_work/vortex/vortex/bench-vortex/src/benchmark_driver.rs:182:33:
query: 1 failed with: Arrow error: Compute error: Internal Error: Cannot cast Utf8View to StringArray of expected type
Backtrace:
0: bench_vortex::benchmark_driver::execute_queries::<bench_vortex::tpch::tpch_benchmark::TpcHBenchmark>::{closure#0}::{closure#0}::{closure#0}::{closure#0}
Codecov Report
:x: Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 84.62%. Comparing base (265c3b6) to head (d97ff96).
:warning: Report is 27 commits behind head on develop.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| vortex-array/src/arrow/compute/to_arrow/dict.rs | 84.61% | 4 Missing :warning: |
This PR is now ready for a review again. Everything is green apart from the benchmarks. The main changes since the initial PR are:
- Coerce dictionary-encoded utf8view values to dict<_, utf8>, since dict<_, utf8view> essentially double-encodes dictionaries and isn't well supported by arrow (this was the TPC-H failure: a missing arrow cast arm).
- In compare/numeric, I force canonicalization when falling back to arrow, which forces dictionary expansion. The benchmark regressions are because of this. I don't think the regressions matter that much, since we're talking about the arrow fallback and the canonicalization was happening anyway for dict-encoded arrays. I might be wrong though; I can add more special-casing to make the benchmarks look good.
cc @onursatici
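To make the regression above concrete, here is a std-only toy model (not vortex code; all names are invented) of why canonicalizing a dictionary before the arrow fallback costs work: expansion materializes one owned value per row, while the encoded form stores each distinct value once.

```rust
// Toy model of dictionary encoding: `keys` index into `values`.
struct Dict {
    keys: Vec<u32>,
    values: Vec<String>,
}

impl Dict {
    // "Canonicalize": expand to one owned String per row, which is what
    // a fallback path that cannot consume dictionaries has to do.
    fn expand(&self) -> Vec<String> {
        self.keys
            .iter()
            .map(|&k| self.values[k as usize].clone())
            .collect()
    }
}

fn main() {
    let dict = Dict {
        keys: vec![0, 1, 0, 0, 1, 0],
        values: vec!["apple".to_owned(), "banana".to_owned()],
    };
    let expanded = dict.expand();
    assert_eq!(expanded.len(), 6); // one allocation per row...
    assert_eq!(dict.values.len(), 2); // ...vs two distinct values
    assert_eq!(expanded[2], "apple");
}
```

The per-row clone is the cost the benchmarks are measuring; special-casing dictionary inputs in the kernels would avoid it, at the price of more code paths.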
Am I right in thinking for your use case you will always be passing an Arrow target DataType?
> Am I right in thinking for your use case you will always be passing an Arrow target DataType?
No. The datafusion VortexOpener converts to the preferred arrow target type. We could plumb it down, but I also don't think it's right for dictionary encoding on disk to be expanded when converting to arrow when not specifying a target type. i.e. dictionary encoding should "prefer" to maintain zero-copy when decoding to arrow.
I didn't think DataFusion handled dictionary / other encodings that weren't == the declared schema? Or am I out of date here?
> I didn't think DataFusion handled dictionary / other encodings that weren't == the declared schema? Or am I out of date here?
You're right, but that's taken care of in the schema adapter step, which applies an arrow cast to bring the file batches to the table schema if there is a difference. So it's correct but could be more performant. I think there are two questions here:
- Should we pass in a target arrow type from datafusion to vortex? Yes, probably. This will solve our use-case and seems like a thing we should generally do. I agree here.
- Should vortex dictionary arrays be able to zero-copy to arrow when no target type is specified? I think they should regardless of datafusion and/or our use-case
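As a rough illustration of what "zero-copy" means in the second point (std-only sketch, not the actual conversion code; the types are stand-ins), the arrow-side dictionary can share the vortex values buffer instead of materializing rows:

```rust
use std::sync::Arc;

// Toy dictionary whose values buffer is reference-counted: a stand-in
// for an arrow DictionaryArray reusing vortex's values without copying.
struct DictArray {
    keys: Vec<u32>,
    values: Arc<Vec<String>>,
}

impl DictArray {
    // Zero-copy "conversion": clone the Arc (bumps a refcount) and the
    // small keys vector, but never touch the per-row string data.
    fn to_arrow_like(&self) -> DictArray {
        DictArray {
            keys: self.keys.clone(),
            values: Arc::clone(&self.values),
        }
    }
}

fn main() {
    let src = DictArray {
        keys: vec![1, 0, 1],
        values: Arc::new(vec!["a".to_owned(), "b".to_owned()]),
    };
    let out = src.to_arrow_like();
    // Both arrays point at the same values allocation.
    assert!(Arc::ptr_eq(&src.values, &out.values));
    assert_eq!(out.keys, vec![1, 0, 1]);
}
```

Expanding instead of sharing would turn this O(distinct values) handoff into O(rows) allocations, which is why preserving dictionary encoding through the conversion is attractive.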
Ahhh, gotcha.
- Yes 100%
- Yes 100%