
feat[vortex-dict]: implement ToArrowKernel for DictVTable

Open · asubiotto opened this issue 4 months ago • 14 comments

asubiotto · Aug 15 '25

cc @robert3005

asubiotto · Aug 15 '25

Thanks for the reviews. I will address comments and CI on Monday.

asubiotto · Aug 17 '25

I'm not sure how best to fix CI. The core issue is that the consistency tests call out to compute kernels that sometimes use ArrayRef::from_arrow (https://github.com/vortex-data/vortex/blob/d578fe02e147ebef2f6e49596a33ca168ec2425c/vortex-array/src/compute/mask.rs#L134), which isn't implemented for dictionary types. We implemented the from-Arrow conversion as FromArrowArray on DictArray in vortex-dict, and we can't define it on an ArrayRef due to the orphan rule. I could possibly fix CI by not implementing to_arrow conversions for preferred types, but I would like to keep this behavior so that ultimately the datafusion VortexSource gives me back dictionary-encoded arrow arrays without requiring two conversions.
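For illustration, a minimal sketch of the orphan-rule constraint being described here; the definitions are simplified, hypothetical stand-ins rather than the real vortex signatures, with the two crates collapsed into one file:

```rust
use std::sync::Arc;

// Pretend these are defined in vortex-array:
pub trait Array {}
pub type ArrayRef = Arc<dyn Array>;
pub trait FromArrowArray: Sized {
    fn from_arrow(array: arrow_array::ArrayRef, nullable: bool) -> Self;
}

// In vortex-dict, implementing the foreign trait for the locally defined
// DictArray is allowed, because DictArray is a local type:
pub struct DictArray;
impl Array for DictArray {}
impl FromArrowArray for DictArray {
    fn from_arrow(_array: arrow_array::ArrayRef, _nullable: bool) -> Self {
        DictArray
    }
}

// In vortex-dict, however, both FromArrowArray and ArrayRef are foreign,
// so the equivalent impl on ArrayRef is rejected with error E0117:
//
// impl FromArrowArray for ArrayRef { ... }
```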

asubiotto · Aug 18 '25

Coverage Status

coverage: 87.917% (+0.004%) from 87.913% when pulling ae70d827687c5f832afe1fcb80be6ff6c7c916de on asubiotto:asubiotto-dict into d578fe02e147ebef2f6e49596a33ca168ec2425c on vortex-data:develop.

coveralls · Aug 18 '25

In order for us to return dictionaries by default, we need to move code around so the Arrow conversion doesn't live in the main crate. I will do that today, which should let you fix the CI problems.

robert3005 · Aug 18 '25

I attempted the move and it's a lot more intertwined/involved than I originally assumed. There's some more work required to decouple compute from arrays. In the meantime, to get the tests passing, you would need to default to not returning a dict array if the type is not specified.

robert3005 · Aug 19 '25

Thanks! No problem; this behavior is needed on our side, so I don't mind waiting to merge this and running off a fork in the meantime.

asubiotto · Aug 19 '25

@robert3005 with the recent IO changes, this change will soon be the only thing requiring us to use a fork of vortex. Is there anything I can do to help with the move to vortex-arrow?

asubiotto · Sep 15 '25

I was halfway through doing this but realised a bit of API design change is necessary. What we want to do is move compute functions out of the main crate so we don't always have to recompile them and can keep them optional. Arrow conversion is just another compute function, and that crate can take a dependency on other vortex crates. However, the problem we have is that stats are part of the Array trait, and stats are defined using compute functions. Therefore we need to change how stats use compute functions; that likely means a trait you implement that can delegate to compute functions (or not), but I haven't had time to think this through yet.
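To make that concrete, a purely hypothetical sketch of the kind of indirection described above; none of these names (StatsComputer, NoopStats, the simplified Stat and Scalar) exist in vortex, they only illustrate stats delegating to compute through a trait:

```rust
/// Simplified stand-ins for vortex_scalar::Scalar and the stats kinds.
pub struct Scalar;
pub enum Stat {
    Min,
    Max,
    NullCount,
}

/// The indirection point: something that may (or may not) know how to run
/// compute kernels. The compute crate would supply a kernel-backed impl;
/// a build without compute can use NoopStats below.
pub trait StatsComputer {
    fn compute(&self, array: &dyn Array, stat: Stat) -> Option<Scalar>;
}

/// The main-crate Array trait no longer calls compute functions directly;
/// stats are resolved through whichever StatsComputer the caller wires in.
pub trait Array {
    fn stat(&self, stat: Stat, stats: &dyn StatsComputer) -> Option<Scalar>;
}

/// Fallback for when the compute crate is compiled out entirely.
pub struct NoopStats;

impl StatsComputer for NoopStats {
    fn compute(&self, _array: &dyn Array, _stat: Stat) -> Option<Scalar> {
        None
    }
}
```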

robert3005 · Sep 15 '25

Actually the aim was to make vortex-arrow its own crate, at which point ToArrow shouldn't be a compute function, and it can just be a regular Rust trait.
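A rough sketch of what that could look like, assuming vortex-arrow becomes its own crate; the trait shape and error type here are guesses for illustration, not the real API:

```rust
use arrow_schema::{ArrowError, DataType};

/// A plain Rust trait rather than a registered compute function. Each encoding
/// crate (e.g. vortex-dict) could implement it directly, so a dictionary array
/// can produce an arrow DictionaryArray without canonicalizing first.
pub trait ToArrow {
    /// Convert to Arrow, optionally honoring a caller-requested target type.
    fn to_arrow(&self, target: Option<&DataType>) -> Result<arrow_array::ArrayRef, ArrowError>;
}
```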

gatesn · Sep 15 '25

Once you split compute from the main crate, you can split arrow from compute.

robert3005 · Sep 15 '25

CodSpeed Performance Report

Merging #4254 will degrade performance by 44.99%

Comparing asubiotto:asubiotto-dict (d97ff96) with develop (89b236b)

Summary

⚡ 32 improvements
❌ 83 regressions
✅ 1043 untouched
⏩ 617 skipped[^skipped]

:warning: Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- |
| decompress_bitpacking_early_filter[i16, 0.0105] | 100.9 µs | 113.8 µs | -11.34% |
| decompress_bitpacking_early_filter[i16, 0.01] | 99.6 µs | 111.7 µs | -10.84% |
| decompress_bitpacking_early_filter[i16, 0.02] | 123.8 µs | 140.6 µs | -11.95% |
| decompress_bitpacking_early_filter[i32, 0.0105] | 107.1 µs | 119.8 µs | -10.61% |
| decompress_bitpacking_early_filter[i32, 0.01] | 105.5 µs | 117.6 µs | -10.3% |
| decompress_bitpacking_early_filter[i32, 0.02] | 133.7 µs | 150 µs | -10.91% |
| decompress_bitpacking_early_filter[i8, 0.0105] | 99.4 µs | 112.8 µs | -11.85% |
| decompress_bitpacking_early_filter[i8, 0.01] | 98.1 µs | 110.8 µs | -11.51% |
| decompress_bitpacking_early_filter[i8, 0.02] | 120.7 µs | 137.8 µs | -12.42% |
| decompress_bitpacking_early_filter[i8, 0.03] | 152 µs | 188.5 µs | -19.37% |
| decompress_bitpacking_early_filter[i8, 0.04] | 168.5 µs | 205.2 µs | -17.88% |
| decompress_bitpacking_early_filter[i8, 0.05] | 185.2 µs | 222 µs | -16.55% |
| decompress_bitpacking_late_filter[i16, 0.0105] | 198.2 µs | 154.9 µs | +27.93% |
| decompress_bitpacking_late_filter[i16, 0.01] | 198.1 µs | 154.8 µs | +27.98% |
| decompress_bitpacking_late_filter[i16, 0.02] | 215.6 µs | 172.2 µs | +25.16% |
| decompress_bitpacking_late_filter[i16, 0.03] | 233.1 µs | 189.7 µs | +22.84% |
| decompress_bitpacking_late_filter[i16, 0.04] | 250.5 µs | 207.1 µs | +20.94% |
| decompress_bitpacking_late_filter[i16, 0.05] | 267.9 µs | 224.4 µs | +19.38% |
| decompress_bitpacking_late_filter[i8, 0.005] | 123.3 µs | 139.4 µs | -11.56% |
| decompress_bitpacking_late_filter[i8, 0.0105] | 132.9 µs | 148.9 µs | -10.79% |
| … | … | … | … |

:information_source: Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

[^skipped]: 617 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

codspeed-hq[bot] · Nov 17 '25

Making some progress thanks to @onursatici's recent changes 🥳. Seems like there's now a failure on TPC-H queries, probably due to the preferred encoding of string dictionaries resolving to dict<_, utf8view>. I'll take a look when I have a moment:

thread 'main' panicked at /home/runner/_work/vortex/vortex/bench-vortex/src/benchmark_driver.rs:182:33:
query: 1 failed with: Arrow error: Compute error: Internal Error: Cannot cast Utf8View to StringArray of expected type
Backtrace:
   0: bench_vortex::benchmark_driver::execute_queries::<bench_vortex::tpch::tpch_benchmark::TpcHBenchmark>::{closure#0}::{closure#0}::{closure#0}::{closure#0}

asubiotto · Nov 17 '25

Codecov Report

:x: Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 84.62%. Comparing base (265c3b6) to head (d97ff96).
:warning: Report is 27 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| vortex-array/src/arrow/compute/to_arrow/dict.rs | 84.61% | 4 Missing :warning: |

:umbrella: View full report in Codecov by Sentry.

codecov[bot] · Nov 27 '25

This PR is now ready for review again. Everything is green apart from the benchmarks. The main changes since the initial PR are:

  • Coerce dictionary-encoded utf8view values to dict<_, utf8>, since dict<_, utf8view> essentially double-encodes dictionaries and isn't well supported by arrow (this was the TPC-H failure: a missing arrow cast arm). See the sketch after this list.
  • In compare/numeric, I force canonicalization when falling back to arrow, which forces dictionary expansion. The benchmark regressions come from this. I don't think the regressions matter that much, since we're talking about the arrow fallback and the canonicalization was happening anyway for dict-encoded arrays. I might be wrong though. I can add more special-casing to make the benchmarks look good.
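For reference, a rough sketch of the first bullet's dict<_, utf8view> to dict<_, utf8> coercion in plain arrow-rs terms; this is illustrative only, not the actual kernel code in the PR:

```rust
use std::sync::Arc;

use arrow_array::builder::StringViewBuilder;
use arrow_array::{Array, ArrayRef, DictionaryArray, Int32Array};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType};

fn main() -> Result<(), ArrowError> {
    // Dictionary values as the dict encoding hands them over: utf8view.
    let mut builder = StringViewBuilder::new();
    builder.append_value("foo");
    builder.append_value("bar");
    let view_values: ArrayRef = Arc::new(builder.finish());
    let keys = Int32Array::from(vec![0, 1, 1, 0]);

    // Cast the values to plain utf8 before assembling the dictionary, so the
    // result is Dictionary(Int32, Utf8) rather than Dictionary(Int32, Utf8View).
    let utf8_values = cast(&view_values, &DataType::Utf8)?;
    let dict = DictionaryArray::try_new(keys, utf8_values)?;

    assert_eq!(
        dict.data_type(),
        &DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8))
    );
    Ok(())
}
```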

cc @onursatici

asubiotto · Dec 01 '25

Am I right in thinking that for your use case you will always be passing an Arrow target DataType?

gatesn · Dec 01 '25

Am I right in thinking that for your use case you will always be passing an Arrow target DataType?

No. The datafusion VortexOpener converts to the preferred arrow target type. We could plumb it down, but I also don't think it's right for on-disk dictionary encoding to be expanded when converting to arrow without a target type specified; i.e. dictionary encoding should "prefer" to stay zero-copy when decoding to arrow.

asubiotto · Dec 01 '25

I didn't think DataFusion handled dictionary / other encodings that weren't == the declared schema? Or am I out of date here?

gatesn · Dec 01 '25

I didn't think DataFusion handled dictionary / other encodings that weren't == the declared schema? Or am I out of date here?

You're right, but that's taken care of in the schema adapter step, which uses an arrow cast to convert the file batches to the table schema if they differ (a sketch of that cast follows the list below). So it's correct, but it could be more performant. I think there are two questions here:

  1. Should we pass in a target arrow type from datafusion to vortex? Yes, probably. This will solve our use-case and seems like a thing we should do in general. I agree here.
  2. Should vortex dictionary arrays be able to zero-copy to arrow when no target type is specified? I think they should, regardless of datafusion and/or our use-case.
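To make the schema-adapter step concrete, a rough illustration in plain arrow-rs (not DataFusion's actual adapter code): a file column arrives as Dictionary(Int32, Utf8) while the table schema declares Utf8, and a single cast reconciles them at the cost of expanding the dictionary:

```rust
use std::sync::Arc;

use arrow_array::{Array, ArrayRef, DictionaryArray, Int32Array, StringArray};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType};

fn main() -> Result<(), ArrowError> {
    // A file batch column as vortex could hand it back: dictionary-encoded.
    let keys = Int32Array::from(vec![0, 1, 0, 0]);
    let values: ArrayRef = Arc::new(StringArray::from(vec!["a", "b"]));
    let file_col: ArrayRef = Arc::new(DictionaryArray::try_new(keys, values)?);

    // The schema adapter casts to the table schema's declared Utf8 type.
    // Correct, but not free: the dictionary is materialized into a flat array.
    let table_col = cast(&file_col, &DataType::Utf8)?;
    assert_eq!(table_col.data_type(), &DataType::Utf8);
    assert_eq!(table_col.len(), 4);
    Ok(())
}
```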

asubiotto · Dec 01 '25

Ahhh, gotcha.

  1. Yes 100%
  2. Yes 100%

gatesn · Dec 01 '25