Corrected row index usage when exploding packed arrays in vectorized reader
This PR fixes an issue in the vectorized parquet reader when the explode function is applied to nested arrays that cut across two or more pages. It's probably possible to minimize the reproducer slightly further, but I wasn't able to find a smaller one. It's also worth noting that this issue illustrates a current gap in the lower-level unit tests for the vectorized reader, which don't appear to test much related to output vector offsets.
The bug in question was a simple typo: the output row offset was used to dereference nested array lengths rather than the input row offset. This only matters for the explode function, and then only when resuming the same operation on a second page. These cases are, at present, untested. I added a high-level test and an example .parquet file that reproduces the issue and verifies the fix, but it would be ideal if more tests were added at a lower level. It is very likely that other similar bugs are present within the vectorized reader as it relates to nested substructures remapped during the query pipeline.
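To illustrate the class of mistake, here is a minimal hypothetical sketch (names and structure are illustrative only; the actual reader code lives in Spark's Java vectorized parquet classes). When a batch resumes partway through a second page, the input row offset and the output row offset diverge, so nested array lengths must be indexed by the input offset, not the output offset:

```scala
// Hypothetical sketch only, not the actual Spark reader code.
object OffsetSketch {
  def copyArrayLengths(
      arrayLengths: Array[Int],   // per-row nested array lengths read from the current page
      inputRowOffset: Int,        // first input row to consume on this page
      outputRowOffset: Int,       // first output slot to fill in the column vector
      rowsToCopy: Int): Array[Int] = {
    val outputLengths = new Array[Int](outputRowOffset + rowsToCopy)
    var i = 0
    while (i < rowsToCopy) {
      // Buggy form would read arrayLengths(outputRowOffset + i), which points at the
      // wrong row (or past the end of the array) once the two offsets differ.
      outputLengths(outputRowOffset + i) = arrayLengths(inputRowOffset + i)
      i += 1
    }
    outputLengths
  }
}
```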
What changes were proposed in this pull request?
It's a fairly straightforward typo issue in the code.
Why are the changes needed?
The vectorized parquet reader does not correctly handle this case.
Does this PR introduce any user-facing change?
Aside from fixing the vectorized reader? No.
How was this patch tested?
Unit test (well, more of an integration test) included in the PR.
Was this patch authored or co-authored using generative AI tooling?
Nope
Should we have a corresponding JIRA ticket for this fix?
Is this being held up by anything? Any JIRA would be a fairly trivial transliteration of the test case that I added. Note the query and the example parquet file. That example does not work today on production Databricks instances (resulting in an ArrayIndexOutOfBoundsException).
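For reference, the failing query has roughly this shape (an illustrative sketch only; the column name and exact schema here are placeholders, and the real query lives in the PR's test):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// "packed" is a hypothetical column name standing in for the nested array column
// in the checked-in file. With the vectorized reader enabled (the default), this
// kind of read-and-explode is what raised the ArrayIndexOutOfBoundsException.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.read.parquet("packed-list-vectorized.parquet")
df.select(explode(col("packed"))).collect()
```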
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Is there any particular reason this is languishing? I'd like to make sure this gets fixed.
@djspiewak How is packed-list-vectorized.parquet generated, and could it be generated on the fly? BTW, I suppose the bug also exists on the master branch; the PR should target master instead of branch-3.5.
It was generated using an internal tool written in Go using standard libraries. It took a fair bit of minimization of different test files that were breaking Databricks to narrow it down to this one, which I believe is minimal.
It's definitely possible to generate this situation dynamically but that seemed considerably harder than just baking in the reproducer in this fashion.
@djspiewak The ASF projects highly discourage including binary files (e.g. class files, jars, data files that cannot be edited via a text editor) in the source release/codebase, and all files used for testing should be reproducible.
There is a recent discussion about removing testing jars from the codebase: https://lists.apache.org/thread/0ro5yn6lbbpmvmqp2px3s2pf7cwljlc4
This is a fair stance, though I'll note that the repository already has many violations of this as far as I can tell. If this is considered a blocker to merge, I can work on generating the test data automatically.
@djspiewak Indeed, it is clear that there are relevant counterexamples in the current Spark repository. However, the Spark developers have taken proactive measures to address this issue, as evidenced by the following PRs:
- https://github.com/apache/spark/pull/50378
- https://github.com/apache/spark/pull/50422
- https://github.com/apache/spark/pull/50790
I believe that this problem will be completely eradicated in the future. Therefore, it would be preferable if you could optimize the testing process by generating the necessary Parquet files as test cases during the testing phase. Thanks ~
@LuciferYang This is a slightly different situation, as jars and class files are not the same as parquet. Is there an existing example of how to generate test parquet files? As I noted, there are already a significant number of such files, so I assume this has been discussed before.
If nothing has been standardized, I can work on sorting that out; I'm just trying to avoid redundancy here.
I've retargeted this pull request and updated the commits. I also took a moment to reexamine the test to see what can be done about the binary file. It's important to understand that this bug manifests specifically when exploding across a row group boundary. That's doable without encoding the test data as parquet, but it's not particularly easy or obvious to do. I'd also like to reiterate that most of the test suite I modified is using checked-in parquet files.
Again, happy to change this if there's already a pre-existing pattern for how to write this sort of test (or intrinsically generate this sort of data), but given the nature of the bug, the present diff seems optimal to me.
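For what it's worth, here is a rough sketch of the kind of on-the-fly generation I have in mind (not what this PR does, and untested here; the sizes and column name are guesses at what would force nested arrays across page/row-group boundaries):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: shrink the parquet row group and page sizes so that rows
// containing nested arrays are forced to span page and row-group boundaries.
// Whether this reproduces the exact layout of the checked-in file is unverified.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.setInt("parquet.block.size", 4096)  // tiny row groups (size is a guess)
hadoopConf.setInt("parquet.page.size", 1024)   // tiny pages (size is a guess)

// Nested arrays large enough that consecutive rows land in different pages/groups.
val rows = (0 until 10000).map(i => Seq.fill(16)(Seq.fill(4)(i.toLong)))
rows.toDF("packed").write.mode("overwrite").parquet("/tmp/packed-list-vectorized")
```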
The Apache Spark repository takes advantage of GitHub Actions resources on contributor forks to work around resource limits. Could you set up GitHub Actions on your repository?
- https://spark.apache.org/contributing.html
As of now, this PR seems to fail like the following.
Ref: bug/packed-list-vectorized
SHA: 466266f25cbf86ef5e810344754aba2af7e61c80
Error: There was a new unsynced commit pushed. Please retrigger the workflow.
at eval (eval at callAsyncFunction (/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15143:16), <anonymous>:74:11)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async main (/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15236:20)
Error: Unhandled error: Error: There was a new unsynced commit pushed. Please retrigger the workflow.
cc @sunchao
If there are no objections from other Spark PMCs, I accept this testing proposal.
Oh lol I actively disabled the actions on my fork. I'll sort that out
@dongjoon-hyun I've reenabled actions. Please let me know if I need to reconfigure anything else on my end
@djspiewak Could you please rebase this one? I manually restarted the GitHub Actions jobs, but it still failed. Thanks
Merged to master/branch-3.5, thanks!
Thank you all.