Corrected row index usage when exploding packed arrays in vectorized reader
This PR fixes an issue in the vectorized parquet reader when the explode function is applied to nested arrays that cut across two or more pages. It's probably possible to minimize the reproducer slightly further, but I wasn't able to find a smaller one. It's also worth noting that this issue illustrates a current gap in the lower-level unit tests for the vectorized reader, which don't appear to test much related to output vector offsets.
The bug in question was a simple typo: the output row offset was used to dereference nested array lengths rather than the input row offset. This only matters for the explode function, and then only when resuming the same operation on a second page. These cases are, at present, untested. I added a high-level test and an example .parquet file that reproduces the issue and verifies the fix, but it would be ideal if more tests were added at a lower level. It is very likely that other similar bugs are present within the vectorized reader as it relates to nested substructures remapped during the query pipeline.
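To illustrate the class of mistake, here is a minimal hypothetical sketch (names and structure are illustrative only; the actual reader code lives in Spark's Java vectorized parquet classes). When a batch resumes partway through a second page, the input row offset and the output row offset diverge, so nested array lengths must be indexed by the input offset, not the output offset:

```scala
// Hypothetical sketch only, not the actual Spark reader code.
object OffsetSketch {
  def copyArrayLengths(
      arrayLengths: Array[Int],   // per-row nested array lengths read from the current page
      inputRowOffset: Int,        // first input row to consume on this page
      outputRowOffset: Int,       // first output slot to fill in the column vector
      rowsToCopy: Int): Array[Int] = {
    val outputLengths = new Array[Int](outputRowOffset + rowsToCopy)
    var i = 0
    while (i < rowsToCopy) {
      // Buggy form would read arrayLengths(outputRowOffset + i), which points at the
      // wrong row (or past the end of the array) once the two offsets differ.
      outputLengths(outputRowOffset + i) = arrayLengths(inputRowOffset + i)
      i += 1
    }
    outputLengths
  }
}
```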
What changes were proposed in this pull request?
It's a fairly straightforward typo issue in the code.
Why are the changes needed?
The vectorized parquet reader does not correctly handle this case.
Does this PR introduce any user-facing change?
Aside from fixing the vectorized reader? No.
How was this patch tested?
Unit test (well, more of an integration test) included in the PR.
Was this patch authored or co-authored using generative AI tooling?
Nope
Should we have a corresponding JIRA ticket for this fix?
Is this being held up by anything? Any JIRA would be a fairly trivial transliteration of the test case that I added. Note the query and the example parquet file. That example does not work today on production Databricks instances (resulting in an ArrayIndexOutOfBoundsException).
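For reference, the failing query has roughly this shape (an illustrative sketch only; the column name and exact schema here are placeholders, and the real query lives in the PR's test):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// "packed" is a hypothetical column name standing in for the nested array column
// in the checked-in file. With the vectorized reader enabled (the default), this
// kind of read-and-explode is what raised the ArrayIndexOutOfBoundsException.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.read.parquet("packed-list-vectorized.parquet")
df.select(explode(col("packed"))).collect()
```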
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Is there any particular reason this is languishing? I'd like to make sure this gets fixed.
@djspiewak How is packed-list-vectorized.parquet generated, and could it be generated on the fly? BTW, I suppose the bug also exists on the master branch; the PR should target master instead of branch-3.5.
It was generated using an internal tool written in Go using standard libraries. It took a fair bit of minimization of different test files that were breaking Databricks to narrow it down to this one, which I believe is minimal.
It's definitely possible to generate this situation dynamically but that seemed considerably harder than just baking in the reproducer in this fashion.
@djspiewak The ASF projects highly discourage including binary files (e.g. class files, jars, data files that cannot be edited via a text editor) in the source release/codebase, and all files used for testing should be reproducible.
There is a recent discussion about removing testing jars from the codebase: https://lists.apache.org/thread/0ro5yn6lbbpmvmqp2px3s2pf7cwljlc4
This is a fair stance, though I'll note that the repository already has many violations of this as far as I can tell. If this is considered a blocker to merge, I can work on generating the test data automatically.
@djspiewak Indeed, it is clear that there are relevant counterexamples in the current Spark repository. However, the Spark developers have taken proactive measures to address this issue, as evidenced by the following PRs:
- https://github.com/apache/spark/pull/50378
- https://github.com/apache/spark/pull/50422
- https://github.com/apache/spark/pull/50790
I believe that this problem will be completely eradicated in the future. Therefore, it would be preferable if you could optimize the testing process by generating the necessary Parquet files as test cases during the testing phase. Thanks ~
@LuciferYang This is a slightly different situation, as jars and class files are not the same as parquet. Is there an existing example of how to generate test parquet files? As I noted, there are already a significant number of such files, so I assume this has been discussed before.
If nothing has been standardized, I can work on sorting that out; I'm just trying to avoid redundancy here.
I've retargeted this pull request and updated the commits. I also took a moment to reexamine the test to see what can be done about the binary file. It's important to understand that this bug manifests specifically when exploding across a row group boundary. That's doable without encoding the test data as parquet, but it's not particularly easy or obvious to do. I'd also like to reiterate that most of the test suite I modified is using checked-in parquet files.
Again, happy to change this if there's already a pre-existing pattern for how to write this sort of test (or intrinsically generate this sort of data), but given the nature of the bug, the present diff seems optimal to me.
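For what it's worth, here is a rough sketch of the kind of on-the-fly generation I have in mind (not what this PR does, and untested here; the sizes and column name are guesses at what would force nested arrays across page/row-group boundaries):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: shrink the parquet row group and page sizes so that rows
// containing nested arrays are forced to span page and row-group boundaries.
// Whether this reproduces the exact layout of the checked-in file is unverified.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.setInt("parquet.block.size", 4096)  // tiny row groups (size is a guess)
hadoopConf.setInt("parquet.page.size", 1024)   // tiny pages (size is a guess)

// Nested arrays large enough that consecutive rows land in different pages/groups.
val rows = (0 until 10000).map(i => Seq.fill(16)(Seq.fill(4)(i.toLong)))
rows.toDF("packed").write.mode("overwrite").parquet("/tmp/packed-list-vectorized")
```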
The Apache Spark repository takes advantage of GitHub Actions resources on contributor forks to work around resource limits. Could you set up GitHub Actions on your repository?
- https://spark.apache.org/contributing.html
As of now, this PR seems to fail like the following.
Ref: bug/packed-list-vectorized
SHA: 466266f25cbf86ef5e810344754aba2af7e61c80
Error: There was a new unsynced commit pushed. Please retrigger the workflow.
at eval (eval at callAsyncFunction (/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15143:16), <anonymous>:74:11)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async main (/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15236:20)
Error: Unhandled error: Error: There was a new unsynced commit pushed. Please retrigger the workflow.
cc @sunchao
If there are no objections from other Spark PMCs, I accept this testing proposal.
Oh lol I actively disabled the actions on my fork. I'll sort that out
@dongjoon-hyun I've reenabled actions. Please let me know if I need to reconfigure anything else on my end
@djspiewak Could you please rebase this one? I manually restarted the GitHub Actions jobs, but it still failed. Thanks
Merged to master/branch-3.5, thanks!
Thank you all.