GH-41321: [C++][Parquet] More strict Parquet level checking
Rationale for this change
In https://github.com/apache/arrow/issues/41321 , user reports a corrupt when reading from a corrupt parquet file. This is because we lost some checking. Current code works on reading a normal parquet file. But when reading a corrupt file, this need to be more strict.
Currently this patch just enhance the checking on Parquet Level, the correspond value check would be add in later patches
What changes are included in this PR?
More strict parquet checkings on Level
Are these changes tested?
Already exists test, maybe we can introduce parquet file as test file
Are there any user-facing changes?
More strict checkings
- GitHub Issue: #41321
cc @emkornfield @pitrou Would you mind take a look? This patch checks the counting is right in RecordReader
@mapleFU Gentle ping if this PR could be moved forward. This would help me making progress on fuzzing my own code, which is now stopped by crashes in libarrow/libparquet.
Sorrying for delay I just back from holidays, will address the comments tomorrow.
Sigh, adding checks is easy, but maybe reasoning them would be a little tricky, and https://github.com/apache/arrow/commit/77fc23fcae0331da3adf94619a381a371a6e414f adds some ad-hoc checkings. Not ad-hoc checkings are written everywhere in our system. Maybe I can summarize them as:
- Levels Count / num_values not matches the
num_values()in Page Header.LevelDecodercurrently doesn't checks this 😭. But we will check the decodeddef-levelshould equal torep-level, and we can checkavailable_values_current_page() - storing value not matches the level, like in
ReadBatch, we parsed that we have 10 non-null values, but only reads 5. This should be regard as "value not matches level"
@pitrou @wgtmac Would you mind take a look?
@emkornfield @pitrou Would you mind re take a look?
@github-actions crossbow submit -g cpp
Revision: 3d58afeb4e31e054bff6bb99881360a3b8b8233c
Submitted crossbow builds: ursacomputing/crossbow @ actions-c74132eedf
Will merge in one day if no negative comments, since we collect 1 +1 vote
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1f07404dac920bf81f852f834622f2fc30f8dcfc.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them.