arrow GH-41321: [C++][Parquet] More strict Parquet level checking

Rationale for this change

In https://github.com/apache/arrow/issues/41321 , user reports a corrupt when reading from a corrupt parquet file. This is because we lost some checking. Current code works on reading a normal parquet file. But when reading a corrupt file, this need to be more strict.

Currently this patch just enhance the checking on Parquet Level, the correspond value check would be add in later patches

What changes are included in this PR?

More strict parquet checkings on Level

Are these changes tested?

Already exists test, maybe we can introduce parquet file as test file

Are there any user-facing changes?

More strict checkings

GitHub Issue: #41321

Apr 23 '24 06:04 mapleFU

cc @emkornfield @pitrou Would you mind take a look? This patch checks the counting is right in RecordReader

Apr 23 '24 14:04 mapleFU

@mapleFU Gentle ping if this PR could be moved forward. This would help me making progress on fuzzing my own code, which is now stopped by crashes in libarrow/libparquet.

May 07 '24 15:05 rouault

Sorrying for delay I just back from holidays, will address the comments tomorrow.

May 07 '24 15:05 mapleFU

Sigh, adding checks is easy, but maybe reasoning them would be a little tricky, and https://github.com/apache/arrow/commit/77fc23fcae0331da3adf94619a381a371a6e414f adds some ad-hoc checkings. Not ad-hoc checkings are written everywhere in our system. Maybe I can summarize them as:

Levels Count / num_values not matches the num_values() in Page Header. LevelDecoder currently doesn't checks this 😭. But we will check the decoded def-level should equal to rep-level, and we can check available_values_current_page()
storing value not matches the level, like in ReadBatch, we parsed that we have 10 non-null values, but only reads 5. This should be regard as "value not matches level"

May 08 '24 07:05 mapleFU

@pitrou @wgtmac Would you mind take a look?

May 14 '24 04:05 mapleFU

@emkornfield @pitrou Would you mind re take a look?

May 16 '24 09:05 mapleFU

@github-actions crossbow submit -g cpp

May 17 '24 14:05 mapleFU

Revision: 3d58afeb4e31e054bff6bb99881360a3b8b8233c

Submitted crossbow builds: ursacomputing/crossbow @ actions-c74132eedf

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-gcc-14

May 17 '24 14:05 github-actions[bot]

Will merge in one day if no negative comments, since we collect 1 +1 vote

May 20 '24 11:05 mapleFU

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1f07404dac920bf81f852f834622f2fc30f8dcfc.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them.

May 21 '24 15:05 conbench-apache-arrow[bot]