arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-41321: [C++][Parquet] More strict Parquet level checking

Open mapleFU opened this issue 1 year ago • 9 comments

Rationale for this change

In https://github.com/apache/arrow/issues/41321 , user reports a corrupt when reading from a corrupt parquet file. This is because we lost some checking. Current code works on reading a normal parquet file. But when reading a corrupt file, this need to be more strict.

Currently this patch just enhance the checking on Parquet Level, the correspond value check would be add in later patches

What changes are included in this PR?

More strict parquet checkings on Level

Are these changes tested?

Already exists test, maybe we can introduce parquet file as test file

Are there any user-facing changes?

More strict checkings

  • GitHub Issue: #41321

mapleFU avatar Apr 23 '24 06:04 mapleFU

cc @emkornfield @pitrou Would you mind take a look? This patch checks the counting is right in RecordReader

mapleFU avatar Apr 23 '24 14:04 mapleFU

@mapleFU Gentle ping if this PR could be moved forward. This would help me making progress on fuzzing my own code, which is now stopped by crashes in libarrow/libparquet.

rouault avatar May 07 '24 15:05 rouault

Sorrying for delay I just back from holidays, will address the comments tomorrow.

mapleFU avatar May 07 '24 15:05 mapleFU

Sigh, adding checks is easy, but maybe reasoning them would be a little tricky, and https://github.com/apache/arrow/commit/77fc23fcae0331da3adf94619a381a371a6e414f adds some ad-hoc checkings. Not ad-hoc checkings are written everywhere in our system. Maybe I can summarize them as:

  1. Levels Count / num_values not matches the num_values() in Page Header. LevelDecoder currently doesn't checks this 😭. But we will check the decoded def-level should equal to rep-level, and we can check available_values_current_page()
  2. storing value not matches the level, like in ReadBatch, we parsed that we have 10 non-null values, but only reads 5. This should be regard as "value not matches level"

mapleFU avatar May 08 '24 07:05 mapleFU

@pitrou @wgtmac Would you mind take a look?

mapleFU avatar May 14 '24 04:05 mapleFU

@emkornfield @pitrou Would you mind re take a look?

mapleFU avatar May 16 '24 09:05 mapleFU

@github-actions crossbow submit -g cpp

mapleFU avatar May 17 '24 14:05 mapleFU

Revision: 3d58afeb4e31e054bff6bb99881360a3b8b8233c

Submitted crossbow builds: ursacomputing/crossbow @ actions-c74132eedf

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

github-actions[bot] avatar May 17 '24 14:05 github-actions[bot]

Will merge in one day if no negative comments, since we collect 1 +1 vote

mapleFU avatar May 20 '24 11:05 mapleFU

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1f07404dac920bf81f852f834622f2fc30f8dcfc.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them.