arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-42102: [C++][Parquet] Add binary that extracts a footer from a parquet file

Open alkis opened this issue 1 year ago • 11 comments
trafficstars

Rationale for this change

This binary will make it a lot easier for customers to share their parquet metadata with the community so that we can build a repository of footers that can be used for advancing the state of metadata in parquet.

What changes are included in this PR?

Usage from the file binary itself:

Usage: parquet-dump-footer
  -h|--help    Print help and exit
  --no-scrub   Do not scrub potentially PII metadata
  --json       Output JSON instead of binary
  --in         Input file: required
  --out        Output file: defaults to stdout

  Dumps the footer of a Parquet file to stdout or a file, optionally with
  potentially PII metadata scrubbed.

Are these changes tested?

Manually on existing parquet files.

Are there any user-facing changes?

No.

  • GitHub Issue: #42102

alkis avatar Jun 17 '24 12:06 alkis

:warning: GitHub issue #42102 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jun 17 '24 12:06 github-actions[bot]

Windows build keeps failing. Could you advice? It seems it can't find library arrow_static to link to.

alkis avatar Jun 19 '24 11:06 alkis

Just a high-level comment on usability. I don't think any binaries are currently packaged in release channels so if we want this to be easier to use we should consider embedding it in the library with python bindings

emkornfield avatar Jun 20 '24 16:06 emkornfield

Just a high-level comment on usability. I don't think any binaries are currently packaged in release channels so if we want this to be easier to use we should consider embedding it in the library with python bindings

You mean writing this in python and shipping as a python script?

alkis avatar Jun 20 '24 17:06 alkis

I don't think any binaries are currently packaged in release channels

FYI: deb and RPM include binaries:

  • https://github.com/apache/arrow/blob/main/dev/tasks/linux-packages/apache-arrow/debian/parquet-tools.install
  • https://github.com/apache/arrow/blob/a01fe038df1b6e939d94423b773bcb2ff4a75b0d/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in#L531

kou avatar Jun 20 '24 21:06 kou

I don't think any binaries are currently packaged in release channels

FYI: deb and RPM include binaries:

  • https://github.com/apache/arrow/blob/main/dev/tasks/linux-packages/apache-arrow/debian/parquet-tools.install
  • https://github.com/apache/arrow/blob/a01fe038df1b6e939d94423b773bcb2ff4a75b0d/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in#L531

I didn't realize this. I guess I was thinking of the pyarrow/R libraries as the primary distribution source.

You mean writing this in python and shipping as a python script?

I was thinking of just include the functionality in the pyarrow arrow library by including C++ as part of the library and wrapping in python. given that there is some packaging that does include the binaries this might be less of an issue.

emkornfield avatar Jun 20 '24 22:06 emkornfield

Given that tools are packages in deb and rpm this covers the majority of linux.

alkis avatar Jun 21 '24 06:06 alkis

Is this ready for merge?

alkis avatar Jun 25 '24 08:06 alkis

General looks ok to me, so this patch would extract a footer with "scrub" might be used?

Yes, the purpose is to build a repository of large/slow to parse footers and guide the design of better parquet metadata representation.

alkis avatar Jun 25 '24 20:06 alkis

Friendly ping. Can this be merged?

alkis avatar Jul 02 '24 08:07 alkis

Will invite @emkornfield @pitrou for comment

Though style is a bit weird ( Maybe we should also support output to arrow-fs, so that we can extract a new footer from s3 file to a new s3 file), but this is a good start for me

mapleFU avatar Jul 02 '24 08:07 mapleFU

Friendly ping. @emkornfield @pitrou could you take a look?

alkis avatar Jul 10 '24 15:07 alkis

Looks like CI failure might be related to this change:

FAILED: debug/parquet-dump-footer 
: && /usr/lib/ccache/c++ -Wredundant-move -Wno-noexcept-type  -fdiagnostics-color=always  -Wall -Wno-conversion -Wno-sign-conversion -Wdate-time -Wunused-result -fno-semantic-interposition -msse4.2  -g -Werror -O0 -ggdb -g1  tools/parquet/CMakeFiles/parquet-dump-footer.dir/parquet_dump_footer.cc.o -o debug/parquet-dump-footer  -Wl,-rpath,/build/cpp/debug:  debug/libparquet.so.1700.0.0  /usr/lib/x86_64-linux-gnu/libthrift.so  -larrow_static  debug/libarrow.so.1700.0.0 && :
/usr/bin/ld: cannot find -larrow_static: No such file or directory
collect2: error: ld returned 1 exit status

Generally seems OK, asked a few questions for clarity and some suggestions on using existing functionality.

emkornfield avatar Jul 11 '24 04:07 emkornfield

Looks like CI failure might be related to this change:

FAILED: debug/parquet-dump-footer 
: && /usr/lib/ccache/c++ -Wredundant-move -Wno-noexcept-type  -fdiagnostics-color=always  -Wall -Wno-conversion -Wno-sign-conversion -Wdate-time -Wunused-result -fno-semantic-interposition -msse4.2  -g -Werror -O0 -ggdb -g1  tools/parquet/CMakeFiles/parquet-dump-footer.dir/parquet_dump_footer.cc.o -o debug/parquet-dump-footer  -Wl,-rpath,/build/cpp/debug:  debug/libparquet.so.1700.0.0  /usr/lib/x86_64-linux-gnu/libthrift.so  -larrow_static  debug/libarrow.so.1700.0.0 && :
/usr/bin/ld: cannot find -larrow_static: No such file or directory
collect2: error: ld returned 1 exit status

Generally seems OK, asked a few questions for clarity and some suggestions on using existing functionality.

@kou do you understand why this fails in CI?

alkis avatar Jul 17 '24 05:07 alkis

I think that the build doesn't have -DARROW_BUILD_STATIC=ON.

BTW, we should not use ${ARROW_LIBRARIES}. That is arrow_shared, arrow_static or arrow_shared;arrow_static. If it's arrow_shared;arrow_static, both of them are linked.

Why did you use https://github.com/apache/arrow/pull/42174/commits/7d896e49b0de076e2010fa50fff3b4c23788144b ?

kou avatar Jul 17 '24 07:07 kou

Could you fix lint failure? https://github.com/apache/arrow/actions/runs/9968954644/job/27547291192?pr=42174#step:6:30

You can use pre-commit run --all.

kou avatar Jul 17 '24 07:07 kou

I think that the build doesn't have -DARROW_BUILD_STATIC=ON.

BTW, we should not use ${ARROW_LIBRARIES}. That is arrow_shared, arrow_static or arrow_shared;arrow_static. If it's arrow_shared;arrow_static, both of them are linked.

Why did you use 7d896e4 ?

I thought that's the right way to do it, I can revert that commit if the conditional is the right way.

alkis avatar Jul 17 '24 07:07 alkis

Could you try reverting the commit?

kou avatar Jul 17 '24 07:07 kou

Could you try reverting the commit?

There are link errors now on windows and macos: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/50227715. Any help on cmake would be greatly appreciated.

alkis avatar Jul 17 '24 08:07 alkis

Could you try linking parquet_shared/parquet_static instead of arrow_shared/arrow_static?

kou avatar Jul 17 '24 20:07 kou

Could you try linking parquet_shared/parquet_static instead of arrow_shared/arrow_static?

parquet_shared/parquet_static are linked for all tools a few lines above: https://github.com/apache/arrow/pull/42174/files#diff-94a0cc5cd65c55d279af9f4d93dd7d45cb2ad1bb4b12c463c3a3023e32b6c94dR24-R28 so I removed the array_shared/arrow_static stanza.

alkis avatar Jul 18 '24 07:07 alkis

Could you try linking parquet_shared/parquet_static instead of arrow_shared/arrow_static?

parquet_shared/parquet_static are linked for all tools a few lines above: https://github.com/apache/arrow/pull/42174/files#diff-94a0cc5cd65c55d279af9f4d93dd7d45cb2ad1bb4b12c463c3a3023e32b6c94dR24-R28 so I removed the array_shared/arrow_static stanza.

Still fails CI.

alkis avatar Jul 18 '24 07:07 alkis

Hmm. It seems that auto generated classes from parquet.thrift such as parquet::format:SizeStatistics don't have __declspec(dllexport)/__declspec(dllimport) for Windows. So they can't be used out of libparquet.

Do we need to use them?

kou avatar Jul 18 '24 08:07 kou

Hmm. It seems that auto generated classes from parquet.thrift such as parquet::format:SizeStatistics don't have __declspec(dllexport)/__declspec(dllimport) for Windows. So they can't be used out of libparquet.

Do we need to use them?

Oh interesting find. Yes unfortunately we need them. Otherwise we might miss scrubbing strings. Can we workaround this? Perhaps only create this binary on static builds?

alkis avatar Jul 18 '24 08:07 alkis

It will work. Could you try it?

kou avatar Jul 18 '24 09:07 kou

It will work. Could you try it?

That worked! Thank you @kou.

alkis avatar Jul 18 '24 13:07 alkis

@github-actions crossbow submit -g cpp

pitrou avatar Jul 22 '24 12:07 pitrou

Revision: 9659477618aa19e5b9d946c19412427b0e30efb8

Submitted crossbow builds: ursacomputing/crossbow @ actions-e904ab8147

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

github-actions[bot] avatar Jul 22 '24 12:07 github-actions[bot]

I tried this locally and it fails on all the Parquet files I tried to throw at it. Example:

$ /build/build-test/debug/parquet-dump-footer --json --in `pwd`/parquet-testing/data/alltypes_plain.parquet
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Couldn't deserialize thrift: TProtocolException: Invalid data

Abandon

pitrou avatar Jul 22 '24 12:07 pitrou

I'll push a fix for that.

pitrou avatar Jul 22 '24 12:07 pitrou