arrow
arrow copied to clipboard
GH-42102: [C++][Parquet] Add binary that extracts a footer from a parquet file
Rationale for this change
This binary will make it a lot easier for customers to share their parquet metadata with the community so that we can build a repository of footers that can be used for advancing the state of metadata in parquet.
What changes are included in this PR?
Usage from the file binary itself:
Usage: parquet-dump-footer
-h|--help Print help and exit
--no-scrub Do not scrub potentially PII metadata
--json Output JSON instead of binary
--in Input file: required
--out Output file: defaults to stdout
Dumps the footer of a Parquet file to stdout or a file, optionally with
potentially PII metadata scrubbed.
Are these changes tested?
Manually on existing parquet files.
Are there any user-facing changes?
No.
- GitHub Issue: #42102
:warning: GitHub issue #42102 has been automatically assigned in GitHub to PR creator.
Windows build keeps failing. Could you advice? It seems it can't find library arrow_static to link to.
Just a high-level comment on usability. I don't think any binaries are currently packaged in release channels so if we want this to be easier to use we should consider embedding it in the library with python bindings
Just a high-level comment on usability. I don't think any binaries are currently packaged in release channels so if we want this to be easier to use we should consider embedding it in the library with python bindings
You mean writing this in python and shipping as a python script?
I don't think any binaries are currently packaged in release channels
FYI: deb and RPM include binaries:
- https://github.com/apache/arrow/blob/main/dev/tasks/linux-packages/apache-arrow/debian/parquet-tools.install
- https://github.com/apache/arrow/blob/a01fe038df1b6e939d94423b773bcb2ff4a75b0d/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in#L531
I don't think any binaries are currently packaged in release channels
FYI: deb and RPM include binaries:
- https://github.com/apache/arrow/blob/main/dev/tasks/linux-packages/apache-arrow/debian/parquet-tools.install
- https://github.com/apache/arrow/blob/a01fe038df1b6e939d94423b773bcb2ff4a75b0d/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in#L531
I didn't realize this. I guess I was thinking of the pyarrow/R libraries as the primary distribution source.
You mean writing this in python and shipping as a python script?
I was thinking of just include the functionality in the pyarrow arrow library by including C++ as part of the library and wrapping in python. given that there is some packaging that does include the binaries this might be less of an issue.
Given that tools are packages in deb and rpm this covers the majority of linux.
Is this ready for merge?
General looks ok to me, so this patch would extract a footer with "scrub" might be used?
Yes, the purpose is to build a repository of large/slow to parse footers and guide the design of better parquet metadata representation.
Friendly ping. Can this be merged?
Will invite @emkornfield @pitrou for comment
Though style is a bit weird ( Maybe we should also support output to arrow-fs, so that we can extract a new footer from s3 file to a new s3 file), but this is a good start for me
Friendly ping. @emkornfield @pitrou could you take a look?
Looks like CI failure might be related to this change:
FAILED: debug/parquet-dump-footer
: && /usr/lib/ccache/c++ -Wredundant-move -Wno-noexcept-type -fdiagnostics-color=always -Wall -Wno-conversion -Wno-sign-conversion -Wdate-time -Wunused-result -fno-semantic-interposition -msse4.2 -g -Werror -O0 -ggdb -g1 tools/parquet/CMakeFiles/parquet-dump-footer.dir/parquet_dump_footer.cc.o -o debug/parquet-dump-footer -Wl,-rpath,/build/cpp/debug: debug/libparquet.so.1700.0.0 /usr/lib/x86_64-linux-gnu/libthrift.so -larrow_static debug/libarrow.so.1700.0.0 && :
/usr/bin/ld: cannot find -larrow_static: No such file or directory
collect2: error: ld returned 1 exit status
Generally seems OK, asked a few questions for clarity and some suggestions on using existing functionality.
Looks like CI failure might be related to this change:
FAILED: debug/parquet-dump-footer : && /usr/lib/ccache/c++ -Wredundant-move -Wno-noexcept-type -fdiagnostics-color=always -Wall -Wno-conversion -Wno-sign-conversion -Wdate-time -Wunused-result -fno-semantic-interposition -msse4.2 -g -Werror -O0 -ggdb -g1 tools/parquet/CMakeFiles/parquet-dump-footer.dir/parquet_dump_footer.cc.o -o debug/parquet-dump-footer -Wl,-rpath,/build/cpp/debug: debug/libparquet.so.1700.0.0 /usr/lib/x86_64-linux-gnu/libthrift.so -larrow_static debug/libarrow.so.1700.0.0 && : /usr/bin/ld: cannot find -larrow_static: No such file or directory collect2: error: ld returned 1 exit statusGenerally seems OK, asked a few questions for clarity and some suggestions on using existing functionality.
@kou do you understand why this fails in CI?
I think that the build doesn't have -DARROW_BUILD_STATIC=ON.
BTW, we should not use ${ARROW_LIBRARIES}. That is arrow_shared, arrow_static or arrow_shared;arrow_static. If it's arrow_shared;arrow_static, both of them are linked.
Why did you use https://github.com/apache/arrow/pull/42174/commits/7d896e49b0de076e2010fa50fff3b4c23788144b ?
Could you fix lint failure? https://github.com/apache/arrow/actions/runs/9968954644/job/27547291192?pr=42174#step:6:30
You can use pre-commit run --all.
I think that the build doesn't have
-DARROW_BUILD_STATIC=ON.BTW, we should not use
${ARROW_LIBRARIES}. That isarrow_shared,arrow_staticorarrow_shared;arrow_static. If it'sarrow_shared;arrow_static, both of them are linked.Why did you use 7d896e4 ?
I thought that's the right way to do it, I can revert that commit if the conditional is the right way.
Could you try reverting the commit?
Could you try reverting the commit?
There are link errors now on windows and macos: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/50227715. Any help on cmake would be greatly appreciated.
Could you try linking parquet_shared/parquet_static instead of arrow_shared/arrow_static?
Could you try linking
parquet_shared/parquet_staticinstead ofarrow_shared/arrow_static?
parquet_shared/parquet_static are linked for all tools a few lines above: https://github.com/apache/arrow/pull/42174/files#diff-94a0cc5cd65c55d279af9f4d93dd7d45cb2ad1bb4b12c463c3a3023e32b6c94dR24-R28 so I removed the array_shared/arrow_static stanza.
Could you try linking
parquet_shared/parquet_staticinstead ofarrow_shared/arrow_static?
parquet_shared/parquet_staticare linked for all tools a few lines above: https://github.com/apache/arrow/pull/42174/files#diff-94a0cc5cd65c55d279af9f4d93dd7d45cb2ad1bb4b12c463c3a3023e32b6c94dR24-R28 so I removed thearray_shared/arrow_staticstanza.
Still fails CI.
Hmm. It seems that auto generated classes from parquet.thrift such as parquet::format:SizeStatistics don't have __declspec(dllexport)/__declspec(dllimport) for Windows. So they can't be used out of libparquet.
Do we need to use them?
Hmm. It seems that auto generated classes from
parquet.thriftsuch asparquet::format:SizeStatisticsdon't have__declspec(dllexport)/__declspec(dllimport)for Windows. So they can't be used out oflibparquet.Do we need to use them?
Oh interesting find. Yes unfortunately we need them. Otherwise we might miss scrubbing strings. Can we workaround this? Perhaps only create this binary on static builds?
It will work. Could you try it?
It will work. Could you try it?
That worked! Thank you @kou.
@github-actions crossbow submit -g cpp
Revision: 9659477618aa19e5b9d946c19412427b0e30efb8
Submitted crossbow builds: ursacomputing/crossbow @ actions-e904ab8147
I tried this locally and it fails on all the Parquet files I tried to throw at it. Example:
$ /build/build-test/debug/parquet-dump-footer --json --in `pwd`/parquet-testing/data/alltypes_plain.parquet
terminate called after throwing an instance of 'parquet::ParquetException'
what(): Couldn't deserialize thrift: TProtocolException: Invalid data
Abandon
I'll push a fix for that.