GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE
This allows storing binary data of arbitrary length in a Parquet file without having to wrongly declare it as UTF-8.
Fixes the writer part of #42971
The reader part has already been fixed in 4d825497cb04c9e1c288000a7a8f75786cc487ff and this uses a similar implementation, but with a stricter set of "exceptions" (only byte arrays with NONE type are allowed).
Rationale for this change
Hello,
We are trying to store binary data (in our case, dumps of captured CAN messages) in a Parquet file. The data has a variable length (from 0 to 8 bytes) and is not a UTF-8 string (or a text string at all). For this, physical type BYTE_ARRAY and logical type NONE seem appropriate.
Unfortunately, the parquet writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.
What changes are included in this PR?
Allow the Parquet stream writer to store data in a BYTE_ARRAY with NONE logical type. The changes are based on similar changes made earlier to the stream reader.
Are these changes tested?
I'm not sure if this is the right way to fix this problem. I'm happy to add tests if needed after the general idea has been validated.
In particular, the NONE type does not assume ASCII text (with no NUL bytes inside), so the operator<<(const char* v) method may need to be excluded from this (and only allow UTF-8). What do you think? In that case, how could this be implemented without making slightly different versions of CheckColumn for each case?
Are there any user-facing changes?
Parquet stream writer allows using BYTE_ARRAY with NONE converted type for storage of arbitrary binary data.
- GitHub Issue: #42971
Thanks for opening a pull request!
If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
In the case of PARQUET issues on JIRA the title also supports:
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
See also:
Would you mind:
- adding a test for this
- marking this PR for the Parquet stream writer?
Updated the existing test to cover this case, and modified the PR title.
The tests I found do not seem to validate the Parquet file contents; is there another place where I could check that?
:warning: GitHub issue #42971 has been automatically assigned in GitHub to PR creator.
Hello, my colleague NicoDbs00 finished my work here; I'm not sure if I need to do something with the github-actions bot to approve his work?
Oops, it seems one of our CI setups doesn't like std::basic_string_view<uint8_t>...
https://github.com/apache/arrow/actions/runs/16043863615/job/45271747332?pr=44739#step:7:1125
Perhaps we should use arrow::util::span<const uint8_t> instead (arrow::util::span is a backport of std::span).
Is it possible to use std::basic_string_view<std::byte> ~~or std::basic_string_view<char>~~?
We don't use std::byte currently in the codebase, so that would seem weird.
Generally, std::byte, char, and uint8_t can be safely cast between one another.
Oh, let's ensure the PR description fits in the pre-defined template. Edit: done.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5852134c7a01dc2a2ddab207b684e46af696f086.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 105 possible false positives for unstable benchmarks that are known to sometimes produce them.