arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE

Open pulkomandy opened this issue 1 year ago • 4 comments

This allows to store binary data of arbitrary length in a parquet file, without having to wrongly declare it as UTF-8.

Fixes the writer part of #42971

The reader part has already been fixed in 4d825497cb04c9e1c288000a7a8f75786cc487ff and this uses a similar implementation, but with a stricter set of "exceptions" (only byte arrays with NONE type are allowed).

Rationale for this change

Hello,

We are trying to store binary data (in our case, dump of captured CAN messages) in a parquet file. The data has a variable length (from 0 to 8 bytes) and is not an UTF-8 string (or a text string at all). For this, physical type BYTE_ARRAY and logical type NONE seems appropriate.

Unfortunately, the parquet writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.

What changes are included in this PR?

Allow the parquet stream writer to store data in a BYTE_ARRAY with NONE logical type. The changes are based to similar changes made earlier to the stream reader.

Are these changes tested?

I'm not sure if this is the right way to fix this problem. I'm happy to add tests if needed after the general idea has been validated.

In particular, the NONE type does not assume ASCII text (with no NULL bytes inside), so the operator<<(const char* v) method may need to be excluded from this (and only allow UTF-8), what do you think? In that case, what would be the way of implementing this without making slightly different versions of CheckColumn for each case?

Are there any user-facing changes?

Parquet stream writer allows using BYTE_ARRAY witn NONE converted type for storage of arbitrary binary data.

  • GitHub Issue: #42971

pulkomandy avatar Nov 15 '24 12:11 pulkomandy

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Nov 15 '24 12:11 github-actions[bot]

mind:

  1. a test for this
  2. Mark this pr for parquet stream writer?

mapleFU avatar Nov 15 '24 12:11 mapleFU

Updated the existing test to cover this case, and modified the PR title.

The tests I found does not seem to validate the parquet file contents, is there another place where I could check that?

pulkomandy avatar Nov 15 '24 12:11 pulkomandy

:warning: GitHub issue #42971 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Nov 17 '24 15:11 github-actions[bot]

Hello, my colleague NicoDbs00 finished my work here, I'm not sure if I need to do something with the github-actions bot to approve his work?

pulkomandy avatar Jul 01 '25 09:07 pulkomandy

Oops, it seems one our CI setups doesn't like std::basic_string_view<uint8_t>... https://github.com/apache/arrow/actions/runs/16043863615/job/45271747332?pr=44739#step:7:1125

Perhaps we should use arrow::util::span<const uint8_t> instead. (arrow::util::span is a backport of std::span).

pitrou avatar Jul 03 '25 07:07 pitrou

Is it possible to use std::basic_string_view<std::byte> ~~or std::basic_string_view<char>~~?

wgtmac avatar Jul 03 '25 08:07 wgtmac

We don't use std::byte currently in the codebase, so that would seem weird.

pitrou avatar Jul 03 '25 08:07 pitrou

Is it possible to use std::basic_string_view<std::byte> ~or std::basic_string_view<char>~?

generally std::byte, char and uint8_t can be safely casted

mapleFU avatar Jul 03 '25 08:07 mapleFU

Oh, let's ensure the PR description fits in the pre-defined template. Edit: done.

pitrou avatar Jul 08 '25 15:07 pitrou

:warning: GitHub issue #42971 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jul 08 '25 15:07 github-actions[bot]

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5852134c7a01dc2a2ddab207b684e46af696f086.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 105 possible false positives for unstable benchmarks that are known to sometimes produce them.