ARROW-17904: [C++][WIP] Parquet: Implement CRC in reading and writing pages for DATA_PAGE (v1)
This patch adds CRC support when writing and reading DATA_PAGE. CRC for dictionary pages and DATA_PAGE_V2 will be added in coming patches.
- [x] Implement CRC in writing DATA_PAGE
- [x] Implement CRC in reading DATA_PAGE
- [x] Add config options for writing page CRCs and verifying them
- [ ] Test DATA_PAGE with CRC; the tests may be borrowed from parquet-mr
- [x] Use the CRC library mentioned in https://issues.apache.org/jira/browse/ARROW-17904
There are also some open questions. I found that in thirdparty, Arrow imports crc32c, which is extracted from LevelDB's CRC library. However, the Parquet standard uses the plain crc32, which has a different polynomial, so I vendored the implementation mentioned in https://issues.apache.org/jira/browse/ARROW-17904 .
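Since the two algorithms differ only in their polynomial, a minimal bitwise sketch of the standard CRC-32 is enough to illustrate the incompatibility. The `Crc32` name and the chunk-chaining convention here are illustrative, not this patch's actual code:

```cpp
#include <cstddef>
#include <cstdint>

// Standard CRC-32 as used by gzip/zlib (reflected polynomial 0xEDB88320),
// which is what the Parquet spec mandates for page checksums. crc32c (the
// LevelDB flavor) uses the Castagnoli polynomial (reflected 0x82F63B78)
// instead, so its checksums are incompatible.
// Pass a previous return value as `crc` to checksum data in chunks.
uint32_t Crc32(const void* data, size_t len, uint32_t crc) {
  const auto* p = static_cast<const uint8_t*>(data);
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc ^= p[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
  }
  return ~crc;
}
// Sanity check: Crc32("123456789", 9, 0) == 0xCBF43926.
```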
parquet-mr enables CRC by default for the writer, but here I default it to false, because enabling it may slow down the writer.
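To make the defaults concrete, here is a hypothetical usage sketch. The option names (`enable_page_checksum`, `set_page_checksum_verification`) are assumptions in line with the renaming discussed later in this thread, not necessarily the final API:

```cpp
#include <memory>

#include <parquet/properties.h>

// Hypothetical sketch; option names are assumptions, not the merged API.
std::shared_ptr<parquet::WriterProperties> MakeWriterProps() {
  return parquet::WriterProperties::Builder()
      .enable_page_checksum()  // off by default in this patch, unlike parquet-mr
      ->build();
}

parquet::ReaderProperties MakeReaderProps() {
  parquet::ReaderProperties props;
  props.set_page_checksum_verification(true);  // verify page CRCs while reading
  return props;
}
```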
https://issues.apache.org/jira/browse/ARROW-17904
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
Thanks for the PR @mapleFU . I haven't looked in detail yet.
IMHO the testing approach should be to add reference files in https://github.com/apache/parquet-testing/tree/master/data . Ideally these files would cover the following feature matrix:
- correct and incorrect CRCs
- v1 and v2 data pages
- compressed and uncompressed columns
This would probably be spread over two files, one with correct CRCs, one with incorrect CRCs. What do you think?
I'd like to follow the testing approach here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java @pitrou
This actually seems more complicated to implement (you have to replicate details of the file format in the tests). Also, without reference files we wouldn't test that different implementations actually agree on the CRC computation.
Sure, let me generate some data files in parquet-mr following TestDataPageV1Checksums first.
@pitrou @kou Hi, so should I put the CRC code under vendored or util?
Since the code is simple and the algorithm well-known, my vote is to own it and keep it in util.
I like cpp/src/arrow/vendored/ but it's not a strong opinion. So I'm OK with util.
@mapleFU Is this still WIP?
@pitrou Hi, I've added some unit tests and they pass.
Now that I've generated some files from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java , should I first commit them to the parquet-testing project, and then test against them in C++?
By the way, I'm not familiar with parquet-mr; where does it test against parquet-testing? If it has some Parquet CRC tests, I can just follow them. /cc @wgtmac
I agree that we can add some test files generated by arrow and parquet-mr to the parquet-testing submodule, and then add interop tests in both the arrow and parquet-mr repos to do cross-validation. FYI, I have added an interop test for LZ4_RAW to parquet-mr: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestInteropReadLz4RawCodec.java
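For the C++ side, such an interop test could look roughly like the following sketch; the reference file name and the `set_page_checksum_verification` option name are illustrative assumptions rather than the final API:

```cpp
#include <iostream>

#include <parquet/exception.h>
#include <parquet/file_reader.h>
#include <parquet/properties.h>

// Sketch of a cross-implementation check: open a reference file written by
// parquet-mr with CRC verification enabled; reading a page of a deliberately
// corrupted file should then fail. File and option names are assumptions.
int main() {
  parquet::ReaderProperties props;
  props.set_page_checksum_verification(true);
  try {
    auto reader = parquet::ParquetFileReader::OpenFile(
        "datapage_v1-corrupt-checksum.parquet", /*memory_map=*/false, props);
    // Forces the first data page (and its CRC check) to be read.
    reader->RowGroup(0)->Column(0)->HasNext();
  } catch (const parquet::ParquetException& e) {
    std::cout << "checksum mismatch detected: " << e.what() << std::endl;
    return 0;
  }
  std::cerr << "corrupt checksum was not detected" << std::endl;
  return 1;
}
```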
The patch is ready for review now, though the parquet-testing change should be merged first.
As https://github.com/apache/parquet-testing/pull/29 shows, I added 3 files there; if more cases are required, please tell me.
I also changed some calls to PageReader::Open in 6d83725, because they didn't pass the ReaderProperties to the PageReader, which would prevent the reader from checking the checksum.
@pitrou
@pitrou I've updated the parquet-testing submodule to the latest, and the pull request can be reviewed now.
Found the following (possibly) invalid URLs:
URL: https://arrow.apache.org/docs/r/articles/datasets.html
From: man/open_dataset.Rd
Status: 404
Message: Not Found
URL: https://arrow.apache.org/docs/r/articles/read_write.html
From: man/read_json_arrow.Rd
Status: 404
Message: Not Found
I found the CI for R failed; I don't think that's caused by this patch, because it just gets a 404. @pitrou mind taking a look?
Could you open a new GitHub issue for it? We don't need to fix it in this pull request.
Hi @kou, I created an issue here: https://github.com/apache/arrow/issues/14884
And the patch for ARROW-17904 can be reviewed now.
@pitrou Mind taking a look?
Hi, as https://github.com/apache/arrow/pull/14351#discussion_r1046105723 says, I'd like to keep this patch minimal, so only DATA_PAGE_V1 is checksummed here.
Currently, for DATA_PAGE_V2, I've left a TODO.
Would you be willing to submit a v2 PR later? It would be a pity if we supported v1 pages better than v2 pages.
Sure, it seems v2 is a bit trickier; it would need an accumulated CRC:
- set crc = 0
- if rep-levels exist: crc = checksum(rep-levels, 0)
- if def-levels exist: crc = checksum(def-levels, crc)
- if compressed data exists: crc = checksum(data, crc)
It's really a bit trickier, so doing it in the same patch would be complex. Let me finish it later. @pitrou
I don't think you need to accumulate CRC, you just have to run the CRC over the entire concatenated data exactly as it's written out on disk (which is rep-levels + def-levels + compressed data).
It would probably be easy to find out, in any case.
To be clear, I'm ok with doing only v1 here, but I would like to make sure we understand the complexity (or not) of handling v2 pages.
@pitrou You're right, I made a mistake and misunderstood the format here.
The data format on disk is just:
- Data page layout: repetition levels + definition levels + encoded values

So both calculating and verifying can share the same logic:
crc = checksum(data)
Previously I made the mistake of thinking DATA_PAGE_V2 stores the levels in the header. Sorry for that.
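Concretely, the one-shot and chained computations give the same result, so either approach works. A small sketch checks the equivalence, reusing the hypothetical chainable `Crc32` helper from the earlier sketch (not this patch's API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical chainable CRC-32 helper from the sketch earlier in the thread.
uint32_t Crc32(const void* data, size_t len, uint32_t crc);

int main() {
  const char levels[] = "rep-and-def-levels";
  const char values[] = "encoded-values";
  // CRC over the concatenation, exactly as the page is laid out on disk...
  char page[sizeof(levels) - 1 + sizeof(values) - 1];
  std::memcpy(page, levels, sizeof(levels) - 1);
  std::memcpy(page + sizeof(levels) - 1, values, sizeof(values) - 1);
  uint32_t whole = Crc32(page, sizeof(page), 0);
  // ...equals the CRC accumulated section by section.
  uint32_t chained =
      Crc32(values, sizeof(values) - 1, Crc32(levels, sizeof(levels) - 1, 0));
  assert(whole == chained);
  return 0;
}
```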
@pitrou Most comments are resolved; you can take a look.
DATA_PAGE_V2 will be added in the coming patch. And I have a question here: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323 . Should I implement CRC for DICT_PAGE?
After discussion on parquet-format, the checksums for DICT and DATA_PAGE_V2 will be implemented in coming patches.
I've tried my best to write more tests covering different cases, eliminate duplicate code, and keep the tests clean. But maybe I'm not good at writing tests; if you have any ideas, you can just edit it directly. @pitrou
I think I'll do the following adjustments later:
- check and clean up CRC32 implementation to follow the code style (should be mostly mechanical)
- since we determined that the checksum may apply to all page kinds, change the option names to not mention data pages :-)
It's OK; the CRC code is pasted from cyrus-imapd, but some of it was modified (it uses Arrow's internal utilities rather than cyrus-imapd's), so you can just diff the code.
Some optimizations like SIMD could be introduced later, perhaps via https://www.boost.org/doc/libs/1_73_0/doc/html/crc/crc_optimal.html . To be honest, I think maintaining a CRC implementation pasted from zstd or cyrus-imapd is really tricky, but maybe we can keep my implementation first; at least it's correct :)
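For what it's worth, Boost's predefined `boost::crc_32_type` (a `crc_optimal` instantiation with the standard CRC-32 parameters) computes the same checksum, so it could at least serve as a cross-check against the vendored code; a minimal sketch:

```cpp
#include <cstdint>
#include <cstdio>

#include <boost/crc.hpp>

// boost::crc_32_type is crc_optimal<32, 0x04C11DB7, 0xFFFFFFFF, 0xFFFFFFFF,
// true, true>, i.e. the same standard CRC-32 used for Parquet page checksums.
int main() {
  const char data[] = "123456789";
  boost::crc_32_type crc;
  crc.process_bytes(data, sizeof(data) - 1);
  // The well-known check value for "123456789" is 0xCBF43926.
  std::printf("0x%08X\n", static_cast<unsigned>(crc.checksum()));
  return 0;
}
```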
As for the option names, it seems parquet-mr uses PageWriteChecksum and PageVerifyChecksum; I think those would be better.
Thank you for your help and your patience.
@pitrou ping