ARROW-17904: [C++][WIP] Parquet: Implement CRC in reading and writing pages for DATA_PAGE (v1)

Open mapleFU opened this issue 3 years ago • 7 comments

This patch adds CRC computation and verification when writing and reading DATA_PAGE. CRC support for dictionary pages and DATA_PAGE_V2 will be added in coming patches.

  • [x] Implement CRC in writing DATA_PAGE
  • [x] Implement CRC in reading DATA_PAGE
  • [x] Add config options for writing and verifying page CRCs
  • [ ] Test DATA_PAGE with CRC; the tests may be borrowed from parquet-mr
  • [x] Use the CRC library mentioned in https://issues.apache.org/jira/browse/ARROW-17904

There are also some questions. I found that Arrow already imports crc32c in thirdparty, which is extracted from LevelDB's CRC library. But it seems our standard uses CRC-32, which has a different magic number (polynomial). So I vendored the implementation mentioned in https://issues.apache.org/jira/browse/ARROW-17904 .
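
For reference, a minimal sketch of the standard (gzip/zlib-style) CRC-32 that the Parquet spec calls for, assuming the reflected polynomial 0xEDB88320 is the "magic number" in question; CRC-32C uses 0x82F63B78 instead:

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise (slow but tiny) CRC-32 with the reflected polynomial 0xEDB88320,
// as used by gzip/zlib and required by the Parquet spec. The vendored
// crc32c library uses the Castagnoli polynomial 0x82F63B78 instead, which
// is why its results differ. Passing the previous result as `crc` chains
// calls, exactly like zlib's streaming crc32().
uint32_t Crc32(const uint8_t* data, size_t len, uint32_t crc = 0) {
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0u);
    }
  }
  return ~crc;
}
```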

The default config in parquet-mr enables CRC for the writer, but here I default it to false, because enabling it may slow down the writer.
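
As an illustration, toggling this from user code might look like the sketch below; the option names (enable_page_checksum, set_page_checksum_verification) are assumptions based on the naming discussion later in this thread, not necessarily the final API:

```cpp
#include <memory>

#include "parquet/properties.h"

// Sketch: opting in to CRC computation on write and verification on read.
// Both default to off, so the cost is only paid when requested.
std::shared_ptr<parquet::WriterProperties> MakeWriterProps() {
  parquet::WriterProperties::Builder builder;
  builder.enable_page_checksum();  // assumed option name
  return builder.build();
}

parquet::ReaderProperties MakeReaderProps() {
  parquet::ReaderProperties props;
  props.set_page_checksum_verification(true);  // assumed option name
  return props;
}
```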

mapleFU avatar Oct 08 '22 06:10 mapleFU

https://issues.apache.org/jira/browse/ARROW-17904

github-actions[bot] avatar Oct 08 '22 06:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Oct 08 '22 06:10 github-actions[bot]

Thanks for the PR @mapleFU. I haven't looked at it in detail yet.

IMHO the testing approach should be to add reference files in https://github.com/apache/parquet-testing/tree/master/data . Ideally these files should cover the following feature matrix:

  • correct and incorrect CRCs
  • v1 and v2 data pages
  • compressed and uncompressed columns

This would probably be spread over two files, one with correct CRCs, one with incorrect CRCs. What do you think?
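
To make that concrete, a hypothetical C++ test against such a reference file might look like this; the file name and the checksum-verification option are illustrative assumptions:

```cpp
#include <gtest/gtest.h>

#include "parquet/exception.h"
#include "parquet/file_reader.h"
#include "parquet/properties.h"

// Hypothetical reference-file test: a file whose page CRCs were deliberately
// corrupted should raise an error when verification is enabled. The file
// name and option name below are assumptions for illustration.
TEST(PageChecksums, CorruptCrcIsDetected) {
  parquet::ReaderProperties props;
  props.set_page_checksum_verification(true);  // assumed option name

  auto reader = parquet::ParquetFileReader::OpenFile(
      "data/datapage_v1-corrupt-checksum.parquet", /*memory_map=*/false,
      props);
  // Reading the first page should trip the CRC check.
  EXPECT_THROW(reader->RowGroup(0)->Column(0)->HasNext(),
               parquet::ParquetException);
}
```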

pitrou avatar Oct 22 '22 15:10 pitrou

I'd like to follow the testing here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java @pitrou

mapleFU avatar Oct 22 '22 15:10 mapleFU

I'd like to follow the testing here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java

This actually seems more complicated to implement (you have to replicate details of the file format in the tests). Also, without reference files we wouldn't test that different implementations actually agree on the CRC computation.

pitrou avatar Oct 22 '22 15:10 pitrou

This actually seems more complicated to implement (you have to replicate details of the file format in the tests). Also, without reference files we wouldn't test that different implementations actually agree on the CRC computation.

Sure, let me generate some data files in parquet-mr following TestDataPageV1Checksums first.

mapleFU avatar Oct 22 '22 15:10 mapleFU

@pitrou @kou Hi, should I put the CRC code under vendored or util?

mapleFU avatar Oct 28 '22 09:10 mapleFU

Since the code is simple and the algorithm well-known, my vote is to own it and keep it in util.

pitrou avatar Oct 29 '22 14:10 pitrou

I like cpp/src/arrow/vendored/ but it's not a strong opinion. So I'm OK with util.

kou avatar Oct 29 '22 22:10 kou

@mapleFU Is this still WIP?

pitrou avatar Dec 01 '22 16:12 pitrou

@pitrou Hi, I tried to add some unit tests and they pass.

Now I have generated some files from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java ; should I first commit them to the parquet-testing project, and then test against them in C++?

By the way, I'm not familiar with parquet-mr; where does it test against parquet-testing? If it has some Parquet CRC tests there, I can just follow them. /cc @wgtmac

mapleFU avatar Dec 04 '22 09:12 mapleFU

@pitrou Hi, I tried to add some unit tests and they pass.

Now I have generated some files from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java ; should I first commit them to the parquet-testing project, and then test against them in C++?

By the way, I'm not familiar with parquet-mr; where does it test against parquet-testing? If it has some Parquet CRC tests there, I can just follow them. /cc @wgtmac

I agree that we can add some test files generated by arrow and parquet-mr to the parquet-testing submodule, and then add inter-op tests in both the arrow and parquet-mr repos to do cross validation. FYI, I have added an inter-op test for LZ4_RAW to parquet-mr: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestInteropReadLz4RawCodec.java

wgtmac avatar Dec 04 '22 13:12 wgtmac

The patch is ready for review now, though the parquet-testing change should be merged first.

As https://github.com/apache/parquet-testing/pull/29 says, I added 3 files there; if more cases are required, please tell me.

I also changed some calls to PageReader::Open in 6d83725, because they didn't pass the ReaderProperties to PageReader, which would make the reader skip checksum verification.
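
For context, the call-site pattern being fixed looks roughly like the sketch below; the signature is approximated for illustration, but the culprit is the defaulted properties argument:

```cpp
// PageReader::Open defaults its ReaderProperties argument, so call sites
// that omit it silently drop the user's checksum-verification setting.
// (Signature approximated for illustration.)

// Before: properties defaulted, so the verify-checksum config is lost.
//   auto pager = PageReader::Open(stream, total_num_rows, codec);

// After: the user-supplied ReaderProperties is threaded through.
//   auto pager = PageReader::Open(stream, total_num_rows, codec, properties);
```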

@pitrou

mapleFU avatar Dec 04 '22 13:12 mapleFU

@pitrou I've updated the parquet-testing submodule to the latest, and the pull request can be reviewed now.

mapleFU avatar Dec 06 '22 01:12 mapleFU

  Found the following (possibly) invalid URLs:
    URL: https://arrow.apache.org/docs/r/articles/datasets.html
      From: man/open_dataset.Rd
      Status: 404
      Message: Not Found
    URL: https://arrow.apache.org/docs/r/articles/read_write.html
      From: man/read_json_arrow.Rd
      Status: 404
      Message: Not Found

I found that the CI for R failed; I think that is not caused by this patch, because it just gets 404s. @pitrou mind taking a look?

mapleFU avatar Dec 07 '22 02:12 mapleFU

Could you open a new GitHub issue for it? We don't need to fix it in this pull request.

kou avatar Dec 07 '22 20:12 kou

Could you open a new GitHub issue for it? We don't need to fix it in this pull request.

Hi kou, I created an issue here: https://github.com/apache/arrow/issues/14884 @kou

And the ARROW-17904 patch can be reviewed now.

mapleFU avatar Dec 08 '22 02:12 mapleFU

@pitrou Mind taking a look?

mapleFU avatar Dec 12 '22 15:12 mapleFU

Hi, as https://github.com/apache/arrow/pull/14351#discussion_r1046105723 says, I'd like to keep this patch minimal, so only DATA_PAGE_V1 is checksummed here.

Currently, for DATA_PAGE_V2, I have left a TODO.

mapleFU avatar Dec 13 '22 06:12 mapleFU

Hi, as #14351 (comment) says, I'd like to keep this patch minimal, so only DATA_PAGE_V1 is checksummed here.

Would you be willing to submit a v2 PR later? It would be a pity if we supported v1 pages better than v2 pages.

pitrou avatar Dec 13 '22 09:12 pitrou

Would you be willing to submit a v2 PR later? It would be a pity if we supported v1 pages better than v2 pages.

Sure. It seems that v2 is a bit trickier; it would need an accumulated CRC:

  1. set crc = 0
  2. if rep-levels exist, do crc = checksum(rep-levels, 0)
  3. if def-levels exist, do crc = checksum(def-levels, crc)
  4. if compressed data exists, do crc = checksum(data, crc)

It's really a bit trickier here, so I think doing it in the same patch would be complex. Let me finish it later @pitrou

mapleFU avatar Dec 13 '22 09:12 mapleFU

I don't think you need to accumulate CRC; you just have to run the CRC over the entire concatenated data exactly as it's written out on disk (which is rep-levels + def-levels + compressed data).

It would probably be easy to find out, in any case.

pitrou avatar Dec 13 '22 09:12 pitrou

To be clear, I'm ok with doing only v1 here, but I would like to make sure we understand the complexity (or not) of handling v2 pages.

pitrou avatar Dec 13 '22 09:12 pitrou

It would probably be easy to find out, in any case.

@pitrou You're right, I made a mistake and misunderstood the format here.

The data format in memory is just:

- Data page layout: Repetition Levels - Definition Levels - encoded values.

So both calculating and verifying should share the same logic:

crc = checksum(data)

Previously I mistakenly thought that DATA_PAGE_V2 stores the levels in the header. Sorry for that.
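
A minimal sketch of that shared logic for a v2 page body, assuming a chaining Crc32() like the one sketched earlier (the buffer names are illustrative, not Arrow's actual internals); chaining over the three buffers is equivalent to one pass over their concatenation:

```cpp
#include <cstddef>
#include <cstdint>

// Chaining Crc32 as sketched earlier: passing the previous result as the
// seed is equivalent to one pass over the concatenated bytes.
uint32_t Crc32(const uint8_t* data, size_t len, uint32_t crc = 0);

// Sketch: DATA_PAGE_V2 checksum over the page body exactly as serialized,
// i.e. rep-levels, then def-levels, then the (possibly compressed) values.
uint32_t PageV2Checksum(const uint8_t* rep_levels, size_t rep_len,
                        const uint8_t* def_levels, size_t def_len,
                        const uint8_t* values, size_t values_len) {
  uint32_t crc = Crc32(rep_levels, rep_len);
  crc = Crc32(def_levels, def_len, crc);
  return Crc32(values, values_len, crc);
}
```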

mapleFU avatar Dec 13 '22 10:12 mapleFU

@pitrou Most comments are resolved; you can take a look.

DATA_PAGE_V2 will be added in the coming patch. And I have a question here: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323 . Should I implement CRC for DICT_PAGE?

mapleFU avatar Dec 13 '22 11:12 mapleFU

After the discussion on parquet-format, the checksums for DICT and DATA_PAGE_V2 pages will be implemented in coming patches.

mapleFU avatar Dec 14 '22 02:12 mapleFU

I've tried my best to write more tests covering different cases, eliminate duplicated code, and keep the tests clean. But maybe I'm not good at writing tests; if you have any ideas, you can just edit it directly. @pitrou

mapleFU avatar Dec 14 '22 17:12 mapleFU

I think I'll do the following adjustments later:

  • check and clean up CRC32 implementation to follow the code style (should be mostly mechanical)
  • since we determined that the checksum may apply to all page kinds, change the option names to not mention data pages :-)

pitrou avatar Dec 15 '22 17:12 pitrou

  • check and clean up CRC32 implementation to follow the code style (should be mostly mechanical)
  • since we determined that the checksum may apply to all page kinds, change the option names to not mention data pages :-)

It's OK. The CRC code is pasted from cyrus-imapd, but some of it is modified (using arrow's internal library rather than cyrus-imapd's); you can just diff the code.

Some optimizations like SIMD could be introduced, perhaps https://www.boost.org/doc/libs/1_73_0/doc/html/crc/crc_optimal.html . To be honest, I think maintaining a CRC implementation pasted from zstd or cyrus-imapd is really tricky. But maybe we can keep my implementation first; at least it's correct :)
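
For what it's worth, a common middle ground before SIMD is the byte-at-a-time table-driven form, sketched here under the same polynomial assumption as before:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build the 256-entry lookup table for the reflected polynomial 0xEDB88320.
std::array<uint32_t, 256> MakeCrc32Table() {
  std::array<uint32_t, 256> table{};
  for (uint32_t i = 0; i < 256; ++i) {
    uint32_t c = i;
    for (int k = 0; k < 8; ++k) {
      c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0u);
    }
    table[i] = c;
  }
  return table;
}

// One table lookup per input byte instead of eight bit-by-bit steps.
uint32_t Crc32Table(const uint8_t* data, size_t len, uint32_t crc = 0) {
  static const std::array<uint32_t, 256> kTable = MakeCrc32Table();
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc = (crc >> 8) ^ kTable[(crc ^ data[i]) & 0xFFu];
  }
  return ~crc;
}
```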

As for the variable names, it seems that parquet-mr uses PageWriteChecksum and PageVerifyChecksum; I think those would be better.

Thank you for your help and your patience.

mapleFU avatar Dec 15 '22 18:12 mapleFU

@pitrou ping

mapleFU avatar Jan 04 '23 02:01 mapleFU