cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Truncate parquet column indexes

Open etseidl opened this issue 3 years ago • 7 comments

Description

Adds truncation of the min and max values of the Parquet column indexes, as recommended by the Parquet format. Adds a parameter column_index_truncate_length to the writer options/builder. It currently defaults to 64, which is the default used by parquet-mr.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

etseidl avatar Jul 29 '22 19:07 etseidl

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

GPUtester avatar Jul 29 '22 19:07 GPUtester

add to allowlist

PointKernel avatar Jul 29 '22 23:07 PointKernel

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.10@d86bb39). Click here to learn what that means. The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11403   +/-   ##
===============================================
  Coverage                ?   86.36%           
===============================================
  Files                   ?      145           
  Lines                   ?    22950           
  Branches                ?        0           
===============================================
  Hits                    ?    19821           
  Misses                  ?     3129           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Aug 02 '22 20:08 codecov[bot]

@hyperbolic2346 the merge race is on again! #11424 is going to break me here :) Since you're all up in the stats space, could you take a peek, esp at the TODO comment above page_enc.cu:truncate_string()?

etseidl avatar Aug 03 '22 18:08 etseidl

rerun tests

etseidl avatar Aug 03 '22 20:08 etseidl

I think we're getting there. I'll make one more pass over the whole thing tomorrow.

Thanks for all the cycles you're spending on this!

So it occurred to me that I can overcome my std::byte fear and loathing with a few well placed operator overloads.

constexpr std::byte operator| (std::byte lhs, uint8_t rhs) noexcept
{
  return static_cast<std::byte>(static_cast<uint8_t>(lhs) | rhs);
}

constexpr std::byte operator& (std::byte lhs, uint8_t rhs) noexcept
{
  return static_cast<std::byte>(static_cast<uint8_t>(lhs) & rhs);
}

constexpr bool operator==(std::byte lhs, uint8_t rhs) noexcept
{
  return lhs == std::byte{rhs};
}

constexpr bool operator!=(std::byte lhs, uint8_t rhs) noexcept
{
  return lhs != std::byte{rhs};
}

and suddenly it's not so bad. Beyond scope for now, but food for thought. Can't wait for operator<=>

etseidl avatar Aug 04 '22 23:08 etseidl

and suddenly it's not so bad. Beyond scope for now, but food for thought.

That really isn't so bad, but at the end of the day you have to wonder what we gained outside of making someone happy that std::byte was being used.

hyperbolic2346 avatar Aug 05 '22 00:08 hyperbolic2346

@gpucibot merge

PointKernel avatar Aug 22 '22 21:08 PointKernel