iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Core: Add KLL Datasketch and Hive ColumnStatisticsObj as standard blo…

Open simhadri-g opened this issue 2 years ago • 12 comments

…b types to puffin file

Issue: https://github.com/apache/iceberg/issues/8198

Hi Everyone,

Hive now supports writing column statistics to puffin files.

The statistics calculated by Hive include histograms, NDV (Number of Distinct Values), Min and Max values, the number of nulls, the number of true values, column name, and column type. You can find the full list of supported stats here: Link to GitHub.

These statistics are stored as a Hive columnStatistics object, which is serialized and saved as a blob in puffin. You can refer to the code here for more information: Link to GitHub.

Currently, this object is supported by Hive and partially by Impala as well: Link to GitHub. We also plan to incorporate the KLL datasketch for histograms.

As a result, we are looking to add columnStatistics object and KLL datasketch as standard blob types for the puffin file. Link to GitHub

Any feedback would be greatly appreciated.

Thanks!

simhadri-g avatar Aug 01 '23 10:08 simhadri-g

Thanks a lot for the review! :)

simhadri-g avatar Aug 01 '23 16:08 simhadri-g

@findepi I have addressed the review comments.

Can you please have a look at the PR when you are free? Thanks in advance! :)

simhadri-g avatar Aug 03 '23 13:08 simhadri-g

Hello @findepi @nastra , I would greatly appreciate it if you could find some time to review the pull request. Thanks!

simhadri-g avatar Aug 16 '23 07:08 simhadri-g

Thanks for the review :) I will update the PR and get back.

simhadri-g avatar Aug 17 '23 09:08 simhadri-g

@nastra , I have updated the PR and moved the iceberg-docs(https://github.com/apache/iceberg-docs/pull/269) changes to the main repo.

Please have a look when you are free.

Thanks again for the review!

simhadri-g avatar Aug 17 '23 09:08 simhadri-g

thanks @nastra for the review! :)

@findepi Could you please have a look as well? Thanks!

simhadri-g avatar Aug 18 '23 19:08 simhadri-g

Any progress update here? It would be great to get the blessing from the Iceberg community for these as supported puffin blob types

ZacBlanco avatar Dec 26 '23 16:12 ZacBlanco

@findepi @danielcweeks could one of you PTAL this? Don't want this effort to get lost.

bitsondatadev avatar Dec 29 '23 14:12 bitsondatadev

@simhadri-g update I'm following up on this today.

bitsondatadev avatar Feb 01 '24 11:02 bitsondatadev

thanks @bitsondatadev !

simhadri-g avatar Feb 01 '24 11:02 simhadri-g

Any update? hope we can continue to push this review forward. Thanks.

zhangbutao avatar Mar 12 '24 14:03 zhangbutao

Hi everyone, I would be most grateful if we could get help with reviewing this. Thanks!

simhadri-g avatar Jun 28 '24 11:06 simhadri-g

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Sep 10 '24 00:09 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Sep 18 '24 00:09 github-actions[bot]

hi @simhadri-g, would you mind reopening this PR?

deniskuzZ avatar Feb 03 '25 14:02 deniskuzZ

Hi Denys! I think , I don't have the github permissions to reopen a closed PR in iceberg repo. Should i raise a new PR?

simhadri-g avatar Feb 04 '25 07:02 simhadri-g

Hi Denys! I think , I don't have the github permissions to reopen a closed PR in iceberg repo. Should i raise a new PR?

hey @simhadri-g, if you'd like to pursue this I would recommend opening a new one, but split into 2 parts: Hive ColumnStatistics and KLL Datasketch

deniskuzZ avatar Feb 04 '25 07:02 deniskuzZ

thanks @nastra! It would really help if we could get this in.

deniskuzZ avatar Feb 04 '25 07:02 deniskuzZ

thanks @nastra and @deniskuzZ !

I will resolve the merge conflicts and update this PR. :)

simhadri-g avatar Feb 04 '25 07:02 simhadri-g

@findepi Could you please have a look as well?

nastra avatar Feb 04 '25 11:02 nastra

I would agree with @rdblue's comment that there's a lot of stats overlap with what exists elsewhere and I'm not convinced this serialized format is standardized well enough to be used easily outside of the Hive/Impala projects.

Could we identify the specific stats that we feel are valuable and focus on those independently of what is captured elsewhere?

danielcweeks avatar Feb 04 '25 23:02 danielcweeks

I would agree with @rdblue's comment that there's a lot of stats overlap with what exists elsewhere and I'm not convinced this serialized format is standardized well enough to be used easily outside of the Hive/Impala projects.

Could we identify the specific stats that we feel are valuable and focus on those independently of what is captured elsewhere?

Hi @danielcweeks, Thanks for your reply. I think we might have communicated our proposal not very clearly and had limited knowledge of that area. I've drafted a doc with the proposal: https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l83bA4IRsfw3AAyR8wokNe1r80/edit?usp=sharing Hopefully, it would help to understand our intent. Could you please take a look? Thank you!

deniskuzZ avatar Feb 05 '25 13:02 deniskuzZ

hi @nastra, could you please help push this change forward? I've updated the PR to include the KLL sketch only.

deniskuzZ avatar Mar 11 '25 08:03 deniskuzZ

@deniskuzZ this is a spec change and needs to go through a DISCUSS & VOTE thread on the mailing list

nastra avatar Mar 11 '25 08:03 nastra

@deniskuzZ this is a spec change and needs to go through a DISCUSS & VOTE thread on the mailing list

This was a thread: https://lists.apache.org/thread/ws63loz1snsrwtdl7f9yqgr16vncx6w0 I don't know what else I should do other than just abandon the PR. It looks like this change generated some interest until the recent comments, which I hope have been addressed, but I’m not sure since I haven't received any feedback.

deniskuzZ avatar Mar 11 '25 09:03 deniskuzZ

@deniskuzZ you might just want to follow-up on the dev list to revive the discussion. Once it's clear in which direction the proposal is going and what Spec changes are required, you'd go ahead and create a VOTE thread for the Spec changes (this PR) where people can vote on the changes

nastra avatar Mar 11 '25 09:03 nastra