iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Orc: Support row group bloom filters

Open a49a opened this issue 2 years ago • 1 comments

Issue link: https://github.com/apache/iceberg/issues/5218

Add write.orc.bloom.filter.columns and write.orc.bloom.filter.fpp options. Enable these options in ORC class.

a49a avatar Jul 20 '22 09:07 a49a

cc @rdblue @RussellSpitzer @huaxingao @kbendick Please review it, thanks a lot.

a49a avatar Aug 05 '22 02:08 a49a

For the test, shall we also check the read path to make sure bloom filters are there?

huaxingao avatar Aug 19 '22 06:08 huaxingao

For the test, shall we also check the read path to make sure bloom filters are there?

I find the ORC SDK will auto recognize the bloom filter in the following code. Is it necessary to test the logic of ORC itself? IMO, we don't have to test it. There have been some tests in the ORC project. @huaxingao

https://github.com/apache/orc/blob/a49f87d492aa62ec2be2d4bce2fcfe1f53ca05d9/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L890-L912

a49a avatar Aug 22 '22 06:08 a49a

LGTM. Thanks a lot @deadwind4 for working on this! cc @kbendick @rdblue

huaxingao avatar Aug 26 '22 05:08 huaxingao

Thank a lot @huaxingao @kbendick for reviewing this.

a49a avatar Aug 26 '22 06:08 a49a

@kbendick Could we merge it?

a49a avatar Sep 26 '22 07:09 a49a

@deadwind4: I understand, that the ORC supports these filters and tests the functions, but I would still like to see some tests validating, that the filters are there. I could envision a situation where ORC changes the way to support bloom filters, and our tests fails to recognize this. If, and only if, there is an easy way to check it, then it would be good to have a test here as well to validate that the filters are created.

Thanks, Peter

pvary avatar Sep 28 '22 05:09 pvary

@pvary Thank you for your feedback. I have added a test case to validate bloom filters in ORC files and push a new commit. @kbendick @pvary Please review this. Thanks a lot.

a49a avatar Oct 19 '22 08:10 a49a

@deadwind4: Could you please check the failed tests.

Thanks, Peter

pvary avatar Oct 20 '22 07:10 pvary

@pvary All checks have passed after I rebased the code and reran the CI.

a49a avatar Oct 20 '22 09:10 a49a

Thanks @deadwind4 for the PR and @kbendick for the review!

pvary avatar Oct 20 '22 14:10 pvary

ORC_BLOOM_FILTER_COLUMNS this property will work on spark? when i set write.orc.bloom.filter.columns=xx and used spark to write data, i found that bloomfilter had no effect on querying.

chenwyi2 avatar Jul 19 '23 06:07 chenwyi2