iceberg
Orc: Support row group bloom filters
Issue link: https://github.com/apache/iceberg/issues/5218
Add write.orc.bloom.filter.columns and write.orc.bloom.filter.fpp options.
Enable these options in the ORC class.
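For illustration only, here is a minimal sketch of how the two table properties might be translated into ORC writer settings. The target keys `orc.bloom.filter.columns` and `orc.bloom.filter.fpp` are assumed to be the ORC configuration names that the mapping ends up on, and the class `OrcBloomFilterProps` and its helper `toOrcConf` are hypothetical names, not code from this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OrcBloomFilterProps {
  // Iceberg table property names, as given in this PR.
  static final String ICEBERG_COLUMNS = "write.orc.bloom.filter.columns";
  static final String ICEBERG_FPP = "write.orc.bloom.filter.fpp";

  // ORC writer configuration keys (assumption: these are the mapping targets).
  static final String ORC_COLUMNS = "orc.bloom.filter.columns";
  static final String ORC_FPP = "orc.bloom.filter.fpp";

  private OrcBloomFilterProps() {}

  // Hypothetical helper: copies the bloom filter settings, when present,
  // from Iceberg table properties to the ORC configuration keys.
  static Map<String, String> toOrcConf(Map<String, String> tableProps) {
    Map<String, String> orcConf = new LinkedHashMap<>();

    String columns = tableProps.get(ICEBERG_COLUMNS);
    if (columns != null && !columns.trim().isEmpty()) {
      orcConf.put(ORC_COLUMNS, columns.trim());
    }

    String fpp = tableProps.get(ICEBERG_FPP);
    if (fpp != null) {
      // A false positive probability only makes sense strictly between 0 and 1.
      double value = Double.parseDouble(fpp.trim());
      if (value <= 0.0 || value >= 1.0) {
        throw new IllegalArgumentException("fpp must be in (0, 1): " + fpp);
      }
      orcConf.put(ORC_FPP, fpp.trim());
    }

    return orcConf;
  }

  public static void main(String[] args) {
    Map<String, String> tableProps = new LinkedHashMap<>();
    tableProps.put(ICEBERG_COLUMNS, "id,name");
    tableProps.put(ICEBERG_FPP, "0.01");
    System.out.println(toOrcConf(tableProps));
  }
}
```

Properties that are not set are simply left out of the result, so the writer would fall back to whatever defaults ORC applies.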
cc @rdblue @RussellSpitzer @huaxingao @kbendick Please review it, thanks a lot.
For the test, shall we also check the read path to make sure bloom filters are there?
I found that the ORC SDK automatically recognizes bloom filters in the following code. Is it necessary to test the logic of ORC itself? IMO, we don't have to test it. The ORC project already has tests for this. @huaxingao
https://github.com/apache/orc/blob/a49f87d492aa62ec2be2d4bce2fcfe1f53ca05d9/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L890-L912
LGTM. Thanks a lot @deadwind4 for working on this! cc @kbendick @rdblue
Thanks a lot @huaxingao @kbendick for reviewing this.
@kbendick Could we merge it?
@deadwind4: I understand that ORC supports these filters and tests the functions, but I would still like to see some tests validating that the filters are there. I could envision a situation where ORC changes the way it supports bloom filters and our tests fail to recognize this. If, and only if, there is an easy way to check it, then it would be good to have a test here as well to validate that the filters are created.
Thanks, Peter
@pvary Thank you for your feedback. I have added a test case to validate bloom filters in ORC files and pushed a new commit. @kbendick @pvary Please review this. Thanks a lot.
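As a rough illustration of the property such a test relies on, the sketch below builds a toy bloom filter sized from a false positive probability (fpp) and checks that every written value is reported as possibly present. This is not ORC's implementation and not the test added in this PR: `SimpleBloomFilter`, its hash functions, and the sizing shown are assumptions chosen for the example, using the textbook formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Toy bloom filter for illustration; ORC uses its own implementation.
final class SimpleBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  SimpleBloomFilter(int expectedEntries, double fpp) {
    if (expectedEntries <= 0 || fpp <= 0.0 || fpp >= 1.0) {
      throw new IllegalArgumentException("need expectedEntries > 0 and 0 < fpp < 1");
    }
    double ln2 = Math.log(2);
    // Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
    this.numBits = Math.max(1, (int) Math.ceil(-expectedEntries * Math.log(fpp) / (ln2 * ln2)));
    this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedEntries * ln2));
    this.bits = new BitSet(numBits);
  }

  void add(String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(bytes, i));
    }
  }

  // Returns false only if the value was definitely never added.
  boolean mightContain(String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(bytes, i))) {
        return false;
      }
    }
    return true;
  }

  int numBits() {
    return numBits;
  }

  int numHashes() {
    return numHashes;
  }

  // Double hashing: derive the i-th bit position from two base hashes.
  private int index(byte[] bytes, int i) {
    int h1 = Arrays.hashCode(bytes);
    int h2 = fnv1a(bytes);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  // 32-bit FNV-1a hash, used here only as a second, independent-looking hash.
  private static int fnv1a(byte[] data) {
    int hash = 0x811c9dc5;
    for (byte b : data) {
      hash ^= (b & 0xff);
      hash *= 16777619;
    }
    return hash;
  }

  public static void main(String[] args) {
    SimpleBloomFilter filter = new SimpleBloomFilter(1000, 0.05);
    for (int i = 0; i < 1000; i++) {
      filter.add("key-" + i);
    }
    // A bloom filter never produces false negatives for values that were added.
    for (int i = 0; i < 1000; i++) {
      if (!filter.mightContain("key-" + i)) {
        throw new AssertionError("false negative for key-" + i);
      }
    }
    System.out.println("bits=" + filter.numBits() + ", hashes=" + filter.numHashes());
  }
}
```

A lower fpp produces a larger filter, which is the trade-off the write.orc.bloom.filter.fpp option exposes. A test against real ORC files would instead read the bloom filter index back through the ORC reader.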
@deadwind4: Could you please check the failed tests?
Thanks, Peter
@pvary All checks have passed after I rebased the code and reran the CI.
Thanks @deadwind4 for the PR and @kbendick for the review!
Does the ORC_BLOOM_FILTER_COLUMNS property work with Spark? When I set write.orc.bloom.filter.columns=xx and used Spark to write data, I found that the bloom filter had no effect on queries.