
Core, Spark: Fix delete with filter on nested columns

Open zhongyujiang opened this issue 2 years ago • 4 comments

Fixes #7065.

This fixes Spark deletes that filter on nested columns. Such operations currently fail: Spark calls canDeleteUsingMetadata, which uses StrictMetricsEvaluator to decide whether a file can be deleted entirely, but StrictMetricsEvaluator does not support evaluating nested columns and throws an NPE (see #7065).

This updates StrictMetricsEvaluator to support evaluation on nested columns, which solves the problem. Only columns nested in a chain of struct fields are supported; for columns nested in map or list fields, the evaluator returns ROWS_MIGHT_NOT_MATCH.
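
To illustrate the rule described above, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Iceberg's classes: `StrictNestedSketch`, `PathStep`, `Bounds`, and `eq` are hypothetical names, metrics are keyed by dotted column name rather than field ID, and null counts are ignored for brevity. It only shows the intended behavior: a strict equality check may claim that all rows match when every ancestor of the column is a struct, and it falls back to ROWS_MIGHT_NOT_MATCH for map or list ancestors or when metrics are missing, instead of throwing an NPE.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical, simplified model of strict evaluation on a nested column.
final class StrictNestedSketch {
  enum Kind { STRUCT, MAP, LIST, PRIMITIVE }

  // One step on the path from the table root to the leaf column.
  static final class PathStep {
    final String name;
    final Kind kind;

    PathStep(String name, Kind kind) {
      this.name = name;
      this.kind = kind;
    }
  }

  // Per-file lower and upper bounds for one leaf column (stand-in for file metrics).
  static final class Bounds {
    final int lower;
    final int upper;

    Bounds(int lower, int upper) {
      this.lower = lower;
      this.upper = upper;
    }
  }

  static final boolean ROWS_MUST_MATCH = true;
  static final boolean ROWS_MIGHT_NOT_MATCH = false;

  // Strict check for "column == value": true only if every row in the file must match.
  static boolean eq(List<PathStep> path, Map<String, Bounds> metrics, int value) {
    if (path.isEmpty()) {
      return ROWS_MIGHT_NOT_MATCH;
    }

    // Every ancestor of the leaf must be a struct; maps and lists are not supported.
    for (PathStep step : path.subList(0, path.size() - 1)) {
      if (step.kind != Kind.STRUCT) {
        return ROWS_MIGHT_NOT_MATCH;
      }
    }

    String fullName = path.stream().map(s -> s.name).collect(Collectors.joining("."));
    Bounds bounds = metrics.get(fullName);
    if (bounds == null) {
      // Missing metrics: answer conservatively rather than dereferencing null.
      return ROWS_MIGHT_NOT_MATCH;
    }

    // All values equal `value` only if both bounds equal it.
    return (bounds.lower == value && bounds.upper == value)
        ? ROWS_MUST_MATCH
        : ROWS_MIGHT_NOT_MATCH;
  }
}
```

In this model, a file whose `location.zip` bounds are both 7 can be dropped wholesale for the filter `location.zip = 7`, while the same column reached through a list or map ancestor is never treated as fully matching.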

zhongyujiang avatar Mar 17 '23 15:03 zhongyujiang

@aokolnychyi @rdblue can you help review this?

zhongyujiang avatar Mar 17 '23 16:03 zhongyujiang

PTAL @rdblue @RussellSpitzer @aokolnychyi @szehon-ho

bluzy avatar Dec 28 '23 02:12 bluzy

would love to see it merged

eshishki avatar Jul 07 '24 12:07 eshishki

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Aug 28 '24 00:08 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Sep 05 '24 00:09 github-actions[bot]

This issue is still present in Spark 3.5, and fixing it would be a big capability to have for data that is all in structured format.

blakewhatley82 avatar Sep 23 '24 08:09 blakewhatley82

Agreed. Can this be revived, @szehon-ho? Are you able to re-open it, @zhongyujiang?

mdub avatar Oct 03 '24 03:10 mdub

@blakewhatley82 @mdub I think this fix is incorrect because the null counts of nested columns in metadata may currently be wrong; see #8611. I am not able to reopen this PR, so I've created a new PR, #11261, with a different approach to address this issue.

zhongyujiang avatar Oct 05 '24 08:10 zhongyujiang

Fixed by #11261.

zhongyujiang avatar Oct 14 '24 07:10 zhongyujiang