Core, Spark: Fix delete with filter on nested columns
Fixes #7065.
This fixes Spark DELETE operations that use a filter on nested columns. Currently such operations fail because Spark calls canDeleteUsingMetadata, which uses StrictMetricsEvaluator to decide whether a file can be deleted entirely; StrictMetricsEvaluator does not yet support evaluating expressions on nested columns, so a NullPointerException is thrown. See #7065.
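For context, a minimal reproduction sketch. The catalog, table, and column names here are hypothetical, not taken from the issue; the failure mode is the NPE described above:

```java
import org.apache.spark.sql.SparkSession;

public class NestedDeleteRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Hypothetical Iceberg table with a column nested inside a struct.
    spark.sql(
        "CREATE TABLE db.events (id BIGINT, payload STRUCT<kind: STRING, size: INT>) "
            + "USING iceberg");

    // Before this change, planning this delete goes through canDeleteUsingMetadata,
    // whose StrictMetricsEvaluator throws a NullPointerException on payload.kind.
    spark.sql("DELETE FROM db.events WHERE payload.kind = 'obsolete'");
  }
}
```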
This updates StrictMetricsEvaluator to support evaluation on nested columns (only for columns nested in a chain of struct fields; it returns ROWS_MIGHT_NOT_MATCH for columns nested in map or list fields), which solves the problem.
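Conceptually, the change hinges on how a referenced field is resolved before its file-level metrics are consulted. A minimal sketch of that resolution logic, assuming the Iceberg schema API; the class and method names below are illustrative and not the actual patch:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

class NestedFieldCheck {
  /**
   * Returns true if the field with the given id is reachable from the schema root
   * through struct fields only. Fields nested inside maps or lists have per-element
   * semantics, so file-level metrics cannot prove that every row matches; for those
   * the evaluator should answer ROWS_MIGHT_NOT_MATCH.
   */
  static boolean isNestedInStructsOnly(Schema schema, int fieldId) {
    if (schema.findField(fieldId) == null) {
      return false; // unknown field: metrics cannot be used at all
    }
    return reachableThroughStructs(schema.asStruct(), fieldId);
  }

  private static boolean reachableThroughStructs(Types.StructType struct, int fieldId) {
    for (Types.NestedField field : struct.fields()) {
      if (field.fieldId() == fieldId) {
        return true;
      }
      Type type = field.type();
      // only descend into struct children; map and list children are not followed
      if (type.isStructType() && reachableThroughStructs(type.asStructType(), fieldId)) {
        return true;
      }
    }
    return false;
  }
}
```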
@aokolnychyi @rdblue can you help review this?
PTAL @rdblue @RussellSpitzer @aokolnychyi @szehon-ho
Would love to see this merged.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This issue is still around in Spark 3.5, and this fix would be a big capability to have for data that is stored in nested, structured form.
Agreed. Can this be revived, @szehon-ho? Are you able to re-open it, @zhongyujiang?
@blakewhatley82 @mdub I think this fix is incorrect, because the null counts recorded in metadata for nested columns may themselves be incorrect at the moment; see #8611. I am not able to reopen this, so I've created a new PR, #11261, which takes a different approach to address this issue.
Fixed by #11261.