Spark: Add custom metric for number of deletes applied by a SparkScan
This is an extension of #4395. Here we add a custom metric for the number of delete rows that have been applied in a scan of a format v2 table.

We introduce a counter in `BatchDataReader` and `RowDataReader` that is incremented when a delete is applied. This counter is passed into `DeleteFilter`, and in the cases where we construct a `PositionDeleteIndex`, is passed into the implementation of the `PositionDeleteIndex`. In all the read paths, the counter is incremented whenever a delete is applied. When Spark calls `currentMetricsValues()` on a `PartitionReader`, which is a subclass of either `BatchDataReader` or `RowDataReader`, we get the current value of the counter and return that.
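The mechanism described above can be sketched in miniature as follows. This is a hedged, self-contained illustration: aside from the idea of a shared counter, the class and method names (`DeleteCounter`, `SketchDeleteFilter`, `keep`) are simplified stand-ins, not the actual Iceberg or Spark API.

```java
import java.util.function.Predicate;

// Hypothetical sketch of the counting mechanism described in this PR.
// DeleteCounter mirrors the idea of a mutable counter shared between the
// reader and the delete filter; the real classes are more involved.
class DeleteCounter {
  private long count = 0L;

  void increment() {
    count++;
  }

  long get() {
    return count;
  }
}

// A stand-in for DeleteFilter: drops rows that match a delete predicate
// and bumps the shared counter once per applied delete.
class SketchDeleteFilter<T> {
  private final Predicate<T> isDeleted;
  private final DeleteCounter counter;

  SketchDeleteFilter(Predicate<T> isDeleted, DeleteCounter counter) {
    this.isDeleted = isDeleted;
    this.counter = counter;
  }

  boolean keep(T row) {
    if (isDeleted.test(row)) {
      counter.increment(); // a delete was applied to this row
      return false;
    }
    return true;
  }
}
```

When Spark polls the reader via `currentMetricsValues()`, the reader would simply report `counter.get()` as the value of the custom metric.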
Tested manually by creating a format v2 table using each of Parquet, ORC, and Avro files, deleting and updating rows in the tables, and reading from the tables. The expected number of deletes shows up in the Spark UI.
Also extended existing unit tests (`DeleteReadTests`) to count the number of deletes applied during the scan.
![Screen Shot 2022-04-19 at 8 59 31 PM](https://user-images.githubusercontent.com/3925490/164147620-7085eafd-304d-45d1-aa53-0c6029638d48.png)
@rdblue @RussellSpitzer would you please review?
@RussellSpitzer thank you for reviewing! This PR is on top of #4395, but since that is not merged, all its changes are here. I'll clean up the logging. It is true that their original purpose was for debugging/testing, but I think some of them are still potentially helpful.
@RussellSpitzer I have cleaned up the logging, and reworked the unit tests, on top of the updates to #4395.
@RussellSpitzer please review again when you can. Thanks.
@flyrain could you take a look at this?
@kbendick and @RussellSpitzer thank you for your reviews and suggestions.
I have removed `DeleteCounter` from `BitmapPositionDeleteIndex`. I have implemented the suggestion to extract the custom metric from Spark after running a DataFrame query, although this turned out to be trickier than expected due to async issues that did not surface locally but surfaced in the CI. I think the PR is much better now.
I think we would also really benefit from @flyrain doing a full review on this as well especially now that we have the "markDelete" pathway as well. I assume for that we probably will just skip counting deletes since we don't really care.
@RussellSpitzer I saw that `Deletes` and `DeleteFilter` have changed in master since I updated this PR (so this PR no longer merges), but I haven't had the time to understand the changes and how to reconcile this PR with them.

I don't understand your statement that "we probably will just skip counting deletes since we don't really care." Are you saying that this PR is no longer worthwhile?
No! I think this PR is very important. There is just now a second read path, with an `_is_deleted` metadata column, which actually returns deleted rows. When we follow that path I'm not sure it's important for us to count the deleted rows, since we'll be returning them anyway; I could go either way. For the path where we don't actually return deleted rows (the normal read path), we definitely want this metric!
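The two read paths being discussed can be modeled with a small sketch. This is purely illustrative (the method names and row representation are invented for the example): the normal path drops deleted rows, which is where the metric would be incremented, while the metadata-column path keeps every row and attaches a deleted flag instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative model of the two read paths: the normal path filters out
// deleted rows (and would increment the delete metric), while the
// metadata-column path returns every row tagged with a deleted flag.
class TwoPathSketch {
  // Normal path: deleted positions are dropped from the output.
  static List<String> normalRead(List<String> rows, Set<Integer> deletedPos) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      if (!deletedPos.contains(i)) {
        out.add(rows.get(i));
      }
      // else: a delete was applied; this is where the counter increments
    }
    return out;
  }

  // Metadata-column path: every row is returned, tagged with its flag.
  static List<String> markDeleteRead(List<String> rows, Set<Integer> deletedPos) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      out.add(rows.get(i) + ", _is_deleted=" + deletedPos.contains(i));
    }
    return out;
  }
}
```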
Thanks for the explanation. I haven't followed the changes that recently went in. Your explanation helps put them in context for me. I'll rebase on master and try to address your current feedback.
Sorry, I haven't gotten a chance to review. The changes to the `Deletes` and `DeleteFilter` classes were introduced by my PR #4683, which can read deleted rows by using the metadata column `_deleted`. It may affect how we collect the delete metrics. I'd recommend taking a look at #4683, @wypoon. I will also review this PR a bit later.
Agreed with @RussellSpitzer, we probably don't need the metric if a user queries all deleted rows (enabled by #4683). It is more useful when a user queries a data file with deletes applied, since the number of deletes is hidden from the user. I'm also worried about the perf implication. I'd recommend benchmarking the potential perf loss.
Benchmarking could be good, although I'm gonna be real surprised if it's that heavy. I'll take bets on undetectable difference ... I'm assuming all the function calls will end up getting inlined during runtime optimizations
I'm working on rebasing on master, reconciling the conflicts with #4683.
@RussellSpitzer @flyrain I have rebased on master and reconciled the conflicts with #4683. Counting deletes only happens for the normal read path, not the new path added by #4683.
Russell, I have adopted your suggestions for `TestSparkReaderDeletes`. For the other questions/points you raised, I have answered them earlier.
I don't think this work has any performance impact. The costs of incrementing an in-memory counter are dwarfed by the costs of opening and reading delete files.
@flyrain thanks for reviewing. I'll look into your comments.
I have been on a break. Just returning to this. I'm rebasing on master since there are conflicts. I'll follow up after that.
I ran `IcebergSourceParquetPosDeleteBenchmark` and `IcebergSourceParquetEqDeleteBenchmark` on the state of master when I last rebased and on my last change. The results for `readIceberg` and `readIcebergVectorized` are as follows:
There is hardly any difference for `IcebergSourceParquetPosDeleteBenchmark`.
For `IcebergSourceParquetEqDeleteBenchmark`, the numbers show a greater difference (some worse and some better, comparing after to before), but due to the much greater variances, I think the differences are within the variances.
I'm going to rebase on master again and conform to the new spotless code style so that this can merge.
@RussellSpitzer @flyrain I have rebased on master. @flyrain I have fixed the bug you pointed out; thank you for pointing it out. (As a result of fixing the bug, it became unnecessary to change `PositionDeleteIndex`.)
Please review this again. I kept the changes in Spark 3.2, since I was rebasing. I can port the changes to Spark 3.3 in this same PR (and am inclined to do so), if the changes look good to you.
I have ported the changes to Spark 3.3. The Spark 3.3 changes are identical to the Spark 3.2 changes (the affected files are identical at this point).
@flyrain thanks for reviewing this again. I responded to your comments. Aside from the issue regarding counting the deletes for the case where there is an `_is_deleted` metadata column, I'll go ahead and make the changes.
@RussellSpitzer @flyrain I have updated the PR.
I went ahead and implemented the necessary changes so that the number of deletes applied is counted whether or not there is a `_deleted` metadata column in the read.
I have also addressed the other feedback.
@flyrain thank you for reviewing. I hope this can be merged now.
Great. Thanks @wypoon for working on this. Hi @RussellSpitzer and @kbendick, do you want to take another look and approve it? There is an authentication issue from my side that I still don't have the permission to merge even though I'm a committer now. Hopefully, my authentication issue could be resolved next week.
Yeah sure let me do another pass. Congrats again on becoming a committer!
Thanks @kbendick!
@kbendick thanks for reviewing. I have addressed some minor nits.
@flyrain @kbendick since you have approved this, can one of you please merge it? (The only update since is one adding some blank lines after if blocks and tweaking of comments in the test class.)
@wypoon, sorry for the delay, I'm working on my Apache authentication issue. I don't have the permission to merge until the issue is resolved. Maybe @RussellSpitzer can help?
Unfortunately I’m not able to merge yet as I’m not a committer. Hopefully either @flyrain’s credentials will get worked out or somebody else will handle it!
And sorry for any delay in response, I’ve been out very sick this week but trying to stay on top of things when I have the energy. 😷