Spark: Add custom metric for number of deletes applied by a SparkScan
This is an extension of #4395. Here we add a custom metric for the number of delete rows that have been applied in a scan of a format v2 table.

We introduce a counter in `BatchDataReader` and `RowDataReader` that is incremented when a delete is applied. This counter is passed into `DeleteFilter`, and in the cases where we construct a `PositionDeleteIndex`, is passed into the implementation of the `PositionDeleteIndex`. In all the read paths, the counter is incremented whenever a delete is applied. When Spark calls `currentMetricsValues()` on a `PartitionReader`, which is a subclass of either `BatchDataReader` or `RowDataReader`, we get the current value of the counter and return that.
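The mechanism described above can be sketched in miniature as follows. This is a hedged, self-contained illustration: aside from the idea of a shared counter, the class and method names (`DeleteCounter`, `SketchDeleteFilter`, `keep`) are simplified stand-ins, not the actual Iceberg or Spark API.

```java
import java.util.function.Predicate;

// Hypothetical sketch of the counting mechanism described in this PR.
// DeleteCounter mirrors the idea of a mutable counter shared between the
// reader and the delete filter; the real classes are more involved.
class DeleteCounter {
  private long count = 0L;

  void increment() {
    count++;
  }

  long get() {
    return count;
  }
}

// A stand-in for DeleteFilter: drops rows that match a delete predicate
// and bumps the shared counter once per applied delete.
class SketchDeleteFilter<T> {
  private final Predicate<T> isDeleted;
  private final DeleteCounter counter;

  SketchDeleteFilter(Predicate<T> isDeleted, DeleteCounter counter) {
    this.isDeleted = isDeleted;
    this.counter = counter;
  }

  boolean keep(T row) {
    if (isDeleted.test(row)) {
      counter.increment(); // a delete was applied to this row
      return false;
    }
    return true;
  }
}
```

When Spark polls the reader via `currentMetricsValues()`, the reader would simply report `counter.get()` as the value of the custom metric.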
Tested manually by creating a format v2 table using each of Parquet, ORC, and Avro files, deleting and updating rows in the tables, and reading from the tables. The expected number of deletes shows up in the Spark UI.
Also extended existing unit tests (`DeleteReadTests`) to count the number of deletes applied during the scan.
![Screen Shot 2022-04-19 at 8 59 31 PM](https://user-images.githubusercontent.com/3925490/164147620-7085eafd-304d-45d1-aa53-0c6029638d48.png)
@rdblue @RussellSpitzer would you please review?
@RussellSpitzer thank you for reviewing! This PR is on top of #4395, but since that is not merged, all its changes are here. I'll clean up the logging. It is true that their original purpose was for debugging/testing, but I think some of them are still potentially helpful.
@RussellSpitzer I have cleaned up the logging, and reworked the unit tests, on top of the updates to #4395.
@RussellSpitzer please review again when you can. Thanks.
@flyrain could you take a look at this?
@kbendick and @RussellSpitzer thank you for your reviews and suggestions.
I have removed `DeleteCounter` from `BitmapPositionDeleteIndex`. I have implemented the suggestion to extract the custom metric from Spark after running a DataFrame query, although this turned out to be trickier than expected due to async issues that did not surface locally but surfaced in the CI. I think the PR is much better now.
I think we would also really benefit from @flyrain doing a full review on this as well especially now that we have the "markDelete" pathway as well. I assume for that we probably will just skip counting deletes since we don't really care.
@RussellSpitzer I saw that `Deletes` and `DeleteFilter` have changed in master since I updated this PR (so this PR no longer merges), but I haven't had the time to understand the changes and how to reconcile this PR with them.

I don't understand your statement that "we probably will just skip counting deletes since we don't really care." Are you saying that this PR is no longer worthwhile?
No! I think this PR is very important. There is just now a second read path, with an `_is_deleted` metadata column, which actually returns deleted rows. When we follow that path I'm not sure it's important for us to count the deleted rows, since we'll be returning them anyway; I could go either way. For the path where we don't actually return deleted rows (the normal read path), we definitely want this metric!
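The two read paths being discussed can be modeled with a small sketch. This is purely illustrative (the method names and row representation are invented for the example): the normal path drops deleted rows, which is where the metric would be incremented, while the metadata-column path keeps every row and attaches a deleted flag instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative model of the two read paths: the normal path filters out
// deleted rows (and would increment the delete metric), while the
// metadata-column path returns every row tagged with a deleted flag.
class TwoPathSketch {
  // Normal path: deleted positions are dropped from the output.
  static List<String> normalRead(List<String> rows, Set<Integer> deletedPos) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      if (!deletedPos.contains(i)) {
        out.add(rows.get(i));
      }
      // else: a delete was applied; this is where the counter increments
    }
    return out;
  }

  // Metadata-column path: every row is returned, tagged with its flag.
  static List<String> markDeleteRead(List<String> rows, Set<Integer> deletedPos) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      out.add(rows.get(i) + ", _is_deleted=" + deletedPos.contains(i));
    }
    return out;
  }
}
```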
Thanks for the explanation. I haven't followed the changes that recently went in. Your explanation helps put them in context for me. I'll rebase on master and try to address your current feedback.
Sorry, I haven't gotten a chance to review. The changes to the `Deletes` and `DeleteFilter` classes were introduced by my PR #4683, which can read deleted rows by using the metadata column `_deleted`. It may affect how we collect the delete metrics. I'd recommend taking a look at #4683, @wypoon. I will also review this PR a bit later.
Agreed with @RussellSpitzer, we probably don't need the metric if a user queries all deleted rows (enabled by #4683). It is more useful when a user queries a data file with deletes applied, since the number of deletes is hidden from the user. I'm also worried about the perf implication. I'd recommend benchmarking the potential perf loss.
Benchmarking could be good, although I'm gonna be real surprised if it's that heavy. I'll take bets on undetectable difference ... I'm assuming all the function calls will end up getting inlined during runtime optimizations
I'm working on rebasing on master, reconciling the conflicts with #4683.
@RussellSpitzer @flyrain I have rebased on master and reconciled the conflicts with #4683. Counting deletes only happens for the normal read path, not the new path added by #4683.
Russell, I have adopted your suggestions for `TestSparkReaderDeletes`. For the other questions/points you raised, I have answered them earlier.
I don't think this work has any performance impact. The costs of incrementing an in-memory counter are dwarfed by the costs of opening and reading delete files.
@flyrain thanks for reviewing. I'll look into your comments.
I have been on a break. Just returning to this. I'm rebasing on master since there are conflicts. I'll follow up after that.
I ran `IcebergSourceParquetPosDeleteBenchmark` and `IcebergSourceParquetEqDeleteBenchmark` on the state of master when I last rebased and on my last change. The results for `readIceberg` and `readIcebergVectorized` are as follows:
There is hardly any difference for `IcebergSourceParquetPosDeleteBenchmark`.
For `IcebergSourceParquetEqDeleteBenchmark`, the numbers show a greater difference (some worse and some better, comparing after to before), but due to the much greater variances, I think the differences are within the variances.
I'm going to rebase on master again and conform to the new spotless code style so that this can merge.
@RussellSpitzer @flyrain I have rebased on master. @flyrain I have fixed the bug you pointed out; thank you for pointing it out. (As a result of fixing the bug, it became unnecessary to change `PositionDeleteIndex`.)
Please review this again. I kept the changes in Spark 3.2, since I was rebasing. I can port the changes to Spark 3.3 in this same PR (and am inclined to do so), if the changes look good to you.
I have ported the changes to Spark 3.3. The Spark 3.3 changes are identical to the Spark 3.2 changes (the affected files are identical at this point).
@flyrain thanks for reviewing this again. I responded to your comments. Aside from the issue regarding counting the deletes for the case where there is an `_is_deleted` metadata column, I'll go ahead and make the changes.
@RussellSpitzer @flyrain I have updated the PR.
I went ahead and implemented the necessary changes so that the number of deletes applied is counted whether or not there is a `_deleted` metadata column in the read.
I have also addressed the other feedback.
@flyrain thank you for reviewing. I hope this can be merged now.
Great. Thanks @wypoon for working on this. Hi @RussellSpitzer and @kbendick, do you want to take another look and approve it? There is an authentication issue from my side that I still don't have the permission to merge even though I'm a committer now. Hopefully, my authentication issue could be resolved next week.
Yeah sure let me do another pass. Congrats again on becoming a committer!
Thanks @kbendick!
@kbendick thanks for reviewing. I have addressed some minor nits.
@flyrain @kbendick since you have approved this, can one of you please merge it? (The only update since is one adding some blank lines after if blocks and tweaking of comments in the test class.)
@wypoon, sorry for the delay, I'm working on my Apache authentication issue. I don't have the permission to merge until the issue is resolved. Maybe @RussellSpitzer can help?
Unfortunately I’m not able to merge yet as I’m not a committer. Hopefully either @flyrain’s credentials will get worked out or somebody else will handle it!
And sorry for any delay in response, I’ve been out very sick this week but trying to stay on top of things when I have the energy. 😷