
WIP: Remove redundant sorts from copy on write deletes

Open SinghAsDev opened this issue 2 years ago • 15 comments

This diff includes the following changes.

  1. Enables reading an entire file as one split in copy-on-write operations.
  2. Writes the sort-order-id from Spark writers into data files.
  3. Enables reading the sort-order-id when reading through Spark readers.
  4. Enables skipping distribution and ordering for copy-on-write operations to make them more efficient.

SinghAsDev avatar May 04 '22 20:05 SinghAsDev

Hi @aokolnychyi @RussellSpitzer @rdblue, I wanted to check with you all whether there are any high-level concerns with this approach?

I will add more tests, and will either make file-as-split generic for any scan, or rename it to make it obvious that it is limited to copy-on-write scans.

SinghAsDev avatar May 06 '22 17:05 SinghAsDev

Thanks for the PR, @SinghAsDev! I can take a look next week.

aokolnychyi avatar May 15 '22 00:05 aokolnychyi

I spent some time thinking about this. Let me summarize how I understand the proposal.

The use case we are talking about is copy-on-write DELETE executed using a broadcast join where we read files from the current spec, only one file per split, files are already reasonably compacted and sorted as needed. Right now, we can avoid the shuffle by setting the distribution mode to none but we can't disable a potentially redundant local sort. Is my understanding correct?

We can't target UPDATE and MERGE as those may change the ordering/partitioning of records even if we read properly sorted/partitioned files. We can't target tables with small files as having a split per file will be costly. The first two restrictions seem reasonable but I am afraid we can't target a DELETE that triggers a shuffle, as it will distribute records across tasks by the delete condition, not necessarily by the partition/sort key. That poses a problem as Iceberg does not know which join implementation Spark is going to pick, so we can't say whether the sort is really redundant. The proper solution would be to report the ordering of scan tasks to Spark, request the required distribution/ordering on write, and let Spark decide if the sort is redundant. There was a Spark proposal for exposing task ordering but it is not available yet.
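The dependency on Spark's join choice can be sketched as a tiny decision function. This is purely illustrative; the function and parameter names are assumptions for the sketch, not Iceberg or Spark APIs:

```python
# Illustrative only: whether the write-side local sort is safely removable.
# A broadcast join leaves the scan-side rows in the order they were read, so
# already-sorted files stay sorted through the join. Shuffle-based strategies
# such as a sort-merge join redistribute rows by the join/delete condition,
# so the per-file ordering is lost even when the input files were sorted.
def local_sort_redundant(files_already_sorted: bool, join_strategy: str) -> bool:
    return files_already_sorted and join_strategy == "broadcast"
```

Since Iceberg plans the write before Spark picks `join_strategy`, it cannot evaluate this condition on its own, which is the core of the problem above.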

If we had a way to pass options to DELETE commands, we could simply support the same using these steps:

  • Set read.split.open-file-cost to Long.MaxValue to force one file per split
  • Set write.delete.distribution-mode to none
  • Set write option use-table-distribution-and-ordering to false

I will be adding OPTIONS to row-level commands in Spark too but it won't be there until Spark 3.4.
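As a rough illustration of why setting `read.split.open-file-cost` to `Long.MaxValue` forces one file per split, here is a simplified greedy model of split packing (an assumption for illustration, not Iceberg's actual planning code): each file is weighted at `max(file_size, open_file_cost)` and files are packed into splits capped at the split size.

```python
# Simplified model of split packing; not Iceberg's real implementation.
LONG_MAX = 2**63 - 1

def plan_splits(file_sizes,
                split_size=128 * 1024 * 1024,
                open_file_cost=4 * 1024 * 1024):
    splits, current, current_weight = [], [], 0
    for size in file_sizes:
        # Small files are "rounded up" to open_file_cost so that opening
        # many tiny files is not treated as free.
        weight = max(size, open_file_cost)
        if current and current_weight + weight > split_size:
            splits.append(current)
            current, current_weight = [], 0
        current.append(size)
        current_weight += weight
    if current:
        splits.append(current)
    return splits

files = [1_000_000, 2_000_000, 3_000_000]
print(len(plan_splits(files)))                            # → 1 (files share a split)
print(len(plan_splits(files, open_file_cost=LONG_MAX)))   # → 3 (one file per split)
```

With `open_file_cost = LONG_MAX`, every file's weight alone exceeds any split size, so no two files can ever be packed together.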

aokolnychyi avatar May 18 '22 20:05 aokolnychyi

BTW, I think propagating the sort order ID to written files is something we should do. We have to be careful to do that only when we actually requested a sort order. @SinghAsDev, would you be interested in submitting a PR for that? It is independent of what we are going to do here.
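A hypothetical sketch of that guard (the function name and parameters are assumed for illustration, not Iceberg's writer API): stamp the table's sort order id on a written file only when the write actually applied that sort, and otherwise fall back to the reserved unsorted id.

```python
# Hypothetical illustration of the guard discussed above.
UNSORTED_ORDER_ID = 0  # the Iceberg spec reserves sort order id 0 for "unsorted"

def file_sort_order_id(table_sort_order_id: int, sort_was_applied: bool) -> int:
    # Stamping the table's sort order id on a file that was not actually
    # sorted would let readers later skip sorts that are still required.
    return table_sort_order_id if sort_was_applied else UNSORTED_ORDER_ID
```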

aokolnychyi avatar May 18 '22 20:05 aokolnychyi

Thoughts, @RussellSpitzer @szehon-ho @rdblue @flyrain?

aokolnychyi avatar May 18 '22 20:05 aokolnychyi

I'm a little worried about the case in which every file was written with the correct sort order BUT the files were written by independent writes. Then I could have dozens of files which overlap, even though they all have the same sort order. In that case I'm not sure it makes sense to ignore the sort request; the sort wouldn't be redundant, and we would be better off applying the distribution.

For disabling the sort, I think disabling the distribution mode is good enough? Although maybe file-as-task is a nice bonus there. I'm a little worried since that parameter is a confusing internal implementation detail for an external user to wrap their head around.

I am strongly in support, though, of passing through "sort order" to files, and want that PR if we can do it soon.

RussellSpitzer avatar May 18 '22 20:05 RussellSpitzer

I'm a little worried about the case in which every file was written with the correct sort order BUT the files were written by independent writes. Then I could have dozens of files which overlap, even though they all have the same sort order. In that case I'm not sure it makes sense to ignore the sort request; the sort wouldn't be redundant, and we would be better off applying the distribution.

I guess the answer would be "it depends". In the case of @SinghAsDev, the files seem to be properly compacted and sorted so that only a small number of files overlap. Anyway, the point I was trying to make is that there are a lot of assumptions that must be met, which makes this use case pretty narrow. On top of that, we can't do this at all because Iceberg doesn't know whether a broadcast join will be used.

aokolnychyi avatar May 18 '22 21:05 aokolnychyi

I want to say I'm not against heuristic optimizations, but I do think we need to make them optional, explicitly enabled/disabled, and they should do the right thing at least 90% of the time.

RussellSpitzer avatar May 18 '22 21:05 RussellSpitzer

The use case we are talking about is copy-on-write DELETE executed using a broadcast join where we read files from the current spec, only one file per split, files are already reasonably compacted and sorted as needed. Right now, we can avoid the shuffle by setting the distribution mode to none but we can't disable a potentially redundant local sort. Is my understanding correct?

That is correct. However, we want to remove sorts automatically without users having to figure out if all files are already sorted or not. So, imagine a dataset that goes through deletion daily. The first deletion should do a global sort while re-writing, but future deletes should not have to perform any sort (not even a local sort).

If we had a way to pass options to DELETE commands, we could simply support the same using these steps:

  • Set read.split.open-file-cost to Long.MaxValue to force one file per split
  • Set write.delete.distribution-mode to none
  • Set write option use-table-distribution-and-ordering to false

I will be adding OPTIONS to row-level commands in Spark too but it won't be there until Spark 3.4.

For this to work, users would first have to figure out whether all files are sorted and then act accordingly. If users want to do that, they can by explicitly adding an order by only in the first delete. However, there would still be a redundant local sort in subsequent deletes.

SinghAsDev avatar May 18 '22 21:05 SinghAsDev

BTW, I think propagating the sort order ID to written files is something we should do. We have to be careful to do that only when we actually requested a sort order. @SinghAsDev, would you be interested in submitting a PR for that? It is independent of what we are going to do here.

Will do.

SinghAsDev avatar May 18 '22 21:05 SinghAsDev

That poses a problem as Iceberg does not know which join implementation Spark is going to pick so we can't say whether the sort is really redundant.

Yea, I definitely think this should be a configurable optimization at best. But I won't be surprised if many GDPR use cases benefit from it.

SinghAsDev avatar May 18 '22 21:05 SinghAsDev

However, we want to remove sorts automatically without users having to figure out if all files are already sorted or not.

Yeah, I agree but I am afraid Iceberg can't do this on its own as we don't know whether there will be a shuffle. We can't remove the local sort if there is a subquery that is rewritten as a sort-merge join, can we?

aokolnychyi avatar May 18 '22 22:05 aokolnychyi

However, we want to remove sorts automatically without users having to figure out if all files are already sorted or not.

Yeah, I agree but I am afraid Iceberg can't do this on its own as we don't know whether there will be a shuffle. We can't remove the local sort if there is a subquery that is rewritten as a sort-merge join, can we?

Yea, we can't tell for sure, but I'm wondering if you see value in allowing users who know what they are doing to turn this optimization on. We are experimenting with this at Pinterest already and are seeing significant improvements.

SinghAsDev avatar May 20 '22 04:05 SinghAsDev

Yea, we can't tell for sure, but I'm wondering if you see value in allowing users who know what they are doing to turn this optimization on.

I'd be open to that, but I feel like we need to think a little bit more about how to expose this.

@SinghAsDev, do you happen to have some performance numbers to share? I'd imagine a local sort would be fairly cheap unless there is a shuffle spill or substantial GC time.

aokolnychyi avatar May 24 '22 02:05 aokolnychyi

I'd be open to that, but I feel like we need to think a little bit more about how to expose this.

Yea, happy to discuss this more.

@SinghAsDev, do you happen to have some performance numbers to share? I'd imagine a local sort would be fairly cheap unless there is a shuffle spill or substantial GC time.

We saw a 10-20% cost increase with the local sort. The cost also goes up with the number of keys the sort is on.

SinghAsDev avatar Jun 15 '22 14:06 SinghAsDev