Puffin: Add delete-vector-v1 blob type

Open rdblue opened this issue 1 year ago • 4 comments

This adds a blob type to the Puffin spec that can store a Roaring bitmap delete vector. This is in support of the row-level delete improvements proposed for Iceberg v3.

rdblue avatar Sep 30 '24 19:09 rdblue

I am going to share a PR with some basic implementation that follows this spec. We can use it as an example that will hopefully clarify some questions. Thanks for putting this together, @rdblue!

aokolnychyi avatar Oct 11 '24 04:10 aokolnychyi

> I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

This is my main concern. I don't think the technical differences here really present blockers; they just add some warts. I also think compatibility between table formats is a good goal, but I worry that due to governance differences between Iceberg and Delta, things will naturally go slower in Iceberg, so we would in most cases likely be ceding design to another project. I'm happy to take a wait-and-see approach on the more philosophical issue here and move forward on this (ultimately I think people doing the work should have more of a say on approach).

emkornfield avatar Oct 11 '24 20:10 emkornfield

PR #11302 contains a sample implementation of this spec.

aokolnychyi avatar Oct 13 '24 23:10 aokolnychyi

> I’m not sure it’s worth drawing a line in the sand over this particular issue and I’d like to talk about it a bit more as a community before we merge this. I don’t want to set a precedent of adding write requirements to the Iceberg spec that aren’t actually requirements for Iceberg. I feel like if we make this a pattern we will essentially be deferring design decisions and I don’t really feel comfortable with that.

> This is my main concern. I don't think the technical differences here really present blockers; they just add some warts. I also think compatibility between table formats is a good goal, but I worry that due to governance differences between Iceberg and Delta, things will naturally go slower in Iceberg, so we would in most cases likely be ceding design to another project. I'm happy to take a wait-and-see approach on the more philosophical issue here and move forward on this (ultimately I think people doing the work should have more of a say on approach).

I agree that we don't want to cede design to another project or to set a precedent. This should be an independent choice about whether we want to maintain compatibility in this case, based on weighing the benefits against the costs. This is definitely an Iceberg community decision.

To me, reducing fragmentation across formats is worth the cost of a few warts.

> I think that compatibility with other table formats is a great goal but I do want to stress that I value our ability to read other formats much higher than our ability to write other formats.

This is true, but I'm not sure that we've had a case before where we know that we want to build basically the same thing. And in this case, if we want compatibility with existing code, then we would need to make sure we write the fields to keep the other readers functioning.

I also think that if the community roles were reversed in a similar situation, we would want the Delta community to consider compatibility when building a very similar feature, too.

rdblue avatar Oct 15 '24 17:10 rdblue

The vote has passed, so I merged this PR. Thanks @rdblue! Thanks everyone who reviewed!

aokolnychyi avatar Nov 02 '24 10:11 aokolnychyi

Linking this PR to #11122 for tracking.

aokolnychyi avatar Nov 04 '24 08:11 aokolnychyi