delta icon indicating copy to clipboard operation
delta copied to clipboard

Update Protocol Spec for Deletion Vectors

Open larsk-db opened this issue 2 years ago • 1 comments

Description

  • This PR makes the concrete changes proposed in #1367 to the Delta protocol specification. For details of what this proposal entails, see that issues.
  • In addition, this PR makes some clarification changes to the wording in the spec in various places, many of which where necessary to correctly reflect concepts introduced by the proposal (e.g., logical files, exact column stat semantics).

How was this patch tested?

N/A (document-only).

Does this PR introduce any user-facing changes?

No.

larsk-db avatar Sep 07 '22 14:09 larsk-db

Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.

About the version update: I'm a bit concerned that we are adding more and more features that every reader needs to implement along the way. Like, to have DVs, every reader now needs to support Column Mapping, because our protocol versions introduce this linear dependency between completely unrelated functionality. I wonder if there's isn't something we could do here to reduce the protocol support burden.

larsk-db avatar Sep 13 '22 10:09 larsk-db

Left a few high-level comments. We need to bump the protocol versions, so please add version (3,7) in the version list, and add the necessary discussion about version consideration similar to how Column Mapping has added it.

About the version update: I'm a bit concerned that we are adding more and more features that every reader needs to implement along the way. Like, to have DVs, every reader now needs to support Column Mapping, because our protocol versions introduce this linear dependency between completely unrelated functionality. I wonder if there's isn't something we could do here to reduce the protocol support burden.

We currently have a floating idea to solve this issue: instead of bundling multiple features (used and unused) into a single protocol version number, we can store only used features in the log, so readers can can read the table when all features that are actually used are supported.

A rough example:

  • Now: table is on reader protocol 3, which includes Column Mapping and DV
    • Reader must support both features to read the table
  • Future: table is using only one feature DELETION_VECTOR
    • Reader can read the table if it supports reading DV while do not support Column Mapping

IMO the example above is not a good one as Column Mapping support is much easier to implement than reading DVs. Nevertheless, the idea is that, by doing so we can avoid forcing clients to support all features up to a certain protocol version, and instead give them the freedom to choose supporting features they are interested in the most.

What do you think of this idea?

xupefei avatar Sep 14 '22 09:09 xupefei

We currently have a floating idea to solve this issue [...] What do you think of this idea?

That sounds great. By floating idea you mean, this is something we could actually add relatively soon? Like, would it make sense to delay this protocol change until we can implement it using the mechanism you described above?

larsk-db avatar Sep 15 '22 11:09 larsk-db

We currently have a floating idea to solve this issue [...] What do you think of this idea?

That sounds great. By floating idea you mean, this is something we could actually add relatively soon? Like, would it make sense to delay this protocol change until we can implement it using the mechanism you described above?

Yes! We're currently drafting a doc to describe it in detail. The idea is to make it play nice with existing protocol behaviors, thus fewer bugs and faster implementation.

When do you plan to release DV? If there's still some time, we could bump the protocol version for the feature thing I am working on, then make DV the first feature it supports 😀

xupefei avatar Sep 15 '22 11:09 xupefei

When do you plan to release DV? If there's still some time, we could bump the protocol version for the feature thing I am working on, then make DV the first feature it supports 😀

Over the next month or so would be good. I'm fine with waiting for the feature thingy and punting on adding version info to this PR, if it's alright with @tdas and @scottsand-db?

larsk-db avatar Sep 15 '22 12:09 larsk-db

I added a reader version now, so we have something there, but my intention is still to hold this until @xupefei introduces the feature thingy discussed above.

larsk-db avatar Oct 10 '22 12:10 larsk-db