delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

Optionally ignore tombstones for read-only use cases

Open dispanser opened this issue 2 years ago • 10 comments

Description

In a discussion for PR #420, https://github.com/delta-io/delta-rs/pull/420#discussion_r703270551 an idea came up to have a config option to ignore all tombstone entries entirely. Applications that are only reading do not actually use tombstones.

Use Case

This would reduce memory footprint, in particular for read-only applications.

@houqp writes:

Letting read-only application specifying whether tombstones should be kept at all sounds like a neat trick to reduce memory usage :) I think it's better handled as a separate self-contained PR though because it's an optimization specific to a subset of readers, while the problem we are dealing where applies to both readers and write

I'd like to explore that idea further. We actually have a delta table that is mostly tombstones, which would see a major decrease in memory consumption.

dispanser avatar Sep 07 '21 18:09 dispanser

Extending on @dispanser 's idea, I think we could even extend it to all writers that doesn't handle checkpoints. By moving checkpoints to a dedicated writer/lambda function, we could even apply this optimization to all data writers.

houqp avatar Sep 07 '21 19:09 houqp

An easy way would be just an additional option/flag. And if set, then we'd keep tomstones list empty, destipe it's writer or reader

mosyp avatar Sep 08 '21 06:09 mosyp

I agree with @mosyp , let the user decide on instantiating the delta table.

However, it would be interesting if we're able to detect the misuse, i.e. panic / warn (?) when a writer attempts to write a checkpoint despite not having tombstones available. Not sure if that's easy to implement.

dispanser avatar Sep 08 '21 10:09 dispanser

@dispanser If we stick with an optional flag, then if it's set, then prevent of creating a checkpoint and telling user a meaningful error

mosyp avatar Sep 08 '21 10:09 mosyp

I'd make a quick foray into that topic if someone assigns this to me.

dispanser avatar Sep 08 '21 17:09 dispanser

yep, it should be easy to prevent side effect of this optimization by checking for the flag in both checkpoint and vacuum code path.

houqp avatar Sep 08 '21 19:09 houqp

Now that I'm able to load my large tombstone-filled delta table, I plan on revisiting this issue. Still considering myself new to rust, I'm wondering how this feature could be represented in the public API. I assume that pub async fn open_table(table_uri: &str) and related functions form the public interface. If this was scala, I'd just tuck on an additional method parameter ignoreTombstones: Boolen = false, which would be backwards compatible but expose the new feature to everyone who wants to use it. What's the best practice in rust?

dispanser avatar Sep 22 '21 13:09 dispanser

How about adding a new config value to https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_config.rs? Then we can pass in the full table config as an optional argument. We will have to either introduce breaking API change or add new functions. I am fine with us introducing breaking API change at the current stage of the project.

Another pattern I have seen from other projects is builder struct. So the code will look something like this:

let table = TableOpener::new(uri).with_config(config).with_version(1).open().await?;

Then if we need to add a new knob in the future, e.g. storage backend, we could add a new method to the TableOpener struct without breaking the existing users:

let table = TableOpener::new(uri).with_config(config).with_storage(storage).open().await?;

I am cool with either way, but I think the builder pattern is likely going to be the long term solution we would adopt.

houqp avatar Sep 23 '21 04:09 houqp

Thanks, @houqp ! I'll be looking into the builder pattern. As an added bonus, we can keep the top-level entry points for backwards-compatibility, by having them use that config mechanism.

dispanser avatar Sep 23 '21 05:09 dispanser

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/builder.rs is a good example of how the builder pattern works in real world.

houqp avatar Sep 23 '21 06:09 houqp