delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

fix(python): make table version always latest before doing merge

Open ion-elgreco opened this issue 1 year ago • 9 comments

Description

Small trivial change, but enforces table is always latest version before executing MERGE.

ion-elgreco avatar Nov 30 '23 08:11 ion-elgreco

is this something we want to do?

I would assume in actual user code, there may be some previous operations that took the current table state and may have derived some operation from it. At least I would expect that once I loaded the table, it remains at that state util I tell it otherwise :).

roeap avatar Nov 30 '23 12:11 roeap

@roeap if it's not the latest version while executing merge, you will get very strange errors regarding the schema which don't make sense. Also we are doing this in the writer as well, so we update incremental before writing.

Also, I am not sure if it even would make sense to load an older version and then merge on those files, what would the expected result be there? If you want to achieve this I think you need to restore first to that version and then merge.

ion-elgreco avatar Nov 30 '23 12:11 ion-elgreco

hmm .. generally speaking all query engines that I am aware of treat the table state as a snapshot. Our DeltaTable abstraction right now is somewhere between an actual snapshot and a snapshot factory.

I would have to look at the actual code paths again, to see what we are doing but generally speaking doing "hidden" updates can be quite dangerous, since the conflict resolution makes certain assumptions - which may not be relevant here. In this case I guess its not too bad, since it is before planning the actual merge.

However since its literally one line, could users not just do that call before themselves?

Maybe @wjones127 has an opinion on that?

roeap avatar Nov 30 '23 12:11 roeap

This reminds me a lot of the issue with #1863 that I hit. I agree with @roeap's assessment here that this could be dangerous. Would it it be possible to peek at the latest table state instead and determine if loaded_version != peeked_current_version and throw an error in that case?

While I agree this is a user responsibility, the fact that @ion-elgreco hit a problem here tells me the API is still not as safe as it could be :smile:

rtyler avatar Dec 02 '23 17:12 rtyler

@rtyler that makes sense. I don't think we have exposed something like that, but I guess this is the correct function we want to use: https://github.com/delta-io/delta-rs/blob/18c48349345d1bb6d23737d8a81c98784a013889/crates/deltalake-core/src/table/mod.rs#L456

I'll make the change tomorrow :)

ion-elgreco avatar Dec 02 '23 18:12 ion-elgreco

Would it it be possible to peek at the latest table state instead and determine if loaded_version != peeked_current_version and throw an error in that case?

Doing that can be really annoying if there are concurrent writers, though.

wjones127 avatar Dec 05 '23 03:12 wjones127

@wjones127 what shall we do here? Because in the writes we do update the table incrementally and then write

ion-elgreco avatar Dec 17 '23 15:12 ion-elgreco

@wjones127 what shall we do here? Because in the writes we do update the table incrementally and then write

I actually came across this in another setting and was going to propose making the update in write_deltalake optional.

Part of the problem here is if you want to make serializable commits, you need a way to tie the data you're about to write back to its 'source' version for the serializability check. But by updating the table to latest, you're almost always asserting that the source of your read was the latest table.

Imagine you have a table with one column with a series of integers in them. In the presence of two concurrent writers A and B, you might have A updates the table to negate all the values, and then B updates the table to double all the values. The rust version will correctly throw if these writes happen concurrently, telling you that a new write has happened which invalidates your old one. The python version by updating to latest, allows whichever gets there the second one to win.

emcake avatar Dec 20 '23 16:12 emcake

@emcake that doesn't sound right indeed. I'll put it back in draft, maybe it's good that we align this and then make the necessary changes also in the writer

ion-elgreco avatar Dec 26 '23 09:12 ion-elgreco