
Add replaceWhere functionality

Open MrPowers opened this issue 1 year ago • 10 comments

Description

PySpark has a cool replaceWhere option that lets you overwrite the existing data in a Delta table that matches a predicate with new data. Here's an example of the replaceWhere functionality:

df2 = spark.createDataFrame(
    [
        ("x", 7),
        ("y", 8),
        ("z", 9),
    ]
).toDF("letter", "number")

(
    df2.write.format("delta")
    .option("replaceWhere", "number >= 2")
    .mode("overwrite")
    .save("tmp/my_data")
)
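To make the intended semantics concrete, here is a minimal pure-Python sketch of what a replaceWhere-style overwrite does: rows matching the predicate are dropped and the new rows are appended. The `replace_where` helper and the `existing` rows are hypothetical, made up for illustration; they are not part of PySpark or deltalake. The check that new rows satisfy the predicate is an assumption that mirrors Delta Lake's default constraint check.

```python
# Hypothetical helper for illustration only; not a PySpark or deltalake API.
def replace_where(existing, new_rows, predicate):
    """Replace the rows of `existing` that match `predicate` with `new_rows`.

    `existing` and `new_rows` are lists of dicts; `predicate` is a callable
    that takes a row and returns True when the row should be replaced.
    """
    # Assumption: like Delta Lake's default constraint check, reject new
    # rows that fall outside the region being replaced.
    offending = [row for row in new_rows if not predicate(row)]
    if offending:
        raise ValueError(f"new rows do not match the predicate: {offending}")

    # Keep only the rows the predicate does not select, then append new data.
    kept = [row for row in existing if not predicate(row)]
    return kept + list(new_rows)


# Made-up starting table; the original post does not show its contents.
existing = [
    {"letter": "a", "number": 1},
    {"letter": "b", "number": 2},
    {"letter": "c", "number": 3},
]
new_rows = [
    {"letter": "x", "number": 7},
    {"letter": "y", "number": 8},
    {"letter": "z", "number": 9},
]

result = replace_where(existing, new_rows, lambda row: row["number"] >= 2)
```

With this input, only the row with number 1 survives from the old data, followed by the three new rows.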

What do folks think about adding replaceWhere functionality to the Python deltalake package?

It's possible that the Rust predicate argument in write_deltalake already exposes this functionality.

MrPowers avatar Dec 11 '23 16:12 MrPowers

I exposed the predicate parameter for the Rust engine writer, but it currently does nothing because the functionality in Rust is not built yet.

ion-elgreco avatar Dec 11 '23 16:12 ion-elgreco

take

r3stl355 avatar Dec 22 '23 21:12 r3stl355

I'll give this a try

r3stl355 avatar Dec 22 '23 21:12 r3stl355

WriteBuilder uses predicate: Option<String> but has no implementation for it yet, whereas DeleteBuilder uses predicate: Option<Expression>. I suggest harmonising by changing WriteBuilder to use predicate: Option<Expression>. Though this is a breaking change, predicate handling is not implemented in WriteBuilder, so changing the type should not cause issues.

r3stl355 avatar Dec 23 '23 10:12 r3stl355

It would be great to do this using logical expressions rather than the physical ones, much like @Blajda recently updated for merge. The good thing there is we get some type coercion for free, which has been a hassle with expressions.

In Python we will likely have to accept strings and do the parsing.
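As a rough illustration of "accept strings and do the parsing", here is a toy Python sketch that turns a single `column op literal` string into a callable and coerces the literal to the column's type. Everything here (`parse_predicate`, the `schema` mapping, the supported operators) is hypothetical; it is not how delta-rs or DataFusion parse expressions, only a sketch of the idea.

```python
import operator
import re

# Hypothetical, deliberately tiny grammar: <column> <op> <literal>.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Two-character operators are listed first so ">=" is not read as ">".
_PATTERN = re.compile(r"^\s*(\w+)\s*(>=|<=|!=|=|>|<)\s*(.+?)\s*$")


def parse_predicate(text, schema):
    """Parse `text` into a row -> bool callable.

    `schema` maps column names to Python types (an assumption standing in
    for a real table schema); the literal is coerced to the column's type.
    """
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported predicate: {text!r}")
    column, op, literal = match.groups()
    if column not in schema:
        raise KeyError(f"unknown column: {column!r}")

    # Type coercion: "2" becomes int 2 because `number` is an int column.
    value = schema[column](literal.strip("'\""))
    compare = _OPS[op]
    return lambda row: compare(row[column], value)


schema = {"letter": str, "number": int}
matches_number = parse_predicate("number >= 2", schema)
matches_letter = parse_predicate("letter = 'x'", schema)
```

The returned callables can then be applied to rows represented as dicts, for example `matches_number({"letter": "b", "number": 2})`.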

roeap avatar Dec 23 '23 10:12 roeap

@roeap I think we can start allowing Arrow expressions as input, which we can serialize as Substrait and then deserialize with datafusion-substrait.

ion-elgreco avatar Dec 23 '23 10:12 ion-elgreco

This would be a great goal, but I would say let's be consistent in that and make a deliberate API choice.

I.e., not have Substrait supported in one method but not the other...

Good news is substrait plans are of course logical plans :)

roeap avatar Dec 23 '23 10:12 roeap

I'll try that @roeap. As for

It would be great to do this using logical expressions rather than the physical ones, much like @Blajda recently updated for merge.

Is this David's PR the one you are referring to? https://github.com/delta-io/delta-rs/pull/1969

r3stl355 avatar Dec 23 '23 10:12 r3stl355

@roeap we should be able to add this to merge, update, delete and write, and then just add the conversion inside the pyo3 binding, so it's a Python-only feature.

ion-elgreco avatar Dec 23 '23 10:12 ion-elgreco

@r3stl355 it's #1720, which had been up for a while before it got merged.

@ion-elgreco sure, to get started, and as you said, right now this could just be internal. Substrait is a nice feature for Rust as well, though as an alternative path, since we are looking to integrate into DataFusion's internal planning.

roeap avatar Dec 23 '23 10:12 roeap