Unable to write metadata to Delta table using Scala API
Extension to #321
When using an API like the one below, the only supported method to write metadata is through the SparkSession configuration, which works until a single Spark session is performing multiple writes simultaneously. In that case, manipulating the Spark session still doesn't provide the ability to pass accurate metadata to the table. Could we find a way to ensure metadata writes are possible at the writer level even when using the following syntax? Thanks
cc @zsxwing
DeltaTable.forPath(target.tableLocation).alias("target")
.merge(updatesDF, mergeCondition)
.whenMatched
.updateAll()
.whenNotMatched
.insertAll()
.execute()
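For reference, the session-level approach described above looks roughly like this (a sketch only; the conf key is the one discussed later in this thread, and target, updatesDF and mergeCondition are the same placeholders as in the snippet above):

import io.delta.tables.DeltaTable

// Session-wide setting: it applies to every commit made through this SparkSession,
// which is why it breaks down once the same session runs several writes concurrently.
spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "nightly merge job")

DeltaTable.forPath(target.tableLocation).alias("target")
  .merge(updatesDF, mergeCondition)
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()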
Thank you
can you assign it to me? thanks @zsxwing
@dragongu Done. Thanks for offering the help!
I like the feature request and would like to see it available for the MERGE INTO statement and for DML statements like INSERT/UPDATE.
There is another workaround to handle the case of concurrent writes. Spark session allows cloning it such that the clone can have different configuration values. This is used behind the scenes in structured streaming.
So maybe we can follow the same approach and have something like:
# Use a separate session so the metadata setting doesn't leak into other concurrent writes
session_clone = spark.newSession()
session_clone.conf.set("spark.databricks.delta.commitInfo.userMetadata", "metadata message")
session_clone.sql("INSERT INTO ...")
# Version of the commit just made through this session
commit_version = session_clone.conf.get("spark.databricks.delta.lastCommitVersionInSession")
The above is in Python but the same API is available in Scala, plus an additional spark.cloneSession(). Seems there is a slight difference between cloneSession and newSession, but PySpark has newSession only.
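A rough Scala equivalent of that workaround could look like the following (a sketch using newSession(), which is public in both languages; the table name and metadata string are placeholders):

// Create a separate session so the metadata setting is isolated from other
// writes running concurrently through the original session.
val sessionClone = spark.newSession()
sessionClone.conf.set("spark.databricks.delta.commitInfo.userMetadata", "metadata message")
sessionClone.sql("INSERT INTO some_table VALUES (1)")  // placeholder statement
// Version of the commit just made through this session
val commitVersion = sessionClone.conf.get("spark.databricks.delta.lastCommitVersionInSession")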
Hi @GeekSheikh - can you
a) update the issue description to clarify that the only way to do it for the table API is using spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "test"), which is what breaks when multiple writes are occurring.
b) what sort of API are you looking for? DeltaTable.withUserMetadata(...).merge(...)?
@keen85 proposed the following APIs in #1383 (I'm going to close that issue and use this ticket to track any new update)
DeltaTable.delete(condition, userMetadata = None)
DeltaTable.update(condition, set, userMetadata = None)
DeltaTable.merge(source, condition, userMetadata = None)
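On the Scala side the same proposal might translate into signatures roughly like the ones below (purely a sketch of not-yet-implemented overloads; the Option parameter type and names are assumptions, not an existing API):

// Hypothetical Scala counterparts of the proposed overloads on io.delta.tables.DeltaTable;
// none of these exist in a released Delta version.
def delete(condition: Column, userMetadata: Option[String] = None): Unit
def update(condition: Column, set: Map[String, Column], userMetadata: Option[String] = None): Unit
def merge(source: DataFrame, condition: Column, userMetadata: Option[String] = None): DeltaMergeBuilder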
@zsxwing @scottsand-db is there any update on when this will be available?
@jaketherianos, I think that nobody is actively working on it 🙁 I'd really love to see this feature but I'm not a software engineer.
any progress on this feature?
I would like to take this feature on, if no one has claimed it; I use Delta Lake a lot professionally and would like this feature as well.
@PeterDowdy i'm interested in your opinion on the API proposed in #1383
DeltaTable.delete(condition, userMetadata = None)
DeltaTable.update(condition, set, userMetadata = None)
DeltaTable.merge(source, condition, userMetadata = None)
Would be best to first get consensus on the API before you start implementing it.
@EkhartScheifes any input?
I'm thinking about it, although I think the .delete and update APIs lock us into something like that. Other spark APIs follow a .option() api call convention that would be nice to mimic, e.g. .option('userMetadata','your_metadata_here'), but since delete and update execute the delete and update immediately it's not an option. I'd be willing to move forward on adding that as args unless someone has a suggestion that keeps the call more consistent with the .option() api.
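For comparison, here is roughly what the two call styles look like today (a sketch; the path and metadata strings are placeholders, and the userMetadata writer option refers to the existing DataFrameWriter route rather than anything new):

import io.delta.tables.DeltaTable

// DataFrameWriter route: the "userMetadata" option is scoped to this single write,
// which is the per-operation behaviour being asked for on the DeltaTable API.
updatesDF.write
  .format("delta")
  .mode("append")
  .option("userMetadata", "backfill batch")   // placeholder metadata string
  .save("/tmp/delta/events")                  // placeholder path

// DeltaTable route: delete and update execute immediately, so there is no builder
// step left where an .option(...) call could be attached before execution.
DeltaTable.forPath("/tmp/delta/events")
  .delete("date < '2023-01-01'")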
If there's no objection, I'll implement what's suggested, then.