
Unable to write metadata to delta table using scala api

GeekSheikh opened this issue 3 years ago · 14 comments

Extension to #321

When using an API like the one below, the only supported way to write user metadata is to set it on the SparkSession. That works until a single Spark session performs multiple writes simultaneously; at that point, manipulating the session configuration can no longer attach accurate metadata to each individual commit. Could we find a way to make metadata writes possible at the writer level, even when using the following syntax? Thanks

cc @zsxwing

DeltaTable.forPath(target.tableLocation).alias("target")
  .merge(updatesDF, mergeCondition)
  .whenMatched
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute()
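For context, the only workaround today is session-level configuration. A sketch of what that looks like, assuming `updatesDF` and `mergeCondition` are defined as in the snippet above (`spark.databricks.delta.commitInfo.userMetadata` is the documented Delta config key):

```scala
import io.delta.tables.DeltaTable

// Current workaround: user metadata can only be set on the SparkSession,
// so it applies to every commit made through this session until changed.
spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "merge from upstream batch")

DeltaTable.forPath(spark, target.tableLocation).alias("target")
  .merge(updatesDF, mergeCondition)
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()

// If another thread sets a different value on the same session before
// execute() commits, the wrong metadata can be recorded -- the race
// this issue is about.
```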

Thank you

GeekSheikh avatar Dec 07 '21 21:12 GeekSheikh

can you assign it to me? thanks @zsxwing

dragongu avatar May 14 '22 12:05 dragongu

@dragongu Done. Thanks for offering the help!

zsxwing avatar May 17 '22 18:05 zsxwing

I like the feature request and would like to see it available for the MERGE INTO statement as well as DML statements like INSERT/UPDATE.

There is another workaround for the case of concurrent writes. A Spark session can be cloned, and the clone can carry different configuration values. This is used behind the scenes in Structured Streaming.

So maybe we can follow the same approach and have something like:

session_clone = spark.newSession()
session_clone.conf.set("spark.databricks.delta.commitInfo.userMetadata", "metadata message")
session_clone.sql("INSERT INTO ...")
commit_version = session_clone.conf.get("spark.databricks.delta.lastCommitVersionInSession")

The above is in Python, but the same API is available in Scala, plus an additional spark.cloneSession(). There seems to be a slight difference between cloneSession and newSession, but PySpark has newSession only.
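A Scala sketch of the same workaround (assumes an existing `spark` session and a Delta table named `delta_tbl`; the two config keys are the documented Delta ones):

```scala
// Clone/fork the session so the metadata setting is isolated from
// any concurrent writers sharing the original SparkSession.
val sessionClone = spark.newSession()
sessionClone.conf.set("spark.databricks.delta.commitInfo.userMetadata", "metadata message")
sessionClone.sql("INSERT INTO delta_tbl VALUES (1)")
val commitVersion =
  sessionClone.conf.get("spark.databricks.delta.lastCommitVersionInSession")
```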

ylashin avatar Jun 28 '22 09:06 ylashin

Hi @GeekSheikh - can you:
a) update the issue description to clarify that the only way to do this with the table API today is spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "test"), which is what breaks when multiple writes occur concurrently, and
b) describe what sort of API you are looking for? DeltaTable.withUserMetadata(...).merge(...)?

scottsand-db avatar Aug 26 '22 15:08 scottsand-db

@keen85 proposed the following APIs in #1383 (I'm going to close that issue and use this ticket to track any new updates)

DeltaTable.delete(condition, userMetadata=None)
DeltaTable.update(condition, set, userMetadata=None)
DeltaTable.merge(source, condition, userMetadata=None)
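If that proposal were adopted, a Scala caller might use it along these lines. This is entirely hypothetical: the userMetadata parameter does not exist in any released Delta version, and deltaTable/updatesDF are placeholders:

```scala
import org.apache.spark.sql.functions.expr

// Hypothetical API: metadata travels with the operation, not the session,
// so concurrent writers on one SparkSession cannot clobber each other.
deltaTable.delete(expr("status = 'expired'"), userMetadata = "nightly cleanup")
deltaTable.update(expr("id = 1"), Map("flag" -> expr("true")), userMetadata = "backfill")
deltaTable.merge(updatesDF, "target.id = source.id", userMetadata = "batch 42")
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()
```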

zsxwing avatar Sep 19 '22 17:09 zsxwing

@zsxwing @scottsand-db is there any update on when this will be available?

jaketherianos avatar Sep 30 '22 13:09 jaketherianos

@jaketherianos, I think that nobody is actively working on it 🙁 I'd really love to see this feature but I'm not a software engineer.

keen85 avatar Sep 30 '22 13:09 keen85

any progress on this feature?

EkhartScheifes avatar Dec 16 '22 15:12 EkhartScheifes

I would like to take this feature on, if no one has claimed it; I use delta lake a lot professionally and would like this feature as well.

PeterDowdy avatar Jan 12 '23 20:01 PeterDowdy

@PeterDowdy i'm interested in your opinion on the API proposed in #1383

DeltaTable.delete(condition, userMetadata=None)
DeltaTable.update(condition, set, userMetadata=None)
DeltaTable.merge(source, condition, userMetadata=None)

Would be best to first get consensus on the API before you start implementing it.

@EkhartScheifes any input?

scottsand-db avatar Jan 12 '23 20:01 scottsand-db

I'm thinking about it, although the .delete and .update APIs lock us into something like that signature. Other Spark APIs follow an .option() call convention that would be nice to mimic, e.g. .option("userMetadata", "your_metadata_here"), but since delete and update execute immediately, a chained option call isn't possible there. I'd be willing to move forward with adding these as arguments unless someone has a suggestion that keeps the call more consistent with the .option() API.
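For the merge path, where a builder already exists and execution is deferred, an option-style variant is at least conceivable. A hypothetical sketch only (no withUserMetadata method exists on the merge builder today):

```scala
import io.delta.tables.DeltaTable

// Hypothetical builder-style API: viable for merge because nothing runs
// until execute(), unlike delete/update which execute immediately.
DeltaTable.forPath(spark, "/delta/target").alias("target")
  .merge(updatesDF, "target.id = source.id")
  .withUserMetadata("batch 42")   // hypothetical method
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()
```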

PeterDowdy avatar Jan 12 '23 20:01 PeterDowdy

If there's no objection, I'll implement what's suggested, then.

PeterDowdy avatar Jan 20 '23 17:01 PeterDowdy