Scott Sandre
Posting for visibility: this PR is currently blocked by https://github.com/delta-io/connectors/pull/425, as we need a way to provide options to the sink API to be able to wrap this feature behind...
@horizonzy - when we create the checkpoint, we use `snapshot.allFilesScala` (see `Checkpoints.scala`). To generate `snapshot.allFilesScala`, we perform an in-memory-log-replay, where we keep track of the AddFiles seen so far using...
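For illustration only, the in-memory log replay described above can be sketched roughly as follows. This is not the actual Delta Standalone code: the `AddFile`/`RemoveFile` classes and the `replay` function are simplified, hypothetical stand-ins, written in Python for brevity, and the sketch assumes actions are keyed by file path only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddFile:
    # Hypothetical, simplified stand-in for a Delta AddFile action.
    path: str
    size: int


@dataclass(frozen=True)
class RemoveFile:
    # Hypothetical, simplified stand-in for a Delta RemoveFile action.
    path: str


def replay(commits):
    """Replay commits oldest to newest; return the active AddFiles keyed by path."""
    active = {}
    for actions in commits:
        for action in actions:
            if isinstance(action, AddFile):
                # A later AddFile for the same path replaces the earlier one.
                active[action.path] = action
            elif isinstance(action, RemoveFile):
                # A RemoveFile drops the path from the active set, if present.
                active.pop(action.path, None)
    return active


commits = [
    [AddFile("a.parquet", 10), AddFile("b.parquet", 20)],
    [RemoveFile("a.parquet"), AddFile("c.parquet", 30)],
    [AddFile("b.parquet", 25)],
]

# The files that would go into a checkpoint in this toy model.
all_files = list(replay(commits).values())
```

In this toy model the result follows the map's iteration order rather than any explicit sort key.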
@horizonzy - perhaps. But this is only one delta client. We don't know who wrote the previous json files or checkpoints. If another delta client wrote the previous checkpoint, and...
Hi @horizonzy - we would need all clients to do this. Else, when you read a checkpoint, you wouldn't know which client wrote it and if it is sorted. Enforcing...
@horizonzy can you partition your data? We provide `snapshot.scan(Expression)` APIs to let you partition prune.
@horizonzy what if we added an API/config that sorted the data on read? Also, what would we sort it by?
Hi @gopik - sorry for the delay. I've confirmed this issue myself, too. Want to copy over https://github.com/delta-io/delta/blob/master/core/src/test/scala/org/apache/spark/sql/delta/ActionSerializerSuite.scala from the delta-io/delta repository? And perhaps investigate the fix?
@kristoffSC - have we seen any of the Scala binary compatibility issues mentioned above? Have we tested on a real cluster?
@kristoffSC - sounds good. I added the action item `re-visit artifact packaging or explore this subject a little bit` to the Flink SQL/TableAPI/Catalog issue. https://github.com/delta-io/connectors/issues/238