delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

Separating the log from the state

Open dispanser opened this issue 3 years ago • 6 comments

In the current architecture, the concepts of the delta table and the delta log are available in one single abstraction, DeltaTable. To update to the newest state, one can call update() and the table state follows accordingly.

However, I can see several use cases, mostly around stream processing, that are not actually interested in the current table state, but instead in the stream of changes. For such a use case, subscribing to a stream of actions would probably provide the better abstraction.

Ultimately, DeltaTable is just a representation of the aggregated state of the log up to a specific commit, so it would just be another subscriber to the delta log changes.

I wonder if it would make sense to have a first-class citizen, DeltaLog, exposing commits as a stream of batches of actions.

dispanser avatar Mar 20 '21 07:03 dispanser

I think that's a good idea for read only streaming use-case :) The log stream abstraction can expose both pull and push based interfaces. Then the existing DeltaTable implementation can be refactored into a pull based consumer. Read-only streaming consumers can use the push interface to react to new log entries in near realtime.

houqp avatar Mar 20 '21 20:03 houqp

A pull-based interface would probably be good enough as a start. With a push-based interface, someone has to poll the underlying file system anyway (assuming something like inotify is not practical for cloud-based storage backends), and I would leave that to the consumer.

I'll play with this idea a bit to see if it leads anywhere.

dispanser avatar Mar 22 '21 20:03 dispanser

Just to make sure we are talking about pull/push at the same abstraction level. In my first comment, I was referring to pull and push interfaces at the application level. For example, a pull interface, or API if you prefer, means the application code needs to explicitly call update method to pull latest changes from the transaction log at its own pace. While a push interface would let the application register callbacks or set up an event loop to process new log entries as them come in.

When it comes to actually implementing the push interface, we also have the choice of leveraging push or pull semantics at the storage level. For example, for local file system backend on Linux platform, we could leverage inotify to achieve real time push implementation. For backends like S3, we would need to actually pull new values from latest_version S3 object to simulate the push. There is also the option of using S3 event notification to achieve real time push with S3, but that's another topic ;)

houqp avatar Mar 22 '21 23:03 houqp

@dispanser Azure Databricks has "autoloader" feature, that resembles inotify behavior. Did you already try that?

nfx avatar May 24 '21 16:05 nfx

@houqp there's autoloader on aws as well :)

nfx avatar May 24 '21 16:05 nfx

related #661

roeap avatar Sep 07 '22 07:09 roeap