delta
[Spark] Managed Commits: add a DynamoDB-based commit owner
Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Description
Taking inspiration from https://github.com/delta-io/delta/pull/339, this PR adds a Commit Owner Client which uses DynamoDB as the backend. Each Delta table managed by a DynamoDB instance will have one corresponding entry in a DynamoDB table. The table schema is as follows:
- tableId: String --- The unique identifier for the entry. This is a UUID.
- path: String --- The fully qualified path of the table in the file system. e.g. s3://bucket/path.
- acceptingCommits: Boolean --- Whether the commit owner is accepting new commits. This will only be set to false when the table is converted from managed commits to file system commits.
- tableVersion: Number --- The version of the latest commit.
- tableTimestamp: Number --- The inCommitTimestamp of the latest commit.
- schemaVersion: Number --- The version of the schema used to store the data.
- commits: --- The list of unbackfilled commits. Each element has the following fields:
  - version: Number --- The version of the commit.
  - inCommitTimestamp: Number --- The inCommitTimestamp of the commit.
  - fsName: String --- The name of the unbackfilled file.
  - fsLength: Number --- The length of the unbackfilled file.
  - fsTimestamp: Number --- The modification time of the unbackfilled file.
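
To make the shape of one entry concrete, here is a minimal sketch that builds it as nested Java maps. It is an illustration only: `TableEntryExample` and its method names are hypothetical and not part of this PR, and it deliberately avoids the AWS SDK so that it runs standalone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: models one entry with plain maps, no AWS SDK.
final class TableEntryExample {

    // Builds one element of the `commits` list (an unbackfilled commit).
    static Map<String, Object> unbackfilledCommit(long version, long inCommitTimestamp,
            String fsName, long fsLength, long fsTimestamp) {
        Map<String, Object> commit = new LinkedHashMap<>();
        commit.put("version", version);
        commit.put("inCommitTimestamp", inCommitTimestamp);
        commit.put("fsName", fsName);
        commit.put("fsLength", fsLength);
        commit.put("fsTimestamp", fsTimestamp);
        return commit;
    }

    // Builds the top-level entry for one Delta table.
    static Map<String, Object> entry(String tableId, String path, long tableVersion,
            long tableTimestamp, long schemaVersion, List<Map<String, Object>> commits) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("tableId", tableId);
        entry.put("path", path);
        entry.put("acceptingCommits", true); // only false after conversion to file system commits
        entry.put("tableVersion", tableVersion);
        entry.put("tableTimestamp", tableTimestamp);
        entry.put("schemaVersion", schemaVersion);
        entry.put("commits", new ArrayList<>(commits)); // defensive copy
        return entry;
    }
}
```

For example, `TableEntryExample.entry(id, "s3://bucket/path", 5L, ts, 1L, commits)` with a single element from `unbackfilledCommit(5L, ts, name, length, mtime)` describes a table whose latest commit, version 5, has not yet been backfilled. All values here are made up for illustration.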
For a table to be managed by DynamoDB, `registerTable` must be called for that Delta table. This will create a new entry in the db for this Delta table. Every `commit` invocation appends the UUID delta file status to the `commits` list in the table entry. `commit` is performed through a conditional write in DynamoDB.
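
The register-then-commit flow can be sketched with an in-memory stand-in for the DynamoDB table. This is not the client added by this PR: `InMemoryCommitOwner` is a hypothetical class, the method signatures are simplified, and the condition used here (the stored `tableVersion` must equal the attempted version minus one, and `acceptingCommits` must be true) is an assumption about what the conditional write checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the DynamoDB table; NOT the PR's client.
final class InMemoryCommitOwner {

    // Minimal entry holding only the fields the condition needs.
    static final class Entry {
        final String path;
        boolean acceptingCommits = true;
        long tableVersion;
        final List<String> commits = new ArrayList<>(); // fsName of each unbackfilled commit

        Entry(String path, long tableVersion) {
            this.path = path;
            this.tableVersion = tableVersion;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    // Creates the entry for a Delta table and returns its tableId (a UUID).
    String registerTable(String path, long currentVersion) {
        String tableId = UUID.randomUUID().toString();
        entries.put(tableId, new Entry(path, currentVersion));
        return tableId;
    }

    // Emulates a conditional write: applied only if the assumed condition holds.
    // Returns false when the condition fails, e.g. another writer won the race.
    boolean commit(String tableId, long attemptVersion, String fsName) {
        Entry e = entries.get(tableId);
        if (e == null) {
            throw new IllegalArgumentException("unregistered table: " + tableId);
        }
        synchronized (e) {
            if (!e.acceptingCommits || e.tableVersion != attemptVersion - 1) {
                return false;
            }
            e.commits.add(fsName);
            e.tableVersion = attemptVersion;
            return true;
        }
    }

    long latestVersion(String tableId) {
        Entry e = entries.get(tableId);
        synchronized (e) {
            return e.tableVersion;
        }
    }
}
```

With this sketch, two writers that both attempt version 1 on a table registered at version 0 cannot both succeed: the first `commit` returns true and advances `tableVersion`, and the second returns false because its condition no longer holds. In a real DynamoDB backend the check and the update would be a single conditional request evaluated by the service rather than a `synchronized` block.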
How was this patch tested?
Added a new suite called `DynamoDBCommitOwnerClient5BackfillSuite`, which uses a mock DynamoDB client. Also tested manually against a DynamoDB instance.