RFC: S3 concurrent writer via DynamoDB
Context
S3 does not support transactional operations like `*_if_not_exist`.
We would like to support transactional operations on S3 for concurrent writers.
Requirements
- Implement a native locking mechanism in Lance to support `*_if_not_exist` on S3.
- This new mechanism should be usable in all languages Lance binds to.
- Ensure we still have snapshot isolation (unsuccessful and in-flight commits should not be visible).
- (Nice to have) Pave the road for redis/sql based commit mechanisms.
Why DynamoDB?
Out of the three major cloud providers, AWS (S3), GCP (GCS), and Azure (Blob Storage), S3 is the only one that does not provide transactional semantics in its APIs. This means it is acceptable to use an AWS-native solution, as we do not need to port this locking mechanism to other cloud providers.
(minio, http, et al. will probably use redis/sql commit mechanism)
Interface
Since this is an S3-only feature, it doesn't make much sense to add more params to `ObjectStoreParams` or `[Read|Write]Params`.
The proposed solution is to pass the DynamoDB parameters as part of the S3 URI with the following syntax:
s3+ddb://<bucket>/<path>?ddbTableName=<ddb_table_name>
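For illustration, here is a minimal sketch of what this could look like from the Python bindings, assuming the `s3+ddb://` scheme is wired up as proposed; the bucket, dataset path, and table name below are made up.

```python
import lance
import pyarrow as pa

# Hypothetical bucket, dataset path, and DynamoDB table name; the only change
# from a plain S3 URI is the scheme plus the ddbTableName query parameter.
uri = "s3+ddb://my-bucket/my_dataset.lance?ddbTableName=my-commit-table"

table = pa.table({"id": [1, 2, 3]})
lance.write_dataset(table, uri)  # commits are coordinated through DynamoDB
ds = lance.dataset(uri)          # reads resolve the latest version via DynamoDB
print(ds.version)
```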
Problems with dynamodb-lock-rs
The library does not support unexpirable locks. This means the library can only support a synchronous* fail-stop model. (*: the lock lease duration effectively converts async consensus into sync consensus, with time as epoch boundaries.)
This can cause data corruption under the pause or network-partition failure models (the lock can expire while the manifest commit is still in progress).
Problem with unexpirable locks
They can deadlock after a crash and are recoverable only by manual intervention.
Crux of the problem
Having two sources of truth makes consensus with a strong consistency requirement impossible (S3 is the source of truth for the manifest version, while DynamoDB is the source of truth for which writer's manifest is valid).
Proposed solution
When committing to S3, we can use the following procedure.
For procedure `commit(manifest: Manifest)`, do:
1. Write the manifest we want to commit to S3 with a temporary name, e.g. `dataset.lance/_versions/staging/152a8c85-e4e7-4a1f-96d1-1cef1c6dbd9d.manifest`.
2. Try to commit this version to DynamoDB with an `attribute_not_exists(PK) AND attribute_not_exists(SK)` condition, where the committed item would look like:
   {
     pk: {table_uri},        <-- String
     sk: {version},          <-- int
     committer: <hostname>,  <-- String
     uri: <temporary_uri>    <-- String
   }
3. If the DynamoDB commit fails, return an error.
4. The writer proceeds to call `copy` and copies the manifest from the temporary path to `dataset.lance/_versions/{version}.manifest`.
5. Update the DynamoDB item with the final path `dataset.lance/_versions/{version}.manifest`.
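To make the five steps concrete, here is a minimal boto3 sketch of the same flow. It is an illustration under assumed names (the bucket, key layout, `my-commit-table` table, and the `commit` helper signature are all made up), not the actual Rust implementation. The DynamoDB table is assumed to have `pk` as a string hash key and `sk` as a numeric range key.

```python
import socket
import uuid

import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

def commit(bucket: str, table_uri: str, version: int, manifest_bytes: bytes,
           ddb_table: str = "my-commit-table") -> None:
    # Step 1: write the manifest to a uniquely named staging key on S3.
    staging_key = f"{table_uri}/_versions/staging/{uuid.uuid4()}.manifest"
    s3.put_object(Bucket=bucket, Key=staging_key, Body=manifest_bytes)

    final_key = f"{table_uri}/_versions/{version}.manifest"

    # Step 2: reserve (table, version) in DynamoDB; the condition rejects the
    # put if another writer already committed this version.
    # Step 3: a ConditionalCheckFailedException here surfaces to the caller,
    # which retries the whole commit with version + 1.
    ddb.put_item(
        TableName=ddb_table,
        Item={
            "pk": {"S": table_uri},
            "sk": {"N": str(version)},
            "committer": {"S": socket.gethostname()},
            "uri": {"S": staging_key},
        },
        ConditionExpression="attribute_not_exists(pk) AND attribute_not_exists(sk)",
    )

    # Step 4: copy the staged manifest to its final, versioned path on S3.
    s3.copy_object(Bucket=bucket, Key=final_key,
                   CopySource={"Bucket": bucket, "Key": staging_key})

    # Step 5: point the DynamoDB record at the final path so readers know
    # steps 4 and 5 completed.
    ddb.update_item(
        TableName=ddb_table,
        Key={"pk": {"S": table_uri}, "sk": {"N": str(version)}},
        UpdateExpression="SET #u = :final",
        ExpressionAttributeNames={"#u": "uri"},
        ExpressionAttributeValues={":final": {"S": final_key}},
    )
```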
What if step 4 fails?
To address this problem, we should always load the manifest through DynamoDB. When loading a manifest, we can use the following procedure:
For procedure `checkout_latest(uri: &str)`, do:
1. The loader reaches out to DynamoDB first and finds the latest version V of the dataset.
2. Ensure steps 4 and 5 of the commit procedure are done; if not, retry steps 4 and 5 (a sketch follows below).
This checkout procedure can guarantee that:
- When DynamoDB and S3 go out of sync, S3 is behind by at most one version
- All operations are blocked until the two stores are in sync again
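Here is the matching read-side sketch, under the same illustrative key layout and table name as the commit sketch above; again an assumption-laden illustration, not the actual implementation.

```python
import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

def checkout_latest(bucket: str, table_uri: str,
                    ddb_table: str = "my-commit-table") -> str:
    """Resolve the latest manifest, repairing S3 if the last writer died mid-commit."""
    # Step 1: ask DynamoDB (strongly consistent read) for the newest version.
    resp = ddb.query(
        TableName=ddb_table,
        KeyConditionExpression="pk = :p",
        ExpressionAttributeValues={":p": {"S": table_uri}},
        ScanIndexForward=False,  # newest version first
        Limit=1,
        ConsistentRead=True,
    )
    item = resp["Items"][0]
    version = int(item["sk"]["N"])
    final_key = f"{table_uri}/_versions/{version}.manifest"

    # Step 2: if commit steps 4/5 never finished, redo them before returning.
    if item["uri"]["S"] != final_key:
        s3.copy_object(Bucket=bucket, Key=final_key,
                       CopySource={"Bucket": bucket, "Key": item["uri"]["S"]})
        ddb.update_item(
            TableName=ddb_table,
            Key={"pk": {"S": table_uri}, "sk": {"N": str(version)}},
            UpdateExpression="SET #u = :final",
            ExpressionAttributeNames={"#u": "uri"},
            ExpressionAttributeValues={":final": {"S": final_key}},
        )
    return final_key
```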
Caveats
This solution changes the source of truth for manifests from S3 to DynamoDB. This means:
- Readers and writers must both use DynamoDB to ensure consistency
- To off-board DynamoDB, users would have to make sure S3 and DynamoDB are in sync before they can safely off-board
Execution Plan
- [x] add `checkout` method to `CommitHandler`
- [x] implement `ExternalManifestCommitHandler`
- [x] implement `DynamoDBS3CommitHandler`
- [x] implement parsing in `io::ObjectStore` to recognize the `s3+ddb://` scheme
- [ ] #1206
cc: @wjones127 @westonpace @eddyxu
This looks good. I wish we didn't have to modify the read path, but there are some edge cases that make it necessary.
This sounds interesting!
I'm not sure I fully understand it. It seems DynamoDB is the source of truth in the system; however, the commit is only regarded as a "success" and becomes visible to readers once the data is finally "post-committed" in S3? And this will not change the reader's process, it just makes the writer handle more write round-trips?
@mapleFU Yeah, roughly. This makes DynamoDB the source of truth and S3 a replica, but with some constraints on how out-of-sync the two stores can be. On the read path, the reader will try to sync the two stores and refuse to load if they cannot be synced. As long as the two stores are in sync, users can freely off-board DynamoDB.
See #1190 for the commit logic
Are there any rough back-of-the-envelope estimates on how this will work out in terms of cost compared to S3 alone or other workarounds? Another workaround is to switch to AWS EFS, but for analytical workloads that read/write the whole dataset frequently, the elastic read and write bandwidth charge is horrific, like >50x the cost of S3 so far in my hands (the storage cost is ~7x, but this is insignificant compared to the above).
If I understand correctly, the proposed fix only stores on DynamoDB a subset of what I'd normally see in my Lance S3 directory (the manifest), rather than a full copy of the data? If anyone wanted to provide guesses on what fraction of 1) GB storage and 2) read/write requests would route to DynamoDB compared to a just-S3 setup, I'm happy to share my napkin math on total relative cost, as I need this to figure out the viability of LanceDB at my company.
By the way, if there are a lot of read/write requests going to DynamoDB but a low total read/write throughput in cumulative GB read/written, and a medium-ish total GB stored, then a commit mechanism in EFS with S3 as main storage might be another cost-effective option. As far as I know, EFS does not charge per request (although there may be annoying IOPS limits, not sure) and the storage cost is not so bad, so as long as the volume of data transferred to/from EFS is low, it could maybe allow more read/write requests per dollar.
Hi @sandias42, with the DynamoDB manifest store, the expected amount of I/O is as follows.
Read:
- get latest dataset -- 1 full read (we always use consistent reads)
- get version N -- no read, hits the object store directly
Commit:
- no conflict -- 2 writes (1 for the initial temp commit, 1 for finalization)
- N conflicts -- 2 * (N + 1) writes
The total write IOPS is num_commits/s multiplied by the figures above.
Regarding read/write size, we only commit a path-like value such as `_version/3.manifest-<tmp-id>` to DynamoDB, so the throughput utilization should be very low.
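As a quick worked example of that arithmetic (the commit rate and conflict level below are made-up numbers, not measurements):

```python
# 2 DynamoDB writes per conflict-free commit; 2 * (N + 1) writes when a commit
# retries through N conflicts, per the figures above.
commits_per_second = 5        # hypothetical workload
conflicts_per_commit = 1      # hypothetical contention level

writes_per_commit = 2 * (conflicts_per_commit + 1)
print(commits_per_second * writes_per_commit)  # 20 write requests/s for this example
```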
This is indeed very interesting and useful in my opinion. I could see usage documentation for this in the LanceDB guides on storage options. However, I don't understand if this feature is fully available already or if it's still in development. I tried a very quick implementation of S3 + DynamoDB but was encountering commit storage errors on table creation. Can you confirm whether this is still expected since the tasks are not all done, or am I doing something wrong?
Thank you
Hi,
Could you share a simple repro and the errors you see?
Hi @chebbyChefNEQ, thank you for the quick reply. As I'm writing this, I managed to run some more tests that actually succeeded; the only thing I believe I changed was the permissions on the DynamoDB table (I was using an already-created table with only read permissions, but I clearly forgot that I need to write to it as well; apologies for the rookie mistake here).
Do we still need this now that S3 supports put-if-not-exists? object_store has removed this support too.
We can close this issue as it's already implemented. My understanding is there are still a few "S3 compatible" solutions out there which haven't added put-if-not-exists and so we might want to hang onto it for a bit.