RFC: S3 concurrent writer via DynamoDB
Context
S3 does not support transactional operations like `*_if_not_exist`.
We would like to support transactional operations on S3 for concurrent writers.
Requirements
- Implement a native locking mechanism in Lance to support `*_if_not_exist` on S3.
- This new mechanism should be usable in all languages Lance binds to.
- Ensure we still have snapshot isolation (unsuccessful and in-flight commits should not be visible).
- (Nice to have) Pave the road for redis/sql based commit mechanisms.
Why DynamoDB?
Out of the three major cloud providers, AWS (S3), GCP (GCS), and Azure (Blob Storage), S3 is the only one that does not provide transactional semantics in its APIs. This means it is acceptable to use an AWS-native solution, as we do not need to port this locking mechanism to other cloud providers.
(minio, http, et al. will probably use redis/sql commit mechanism)
Interface
Since this is an S3-only feature, it doesn't make much sense to add more params to `ObjectStoreParams` or `[Read|Write]Params`.
The proposed solution is to pass the DynamoDB parameters as part of the S3 URI with the following syntax:
s3+ddb://<bucket>/<path>?ddbTableName=<ddb_table_name>
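For illustration, here is a minimal sketch of what this could look like from the Python bindings, assuming the `s3+ddb://` scheme is wired up as proposed; the bucket, dataset path, and table name below are made up.

```python
import lance
import pyarrow as pa

# Hypothetical bucket, dataset path, and DynamoDB table name; the only change
# from a plain S3 URI is the scheme plus the ddbTableName query parameter.
uri = "s3+ddb://my-bucket/my_dataset.lance?ddbTableName=my-commit-table"

table = pa.table({"id": [1, 2, 3]})
lance.write_dataset(table, uri)  # commits are coordinated through DynamoDB
ds = lance.dataset(uri)          # reads resolve the latest version via DynamoDB
print(ds.version)
```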
Problems with dynamodb-lock-rs
The library does not support unexpirable locks. This means the library can only support a synchronous* fail-stop model. (*: the lock lease duration effectively converts async consensus into sync consensus, with time as epoch boundaries.)
This can cause data corruption under the pause or network-partition failure models (the lock can expire while the manifest commit is still in progress).
Problem with unexpirable locks
They can deadlock after a crash and are recoverable only by manual intervention.
Crux of the problem
Having two sources of truth makes consensus with a strong consistency requirement impossible (S3 is the source of truth for the manifest version, while DynamoDB is the source of truth for which writer's manifest is valid).
Proposed solution
When committing to S3, we can use the following procedure.
For procedure `commit(manifest: Manifest)`, do:
1. Write the manifest we want to commit to S3 with a temporary name, e.g. `dataset.lance/_versions/staging/152a8c85-e4e7-4a1f-96d1-1cef1c6dbd9d.manifest`.
2. Try to commit this version to DynamoDB with an `attribute_not_exists(PK) AND attribute_not_exists(SK)` condition, where the committed item would look like:
   {
     pk: {table_uri},        <-- String
     sk: {version},          <-- int
     committer: <hostname>,  <-- String
     uri: <temporary_uri>    <-- String
   }
3. If the DynamoDB commit fails, return an error.
4. The writer proceeds to call `copy` and copies the manifest from the temporary path to `dataset.lance/_versions/{version}.manifest`.
5. Update the DynamoDB item with the final path `dataset.lance/_versions/{version}.manifest`.
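To make the five steps concrete, here is a minimal boto3 sketch of the same flow. It is an illustration under assumed names (the bucket, key layout, `my-commit-table` table, and the `commit` helper signature are all made up), not the actual Rust implementation. The DynamoDB table is assumed to have `pk` as a string hash key and `sk` as a numeric range key.

```python
import socket
import uuid

import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

def commit(bucket: str, table_uri: str, version: int, manifest_bytes: bytes,
           ddb_table: str = "my-commit-table") -> None:
    # Step 1: write the manifest to a uniquely named staging key on S3.
    staging_key = f"{table_uri}/_versions/staging/{uuid.uuid4()}.manifest"
    s3.put_object(Bucket=bucket, Key=staging_key, Body=manifest_bytes)

    final_key = f"{table_uri}/_versions/{version}.manifest"

    # Step 2: reserve (table, version) in DynamoDB; the condition rejects the
    # put if another writer already committed this version.
    # Step 3: a ConditionalCheckFailedException here surfaces to the caller,
    # which retries the whole commit with version + 1.
    ddb.put_item(
        TableName=ddb_table,
        Item={
            "pk": {"S": table_uri},
            "sk": {"N": str(version)},
            "committer": {"S": socket.gethostname()},
            "uri": {"S": staging_key},
        },
        ConditionExpression="attribute_not_exists(pk) AND attribute_not_exists(sk)",
    )

    # Step 4: copy the staged manifest to its final, versioned path on S3.
    s3.copy_object(Bucket=bucket, Key=final_key,
                   CopySource={"Bucket": bucket, "Key": staging_key})

    # Step 5: point the DynamoDB record at the final path so readers know
    # steps 4 and 5 completed.
    ddb.update_item(
        TableName=ddb_table,
        Key={"pk": {"S": table_uri}, "sk": {"N": str(version)}},
        UpdateExpression="SET #u = :final",
        ExpressionAttributeNames={"#u": "uri"},
        ExpressionAttributeValues={":final": {"S": final_key}},
    )
```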
What if step 4 fails?
To address this problem, we should always load the manifest through DynamoDB. When loading a manifest, we can use the following procedure:
For procedure `checkout_latest(uri: &str)`, do:
1. The loader reaches out to DynamoDB first and finds the latest version V of the dataset.
2. Ensure steps 4 and 5 of the commit procedure are done; if not, retry steps 4 and 5 (a sketch follows below).
This checkout procedure can guarantee that:
- When DynamoDB and S3 go out of sync, S3 is behind by at most one version
- All operations are blocked until the two stores are in sync again
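Here is the matching read-side sketch, under the same illustrative key layout and table name as the commit sketch above; again an assumption-laden illustration, not the actual implementation.

```python
import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

def checkout_latest(bucket: str, table_uri: str,
                    ddb_table: str = "my-commit-table") -> str:
    """Resolve the latest manifest, repairing S3 if the last writer died mid-commit."""
    # Step 1: ask DynamoDB (strongly consistent read) for the newest version.
    resp = ddb.query(
        TableName=ddb_table,
        KeyConditionExpression="pk = :p",
        ExpressionAttributeValues={":p": {"S": table_uri}},
        ScanIndexForward=False,  # newest version first
        Limit=1,
        ConsistentRead=True,
    )
    item = resp["Items"][0]
    version = int(item["sk"]["N"])
    final_key = f"{table_uri}/_versions/{version}.manifest"

    # Step 2: if commit steps 4/5 never finished, redo them before returning.
    if item["uri"]["S"] != final_key:
        s3.copy_object(Bucket=bucket, Key=final_key,
                       CopySource={"Bucket": bucket, "Key": item["uri"]["S"]})
        ddb.update_item(
            TableName=ddb_table,
            Key={"pk": {"S": table_uri}, "sk": {"N": str(version)}},
            UpdateExpression="SET #u = :final",
            ExpressionAttributeNames={"#u": "uri"},
            ExpressionAttributeValues={":final": {"S": final_key}},
        )
    return final_key
```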
Caveats
This solution changes the source of truth for manifests from S3 to DynamoDB. This means:
- Readers and writers must both use DynamoDB to ensure consistency
- To off-board DynamoDB, users would have to make sure S3 and DynamoDB are in sync before they can safely off-board
Execution Plan
- [x] add `checkout` method to `CommitHandler`
- [x] implement `ExternalManifestCommitHandler`
- [x] implement `DynamoDBS3CommitHandler`
- [x] implement parsing in `io::ObjectStore` to recognize the `s3+ddb://` scheme
- [ ] #1206
cc: @wjones127 @westonpace @eddyxu
This looks good. I wish we didn't have to modify the read path, but there are some edge cases that make it necessary.
This sounds interesting!
I'm not sure I fully understand it. It seems DynamoDB is the source of truth in the system; however, the commit is only regarded as a "success" and becomes visible to readers once the data is finally "post-committed" in S3? And this will not change the reader's process, it just makes the writer handle more write round-trips?
@mapleFU Yeah, roughly. This makes DynamoDB the source of truth and S3 a replica, but with some constraints on how out-of-sync the two stores can be. On the read path, the reader will try to sync the two stores and refuse to load if they cannot be synced. As long as the two stores are in sync, users can freely off-board DynamoDB.
See #1190 for the commit logic
Are there any rough back-of-the-envelope estimates on how this will work out in terms of cost compared to S3 alone or other workarounds? Another workaround is to switch to AWS EFS, but for analytical workloads that read/write the whole dataset frequently, the elastic read and write bandwidth charge is horrific, like >50x the cost of S3 so far in my hands (the storage cost is ~7x, but this is insignificant compared to the above).
If I understand correctly, the proposed fix only stores on DynamoDB a subset of what I'd normally see in my Lance S3 directory (the manifest), rather than a full copy of the data? If anyone wanted to provide guesses on what fraction of 1) GB storage and 2) read/write requests would route to DynamoDB compared to a just-S3 setup, I'm happy to share my napkin math on total relative cost, as I need this to figure out the viability of LanceDB at my company.
By the way, if there are a lot of read/write requests going to DynamoDB but a low total read/write throughput in cumulative GB read/written, and a medium-ish total GB stored, then a commit mechanism in EFS with S3 as main storage might be another cost-effective option. As far as I know, EFS does not charge per request (although there may be annoying IOPS limits, not sure) and the storage cost is not so bad, so as long as the volume of data transferred to/from EFS is low, it could maybe allow more read/write requests per dollar.
Hi @sandias42, with the DynamoDB manifest store, the expected amount of I/O is as follows.
Read:
- get latest dataset -- 1 full read (we always use consistent reads)
- get version N -- no read, hits the object store directly
Commit:
- no conflict -- 2 writes (1 for the initial temp commit, 1 for finalization)
- N conflicts -- 2 * (N + 1) writes
The total write IOPS is num_commits/s multiplied by the figures above.
Regarding read/write size, we only commit a path-like value such as `_version/3.manifest-<tmp-id>` to DynamoDB, so the throughput utilization should be very low.
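As a quick worked example of that arithmetic (the commit rate and conflict level below are made-up numbers, not measurements):

```python
# 2 DynamoDB writes per conflict-free commit; 2 * (N + 1) writes when a commit
# retries through N conflicts, per the figures above.
commits_per_second = 5        # hypothetical workload
conflicts_per_commit = 1      # hypothetical contention level

writes_per_commit = 2 * (conflicts_per_commit + 1)
print(commits_per_second * writes_per_commit)  # 20 write requests/s for this example
```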
This is indeed very interesting and useful in my opinion. I could see usage documentation for this in the LanceDB guides on storage options. However, I don't understand if this feature is fully available already or if it's still in development. I tried a very quick implementation of S3 + DynamoDB but was encountering commit storage errors on table creation. Can you confirm whether this is still expected since the tasks are not all done, or am I doing something wrong?
Thank you
Hi,
Could you share a simple repro and the errors you see?
Hi @chebbyChefNEQ, thank you for the quick reply. As I'm writing this, I managed to run some more tests that actually succeeded; the only thing I believe I changed was the permissions on the DynamoDB table (I was using an already-created table with only read permissions, but I clearly forgot that I need to write to it as well; apologies for the rookie mistake here).
Do we still need this now that S3 supports put-if-not-exists? object_store has removed this support too.
We can close this issue as it's already implemented. My understanding is there are still a few "S3 compatible" solutions out there which haven't added put-if-not-exists and so we might want to hang onto it for a bit.