
Document how to configure dynamodb lock client

Open wjones127 opened this issue 2 years ago • 17 comments

Description

Although we have an error message telling users to configure the Lock client if they want concurrent writes with S3, we don't have any documentation on how to do that. We should also provide general advice on concurrency, like not mixing different connectors in concurrent writers.

See conversation: https://delta-users.slack.com/archives/C013LCAEB98/p1674435354811639

Use Case

Related Issue(s)

We probably shouldn't do this until we improve the conflict resolution, though. https://github.com/delta-io/delta-rs/issues/593

wjones127 avatar Jan 23 '23 03:01 wjones127

@MrPowers this would probably be a good thing to blog about once the conflict resolution is improved. Concurrent writes are definitely something you can't do with plain Parquet tables. 😉

wjones127 avatar Jan 23 '23 03:01 wjones127

Let me know if I can help you in this, we'll need this feature. 🙂

LucaSoato avatar Jan 25 '23 16:01 LucaSoato

@wjones127 - feel free to assign me to this issue. I will be happy to create the docs when #593 is finished.

MrPowers avatar Jan 26 '23 19:01 MrPowers

Hi folks, is it possible to have a draft document first so that everyone can start trying it out and provide feedback? Or is there already a guide somewhere else? Thanks! 😃

  • Currently https://github.com/delta-io/delta-rs/tree/main/dynamodb_lock does not say anything about how to set it up.
  • As Delta Users' Slack is on the free plan, we won't be able to see https://delta-users.slack.com/archives/C013LCAEB98/p1674435354811639, which @wjones127 refers to in the first post.

hongbo-miao avatar May 12 '23 18:05 hongbo-miao

I'm looking for documentation on how to set up the LockClient in Python as well.

yuhanz avatar Dec 04 '23 18:12 yuhanz

In crates/deltalake-core/src/test_utils.rs, it seems you just need to set a few environment variables, pointing to a DynamoDB table via DYNAMO_LOCK_TABLE_NAME:

set_env_if_not_set(s3_storage_options::AWS_ACCESS_KEY_ID, "deltalake");
set_env_if_not_set(s3_storage_options::AWS_SECRET_ACCESS_KEY, "weloverust");
set_env_if_not_set("AWS_DEFAULT_REGION", "us-east-1");
set_env_if_not_set(s3_storage_options::AWS_REGION, "us-east-1");
set_env_if_not_set(s3_storage_options::AWS_S3_LOCKING_PROVIDER, "dynamodb");
set_env_if_not_set("DYNAMO_LOCK_TABLE_NAME", "test_table");
set_env_if_not_set("DYNAMO_LOCK_REFRESH_PERIOD_MILLIS", "100");
set_env_if_not_set("DYNAMO_LOCK_ADDITIONAL_TIME_TO_WAIT_MILLIS", "100");
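
For anyone wanting to try this outside the test suite, the same variables can be set from Python before the first write. This is only a sketch: the values below mirror the test file and are placeholders, not working credentials.

```python
import os

# Mirrors set_env_if_not_set in crates/deltalake-core/src/test_utils.rs:
# only set a variable if the caller hasn't already provided one.
defaults = {
    "AWS_DEFAULT_REGION": "us-east-1",
    "AWS_REGION": "us-east-1",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DYNAMO_LOCK_TABLE_NAME": "test_table",  # placeholder table name
}
for name, value in defaults.items():
    os.environ.setdefault(name, value)
```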

A different project documents the schema of the DynamoDB lock table: https://github.com/delta-io/kafka-delta-ingest#writing-to-s3

aws dynamodb create-table --table-name delta_rs_lock_table \
    --attribute-definitions \
        AttributeName=key,AttributeType=S \
    --key-schema \
        AttributeName=key,KeyType=HASH \
    --provisioned-throughput \
        ReadCapacityUnits=10,WriteCapacityUnits=10

(The same schema is documented in python/deltalake/writer.py as well)

  • Key Schema: AttributeName=key, KeyType=HASH
  • Attribute Definitions: AttributeName=key, AttributeType=S

However, the Python documentation python/docs/source/usage.rst explicitly says to specify the options in storage_options, so the environment variables may not be required. I am going to give this one a try.

    >>> from deltalake import write_deltalake
    >>> df = pd.DataFrame({'x': [1, 2, 3]})
    >>> storage_options = {'AWS_S3_LOCKING_PROVIDER': 'dynamodb', 'DYNAMO_LOCK_TABLE_NAME': 'custom_table_name'}
    >>> write_deltalake('s3://path/to/table', df, storage_options=storage_options)

yuhanz avatar Dec 04 '23 19:12 yuhanz

@yuhanz hey, did you find the correct solution for Python?

Edit: this worked with deltalake 0.15.1

danielgafni avatar Jan 17 '24 22:01 danielgafni

@danielgafni : I went with storage_options, and it worked well with deltalake 0.13.0.

storage_options = {
    "AWS_DEFAULT_REGION": "us-east-1",
    "AWS_ACCESS_KEY_ID": AWS_ACCESS_KEY_ID,
    "AWS_SECRET_ACCESS_KEY": AWS_SECRET_ACCESS_KEY,
    # "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    'AWS_S3_LOCKING_PROVIDER': 'dynamodb',
    'DYNAMO_LOCK_TABLE_NAME': 'MyLockTable',
}
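
For what it's worth, here is a small stdlib-only sketch that builds the same dictionary as a reusable helper. The function name and its default lock-table name "delta_log" are my own choices for illustration, not a delta-rs API.

```python
def make_lock_storage_options(
    access_key_id: str,
    secret_access_key: str,
    region: str = "us-east-1",
    lock_table: str = "delta_log",  # assumed default lock table name
) -> dict:
    """Build a storage_options dict enabling the DynamoDB locking provider.

    Pass the result to write_deltalake(..., storage_options=...).
    """
    return {
        "AWS_DEFAULT_REGION": region,
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
        "AWS_S3_LOCKING_PROVIDER": "dynamodb",
        "DYNAMO_LOCK_TABLE_NAME": lock_table,
    }
```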

yuhanz avatar Jan 20 '24 07:01 yuhanz

Thanks. I'm on 0.15.1. Just setting the environment variable AWS_S3_LOCKING_PROVIDER=dynamodb worked for me (with the default "delta_log" table name).

danielgafni avatar Jan 20 '24 08:01 danielgafni

I think it's also worth documenting the permissions required to work with a Delta table stored on AWS S3.

In my case, I needed:

  • On the bucket storing the Delta table: s3:GetObject, s3:PutObject, s3:DeleteObject. Permission to delete is needed for temporary files in the log folder, even if you're only appending.
  • On the DynamoDB table: dynamodb:GetItem, dynamodb:Query, dynamodb:PutItem, dynamodb:UpdateItem. I've seen some code that also calls create_table; I don't know whether it's used, but I created the table manually and omitting this permission caused no problems for me.
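
A minimal IAM policy sketch covering just the actions listed above. The bucket name, account ID, region, and table name in the ARNs are placeholders; this is inferred from the list in this comment, not taken from official docs.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-delta-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:Query",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/delta_log"
    }
  ]
}
```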

ale-rinaldi avatar Jan 30 '24 11:01 ale-rinaldi

@wjones127 when using an S3-compatible storage (other than AWS S3), one might have one access/secret key pair for the storage and another for DynamoDB. In this case, how can these two key pairs be provided separately, so that one is used for storage and the other for DynamoDB?

MusKaya avatar Mar 13 '24 19:03 MusKaya

@ale-rinaldi would you mind adding this info to our docs?

ion-elgreco avatar Apr 06 '24 00:04 ion-elgreco

@ale-rinaldi would you mind adding this info to our docs?

@ion-elgreco you are not referring to this, right? Right now we have a real use case for what I described above (using different credentials for S3 and DynamoDB), and I created #2287 for it. If it is already supported, it would be great to have the documentation clarify it. Otherwise we need to accommodate a separate set of credentials for DynamoDB to unblock decoupling DynamoDB from S3.

MusKaya avatar Apr 06 '24 19:04 MusKaya

@ion-elgreco of course! I opened https://github.com/delta-io/delta-rs/pull/2393

ale-rinaldi avatar Apr 06 '24 19:04 ale-rinaldi

Experiencing some issues that may be related to this.

I set up a DynamoDB table using the following command:

aws dynamodb create-table \
    --table-name delta_rs_lock_table \
    --attribute-definitions AttributeName=key,AttributeType=S \
    --key-schema AttributeName=key,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

And running the following example:

import boto3
import pandas as pd

from deltalake import DeltaTable
from deltalake import writer

credentials = boto3.Session().get_credentials().get_frozen_credentials()

storage_options = {
    "AWS_ACCESS_KEY_ID": credentials.access_key,
    "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
    "AWS_SESSION_TOKEN": credentials.token,
    "AWS_REGION": "us-east-1",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DYNAMO_LOCK_PARTITION_KEY_VALUE": "key",
    "DYNAMO_LOCK_TABLE_NAME": "delta_rs_lock_table",
}

df = pd.DataFrame(
    {"x": [1, 2, 3]},
)

output = f"s3://{bucket}/some_delta_lake"
writer.write_deltalake(output, df, storage_options=storage_options)

I receive the following error when running...

[2024-06-03T16:02:48Z ERROR deltalake_aws::logstore] dynamodb client failed to write log entry: GenericDynamoDb { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("ValidationException"), message: Some("One or more parameter values were invalid: Missing the key key in the item"), extras: Some({"aws_request_id": "******"}) }, meta: ErrorMetadata { code: Some("ValidationException"), message: Some("One or more parameter values were invalid: Missing the key key in the item"), extras: Some({"aws_request_id": "******"}) } }) }

Looking at the policies assigned to my AWS account, it seems that I have all the permissions/policies that have been discussed above.

Not sure what I am missing.

kwodzicki avatar Jun 03 '24 16:06 kwodzicki

The published documentation specifies the create-table command as:

aws dynamodb create-table \
    --table-name delta_log \
    --attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
    --key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
  • https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/#dynamodb
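
To make the difference from the earlier single-key command explicit: the published docs expect a composite key, tablePath (HASH) plus fileName (RANGE), not a lone key attribute. A sketch of the same arguments in boto3 form; the create_table call is commented out so nothing is created by accident.

```python
# Composite key schema matching the published delta-rs docs:
# tablePath is the partition (HASH) key, fileName the sort (RANGE) key.
key_schema = [
    {"AttributeName": "tablePath", "KeyType": "HASH"},
    {"AttributeName": "fileName", "KeyType": "RANGE"},
]
attribute_definitions = [
    {"AttributeName": "tablePath", "AttributeType": "S"},
    {"AttributeName": "fileName", "AttributeType": "S"},
]

# import boto3
# boto3.client("dynamodb").create_table(
#     TableName="delta_log",
#     KeySchema=key_schema,
#     AttributeDefinitions=attribute_definitions,
#     BillingMode="PAY_PER_REQUEST",
# )
```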

dhirschfeld avatar Jun 03 '24 21:06 dhirschfeld

Thank you @dhirschfeld, this solved my issue.

kwodzicki avatar Jun 28 '24 14:06 kwodzicki