Data is not being totally written with append using awswrangler to_deltalake with multiple lambdas running in parallel

Open camposvinicius opened this issue 1 year ago • 1 comments

Describe the bug

We created an empty Delta table with PySpark so that multiple Lambda functions running in parallel could append to it using awswrangler's to_deltalake method. CloudWatch shows no errors for any of the invocations, yet only some of the data ends up in the table; the rest is silently missing.

How to Reproduce

import awswrangler as wr

# `data` is the pandas DataFrame built by each Lambda invocation.
wr.s3.to_deltalake(
    df=data,
    path="s3://bucket/delta",
    index=False,
    partition_cols=["a", "b"],
    overwrite_schema=False,
    s3_additional_kwargs={
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
        "AWS_REGION": "eu-west-1",
    },
    s3_allow_unsafe_rename=True,
    mode="append",
)

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.8

AWS SDK for pandas version

3.6.0

Additional context

No response

camposvinicius, Apr 11 '24 09:04

Hey,

When s3_allow_unsafe_rename is set to True, consistency will not be enforced between different simultaneous write operations. In order to make use of the locking mechanism, a DynamoDB table needs to be created and passed using the lock_dynamodb_table argument. More details can be found in the to_deltalake documentation.
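
For illustration, here is a minimal sketch of what such a call might look like. It assumes the installed awswrangler version accepts lock_dynamodb_table as described above, and that the DynamoDB table already exists with the schema the to_deltalake documentation asks for. The helper names (deltalake_append_kwargs, append_to_delta) and the table name "delta_lock" are hypothetical and not part of the library.

```python
from typing import Any, Dict, List


def deltalake_append_kwargs(
    path: str, lock_table: str, partition_cols: List[str]
) -> Dict[str, Any]:
    """Build keyword arguments for an append that relies on DynamoDB locking.

    Hypothetical helper: it only assembles arguments for wr.s3.to_deltalake.
    """
    return {
        "path": path,
        "index": False,
        "mode": "append",
        "partition_cols": partition_cols,
        # Assumed to be the name of an existing DynamoDB table used for locking.
        "lock_dynamodb_table": lock_table,
        # Keep the unsafe rename disabled so concurrent writers are coordinated.
        "s3_allow_unsafe_rename": False,
    }


def append_to_delta(df, path: str, lock_table: str, partition_cols: List[str]) -> None:
    """Append df to the Delta table at path (needs awswrangler with deltalake support)."""
    import awswrangler as wr  # imported lazily so the helper above has no dependencies

    wr.s3.to_deltalake(df=df, **deltalake_append_kwargs(path, lock_table, partition_cols))
```

Each Lambda would then call something like append_to_delta(data, "s3://bucket/delta", "delta_lock", ["a", "b"]), with every invocation pointing at the same lock table.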

Best regards, Leon

LeonLuttenberger, May 15 '24 18:05

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot], Jul 16 '24 12:07