
`wr.s3.to_deltalake` throwing TypeError about storage_options

leodido opened this issue 1 year ago • 14 comments

Describe the bug

Calling wr.s3.to_deltalake() throws the following error:

self._table = RawDeltaTable(
                  ^^^^^^^^^^^^^^
TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'

How to Reproduce

wr.s3.to_deltalake(
    df=data,
    path="s3://bucket/delta",
    index=False,
    partition_cols=["a", "b"],
    overwrite_schema=True,
    s3_additional_kwargs={
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
        "AWS_REGION": "eu-west-1",
    },
    s3_allow_unsafe_rename=True,
)

Expected behavior

I'd expect awswrangler to connect to S3 and write the delta table.

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.11.4

AWS SDK for pandas version

3.2.1

Additional context

No response

leodido avatar Jul 03 '23 22:07 leodido

The s3_additional_kwargs argument is for passing S3-specific arguments like ServerSideEncryption, not your AWS credentials. The boto3 session is used to extract the credentials and the region. So as long as that is correctly configured and passed, it should be enough:

boto3_session = boto3.Session(region_name="eu-west-1")
wr.s3.to_deltalake(path=path, df=df, boto3_session=boto3_session, s3_additional_kwargs={"ServerSideEncryption": "AES256"})

jaidisido avatar Jul 03 '23 22:07 jaidisido

Did it ...

boto3_session = boto3.Session(region_name="eu-west-1")  # Yes, the region is correct
wrangler.s3.to_deltalake(
    df=data,
    path="s3://mybucket/delta",  # Yes, the bucket exists
    index=False,
    partition_cols=["a", "b"],
    overwrite_schema=True,
    boto3_session=boto3_session,
    s3_allow_unsafe_rename=True,
)

But I keep getting the same error:

TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'

I enabled logging (INFO level) with:

logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.INFO)

And I see this in logs just before the error above...

[botocore.credentials][load] Found credentials in environment variables.
on 0: {'AWS_ACCESS_KEY_ID': 'REDACT', 'AWS_SECRET_ACCESS_KEY': 'REDACT', 'AWS_SESSION_TOKEN': None, 'PROFILE_NAME': 'default', 'AWS_REGION': 'eu-west-1', 'AWS_S3_ALLOW_UNSAFE_RENAME': 'TRUE'}

Any further suggestions on how to fix this?

Here's the error trace:
Traceback (most recent call last):
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/v2p.py", line 304, in <module>
    process(chunk)
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/v2p.py", line 255, in process
    wrangler.s3.to_deltalake(
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/_utils.py", line 122, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/annotations.py", line 44, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/s3/_write_deltalake.py", line 104, in to_deltalake
    deltalake.write_deltalake(
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 147, in write_deltalake
    table, table_uri = try_get_table_and_table_uri(table_or_uri, storage_options)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 392, in try_get_table_and_table_uri
    table = try_get_deltatable(table_or_uri, storage_options)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 405, in try_get_deltatable
    return DeltaTable(table_uri, storage_options=storage_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/table.py", line 122, in __init__
    self._table = RawDeltaTable(

leodido avatar Jul 05 '23 23:07 leodido

Hmm strange, I am unable to replicate this error on my local:

boto3_session = boto3.Session(region_name="us-east-1")
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [True, False, True], "par0": ["foo", "foo", "bar"], "par1": [1, 2, 2]})
wr.s3.to_deltalake(
    path=path,
    df=df,
    index=False,
    boto3_session=boto3_session,
    partition_cols=["par0", "par1"],
    overwrite_schema=True,
    s3_allow_unsafe_rename=True,
)
df2 = wr.s3.read_deltalake(path=path, columns=["c0"], partitions=[("par0", "=", "foo"), ("par1", "=", "1")])
assert df2.shape == (1, 1)

works fine.

Could you share your pip freeze? I imagine you are on deltalake 0.9.0?

Also please try to make the call directly using the deltalake library, which is pretty much what we are doing. At which point it might be worth opening an issue in delta-rs directly.
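A minimal sketch of what such a direct call could look like, with storage_options built explicitly so no value can be None (bucket path, keys, and values below are placeholders, not real credentials):

```python
# Sketch: bypass awswrangler and hand deltalake an explicit
# storage_options dict. All values here are placeholders.
storage_options = {
    "AWS_ACCESS_KEY_ID": "PLACEHOLDER",
    "AWS_SECRET_ACCESS_KEY": "PLACEHOLDER",
    "AWS_REGION": "eu-west-1",
    "AWS_SESSION_TOKEN": "",  # empty string, never None
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

# Every value must be a string: the native layer rejects None,
# which is exactly the TypeError reported in this issue.
assert all(isinstance(v, str) for v in storage_options.values())

# With deltalake installed, the direct call would then be roughly:
# from deltalake import write_deltalake
# write_deltalake("s3://bucket/delta", df,
#                 partition_by=["a", "b"],
#                 storage_options=storage_options)
```

If the direct call fails with the same TypeError, that points at delta-rs rather than awswrangler.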

jaidisido avatar Jul 06 '23 09:07 jaidisido

I receive the same error here:

wr.s3.to_deltalake(
    df=df_delta,
    path=s3_path,
    mode="overwrite",
    partition_cols=partition_cols,
    index=False,
    overwrite_schema=True,
    s3_allow_unsafe_rename=True,
)

argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'

leandro-ferreira-farm avatar Oct 06 '23 05:10 leandro-ferreira-farm

Please reopen this ticket. This issue still happens in version 3.3.0.

leandro-ferreira-farm avatar Oct 06 '23 13:10 leandro-ferreira-farm

I saw that the error happens when we use a boto3.Session. I don't know if it's a delta-rs issue or an awswrangler issue.

leandro-ferreira-farm avatar Oct 06 '23 13:10 leandro-ferreira-farm

I'm using poetry as a dependency manager, and my pyproject.toml is:

[tool.poetry.dependencies]
python = "3.10.11"
awswrangler = "3.3.0"
boto3 = "1.27.1"
pyarrow = "12.0.1"
duckdb = "0.8.1"
pandas = "2.0.3"
deltalake = "0.10.0"
jsonschema = "4.18.0"
requests = "2.31.0"
pyyaml = "6.0.1"
ipykernel = "6.24.0"
pyspark = "3.4.0"
delta-spark = "2.4.0"
sagemaker = "2.72"
findspark = "2.0.1"
msal = "1.22.0"
great-expectations = "0.17.7"
hvac = "1.1.1"

leandro-ferreira-farm avatar Oct 06 '23 13:10 leandro-ferreira-farm

I gently ask to reopen this issue, I am facing the very same problem. I'm using version 3.4.1.

luis-fnogueira avatar Nov 13 '23 20:11 luis-fnogueira

I am facing this issue with wr.s3.read_deltalake as follows:

df = wr.s3.read_deltalake(path=label_path, columns=[label_field], boto3_session=session)

ZulqarnainB avatar Dec 20 '23 13:12 ZulqarnainB

This issue can occur when no region is associated with the profile in ~/.aws/config. Running aws configure and providing a default region fixes this.

to_deltalake() pulls a None value from the boto3_session which then can't be cast to a PyString as the exception shows.

@leodido In your case, it could be the AWS_SESSION_TOKEN missing instead, considering the log you posted.

[botocore.credentials][load] Found credentials in environment variables. on 0: {'AWS_ACCESS_KEY_ID': 'REDACT', 'AWS_SECRET_ACCESS_KEY': 'REDACT', 'AWS_SESSION_TOKEN': None, 'PROFILE_NAME': 'default', 'AWS_REGION': 'eu-west-1', 'AWS_S3_ALLOW_UNSAFE_RENAME': 'TRUE'}
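To see which entry in that dict is the None that trips the string conversion, one could scan it before handing it over. This is a hypothetical diagnostic helper (not part of awswrangler or deltalake); the key names mirror the logged dict above:

```python
def find_none_options(storage_options: dict) -> list:
    """Return the keys whose values are None and would fail the
    string conversion inside RawDeltaTable."""
    return [k for k, v in storage_options.items() if v is None]

# Mirroring the logged dict above (values redacted):
opts = {
    "AWS_ACCESS_KEY_ID": "REDACT",
    "AWS_SECRET_ACCESS_KEY": "REDACT",
    "AWS_SESSION_TOKEN": None,
    "AWS_REGION": "eu-west-1",
}
print(find_none_options(opts))  # → ['AWS_SESSION_TOKEN']
```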

neverlink avatar Jan 28 '24 12:01 neverlink

I had this problem too. It looks to me like the AWS_SESSION_TOKEN being None is clashing with the constructor expecting a dictionary of strings. (Perhaps the underlying Rust code does not handle None for a string?)

I tried this workaround in the __init__ method of the DeltaTable class, prior to the creation of the RawDeltaTable:

        # replace None with empty string
        if 'AWS_SESSION_TOKEN' in storage_options and storage_options['AWS_SESSION_TOKEN'] is None:
            storage_options['AWS_SESSION_TOKEN'] = ''

It stopped the type error and having a blank value in the AWS_SESSION_TOKEN did not cause a problem, as the write completed without error.
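Rather than editing deltalake's source, the same idea could be applied as a sanitizing step on the options dict before the call. A hypothetical helper, sketching the workaround above:

```python
def sanitize_storage_options(storage_options: dict) -> dict:
    """Replace None values with empty strings so every entry can be
    converted to a PyString by the native layer."""
    return {k: ("" if v is None else v) for k, v in storage_options.items()}

opts = sanitize_storage_options(
    {"AWS_SESSION_TOKEN": None, "AWS_REGION": "eu-west-1"}
)
print(opts)  # → {'AWS_SESSION_TOKEN': '', 'AWS_REGION': 'eu-west-1'}
```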

stuart-powell avatar Mar 06 '24 05:03 stuart-powell

I've had the same error in an AWS environment while locally everything was working fine. I got it fixed by adding this before calling to_deltalake:

boto3.setup_default_session(region_name='us-east-1')

vavaan avatar Mar 06 '24 08:03 vavaan

Thanks, vavaan, for your update. That didn't work for me. Before calling to_deltalake I currently have:

boto3_session = boto3.Session(region_name="ap-southeast-2")

I tried changing this to use boto3.setup_default_session(...) but it was the same either way. When the call is made to construct the RawDeltaTable, the AWS_SESSION_TOKEN is still set to None and the conversion to PyString fails. In my program this is my first use of the S3 session, so maybe if you've already done something with the session before trying to write, the token will have been set to a non-None value and it works. But it doesn't work for me even if I've created the session as shown above.

Thanks again for your response.

stuart-powell avatar Mar 06 '24 10:03 stuart-powell

I fixed the issue by setting the unused AWS_SESSION_TOKEN to an empty string, so it is not passed as None when calling to_deltalake:

boto3_session = boto3.Session(region_name="us-east-1")
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [True, False, True], "par0": ["foo", "foo", "bar"], "par1": [1, 2, 2]})
wr.s3.to_deltalake(
    path=path,
    df=df,
    index=False,
    boto3_session=boto3_session,
    partition_cols=["par0", "par1"],
    overwrite_schema=True,
    s3_allow_unsafe_rename=True,
    s3_additional_kwargs={
        'AWS_SESSION_TOKEN': ''
    }
)

tposlins avatar Apr 22 '24 16:04 tposlins

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] avatar Jun 21 '24 18:06 github-actions[bot]