aws-sdk-pandas
`wr.s3.to_deltalake` throwing TypeError about storage_options
Describe the bug
Calling `wr.s3.to_deltalake()` throws the following error:
self._table = RawDeltaTable(
^^^^^^^^^^^^^^
TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'
How to Reproduce
wr.s3.to_deltalake(
df=data,
path="s3://bucket/delta",
index=False,
partition_cols=["a", "b"],
overwrite_schema=True,
s3_additional_kwargs={
"AWS_ACCESS_KEY_ID": "...",
"AWS_SECRET_ACCESS_KEY": "...",
"AWS_REGION": "eu-west-1",
},
s3_allow_unsafe_rename=True,
)
Expected behavior
I'd expect awswrangler to connect to S3 and write the delta table.
Your project
No response
Screenshots
No response
OS
Mac
Python version
3.11.4
AWS SDK for pandas version
3.2.1
Additional context
No response
The s3_additional_kwargs argument is for passing S3-specific arguments such as ServerSideEncryption, not your AWS credentials. The boto3 session is used to extract the credentials and the region, so as long as that is correctly configured and passed, it should be enough:
boto3_session = boto3.Session(region_name="eu-west-1")
wr.s3.to_deltalake(path=path, df=df, boto3_session=boto3_session, s3_additional_kwargs={"ServerSideEncryption": "AES256"})
I did just that ...
boto3_session = boto3.Session(region_name="eu-west-1") # Yes, the region is correct
wrangler.s3.to_deltalake(
df=data,
path="s3://mybucket/delta", # Yes, the bucket exists
index=False,
partition_cols=["a", "b"],
overwrite_schema=True,
boto3_session=boto3_session,
s3_allow_unsafe_rename=True,
)
But I keep getting the same error:
TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'
I enabled logging (INFO level) with:
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.INFO)
And I see this in the logs just before the error above...
[botocore.credentials][load] Found credentials in environment variables.
on 0: {'AWS_ACCESS_KEY_ID': 'REDACT', 'AWS_SECRET_ACCESS_KEY': 'REDACT', 'AWS_SESSION_TOKEN': None, 'PROFILE_NAME': 'default', 'AWS_REGION': 'eu-west-1', 'AWS_S3_ALLOW_UNSAFE_RENAME': 'TRUE'}
Any further suggestions on how to fix this?
Here's the error trace:
Traceback (most recent call last):
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/v2p.py", line 304, in <module>
process(chunk)
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/v2p.py", line 255, in process
wrangler.s3.to_deltalake(
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/_utils.py", line 122, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/annotations.py", line 44, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/s3/_write_deltalake.py", line 104, in to_deltalake
deltalake.write_deltalake(
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 147, in write_deltalake
table, table_uri = try_get_table_and_table_uri(table_or_uri, storage_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 392, in try_get_table_and_table_uri
table = try_get_deltatable(table_or_uri, storage_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 405, in try_get_deltatable
return DeltaTable(table_uri, storage_options=storage_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/table.py", line 122, in __init__
self._table = RawDeltaTable(
Hmm, strange, I am unable to replicate this error on my local machine:
boto3_session = boto3.Session(region_name="us-east-1")
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [True, False, True], "par0": ["foo", "foo", "bar"], "par1": [1, 2, 2]})
wr.s3.to_deltalake(
path=path,
df=df,
index=False,
boto3_session=boto3_session,
partition_cols=["par0", "par1"],
overwrite_schema=True,
s3_allow_unsafe_rename=True,
)
df2 = wr.s3.read_deltalake(path=path, columns=["c0"], partitions=[("par0", "=", "foo"), ("par1", "=", "1")])
assert df2.shape == (1, 1)
works fine.
Could you share your pip freeze? I imagine you are on deltalake 0.9.0?
Also, please try to make the call directly using the deltalake library, which is pretty much what we do under the hood. If that also fails, it might be worth opening an issue in delta-rs directly.
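For reference, a minimal sketch of such a direct call (the path, region and partition column are placeholders; credentials are assumed to come from the environment):
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"c0": [1, 2, 3], "par0": ["foo", "foo", "bar"]})

# Every value must be a plain string; a None here reproduces the PyString error.
storage_options = {
    "AWS_REGION": "eu-west-1",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

write_deltalake(
    "s3://bucket/delta",  # placeholder path
    df,
    partition_by=["par0"],
    mode="overwrite",
    storage_options=storage_options,
)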
I receive the same error here:
wr.s3.to_deltalake(
    df=df_delta,
    path=s3_path,
    mode="overwrite",
    partition_cols=partition_cols,
    index=False,
    overwrite_schema=True,
    s3_allow_unsafe_rename=True,
)
argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'
Please reopen this ticket. This issue still happens in version 3.3.0.
I've seen that the error happens when we use a boto3 session. I don't know whether it's a delta-rs issue or an awswrangler issue.
I'm using Poetry as a dependency manager, and my pyproject.toml is:
[tool.poetry.dependencies]
python = "3.10.11"
awswrangler = "3.3.0"
boto3 = "1.27.1"
pyarrow = "12.0.1"
duckdb = "0.8.1"
pandas = "2.0.3"
deltalake = "0.10.0"
jsonschema = "4.18.0"
requests = "2.31.0"
pyyaml = "6.0.1"
ipykernel = "6.24.0"
pyspark = "3.4.0"
delta-spark = "2.4.0"
sagemaker = "2.72"
findspark = "2.0.1"
msal = "1.22.0"
great-expectations = "0.17.7"
hvac = "1.1.1"
I kindly ask you to reopen this issue; I am facing the very same problem. I'm using version 3.4.1.
I am facing this issue with wr.s3.read_deltalake as follows:
df = wr.s3.read_deltalake(path=label_path, columns=[label_field], boto3_session=session)
This issue can occur when no region is associated with the profile in ~/.aws/config. Running aws configure and providing a default region fixes this.
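A quick way to check this from Python (a diagnostic only, not part of awswrangler):
import boto3

# If this prints None, no region is configured for the session/profile,
# and the resulting storage option cannot be converted to a string.
print(boto3.Session().region_name)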
to_deltalake() then pulls a None value from the boto3_session, which can't be cast to a PyString, as the exception shows.
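For illustration, here's a rough sketch (not awswrangler's actual code) of how storage options built from a boto3 session can pick up None values:
import boto3

session = boto3.Session()
creds = session.get_credentials().get_frozen_credentials()

storage_options = {
    "AWS_ACCESS_KEY_ID": creds.access_key,
    "AWS_SECRET_ACCESS_KEY": creds.secret_key,
    "AWS_SESSION_TOKEN": creds.token,   # None for long-lived IAM user keys
    "AWS_REGION": session.region_name,  # None if no region is configured
}

# The Rust binding behind RawDeltaTable expects every value to be a string,
# so any None in this dict triggers the PyString conversion error.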
@leodido In your case, it could be the AWS_SESSION_TOKEN missing instead, considering the log you posted.
[botocore.credentials][load] Found credentials in environment variables. on 0: {'AWS_ACCESS_KEY_ID': 'REDACT', 'AWS_SECRET_ACCESS_KEY': 'REDACT', 'AWS_SESSION_TOKEN': None, 'PROFILE_NAME': 'default', 'AWS_REGION': 'eu-west-1', 'AWS_S3_ALLOW_UNSAFE_RENAME': 'TRUE'}
I had this problem too. It looks to me like 'AWS_SESSION_TOKEN' being None is clashing with the constructor expecting a dictionary of strings. (Perhaps the underlying Rust code does not handle None for a string?)
I tried this workaround in the __init__ method of the DeltaTable class, prior to the creation of the RawDeltaTable:
# replace None with empty string
if 'AWS_SESSION_TOKEN' in storage_options and storage_options['AWS_SESSION_TOKEN'] is None:
    storage_options['AWS_SESSION_TOKEN'] = ''
It stopped the type error and having a blank value in the AWS_SESSION_TOKEN did not cause a problem, as the write completed without error.
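An equivalent workaround at the call site, without patching the installed deltalake package, is to drop None values from whatever storage options you pass when calling deltalake directly (hypothetical helper name, just to illustrate the idea):
def drop_none_values(options: dict) -> dict:
    # The binding only accepts string values, so remove anything unset.
    return {k: v for k, v in options.items() if v is not None}

storage_options = drop_none_values({
    "AWS_REGION": "eu-west-1",
    "AWS_SESSION_TOKEN": None,  # would otherwise trigger the PyString error
})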
I've had the same error in an AWS environment, while locally everything was working fine. I got it fixed by adding this before calling to_deltalake:
boto3.setup_default_session(region_name='us-east-1')
Thanks, vavaan, for your update. That didn't work for me. Before calling to_deltalake I currently have:
boto3_session = boto3.Session(region_name="ap-southeast-2")
I tried changing this to use boto3.setup_default_session(...) but it was the same either way. When the call is made to set up the RawDeltaTable, the AWS_SESSION_TOKEN is still set to None and the conversion to PyString fails. In my program this is my first use of the S3 session, so maybe if you've already done something with the session before trying to write, the token will have been set to a non-None value and it works. But it doesn't work for me even if I've created the session as shown above.
Thanks again for your response.
To fix the issue, I just set the unused AWS_SESSION_TOKEN to an empty string so it doesn't get passed as None when calling to_deltalake:
boto3_session = boto3.Session(region_name="us-east-1")
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [True, False, True], "par0": ["foo", "foo", "bar"], "par1": [1, 2, 2]})
wr.s3.to_deltalake(
path=path,
df=df,
index=False,
boto3_session=boto3_session,
partition_cols=["par0", "par1"],
overwrite_schema=True,
s3_allow_unsafe_rename=True,
s3_additional_kwargs={
'AWS_SESSION_TOKEN': ''
}
)
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.