AWS environment variables seem to take priority over storage_options

Open emanueledomingo opened this issue 2 years ago • 5 comments

Environment

Delta-rs version: 0.11.0

Binding: Python

Environment:

  • Cloud provider: AWS
  • OS: Ubuntu 22.04 LTS

Bug

What happened:

I'm not able to write to AWS S3 using AssumeRole. The same code works with deltalake 0.10.1.

What you expected to happen:

The write should use the credentials passed in storage_options instead of the ones in the environment.

How to reproduce it:

import os

import deltalake as dl
import pyarrow as pa  # 12.0.0
import boto3

aws_profile = "..."
aws_region = "..."
aws_role_arn = "..."

session = boto3.Session(profile_name=aws_profile, region_name=aws_region)
sts = session.client("sts")
cred = sts.assume_role(RoleArn=aws_role_arn, RoleSessionName="TestSession")["Credentials"]

pa_table = pa.Table.from_pylist([{'n_legs': 2, 'animals': 'Flamingo'}, {'year': 2021, 'animals': 'Centipede'}])
dt_table_uri = "..."  # s3://my-bucket/test/

storage_options = {
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    "AWS_ACCESS_KEY_ID": cred["AccessKeyId"],
    "AWS_SECRET_ACCESS_KEY": cred["SecretAccessKey"],
    "AWS_SESSION_TOKEN": cred["SessionToken"],
    "AWS_REGION": aws_region,
}

os.environ["AWS_ACCESS_KEY_ID"] = "WRONG_TOKEN"
os.environ["AWS_SECRET_ACCESS_KEY"] = "WRONG_TOKEN"
os.environ["AWS_SESSION_TOKEN"] = "WRONG_TOKEN"

dl.write_deltalake(
    table_or_uri=dt_table_uri,
    data=pa_table,
    mode="overwrite",
    storage_options=storage_options,
)

More details:

OSError
in try_get_table_and_table_uri(table_or_uri, storage_options)
    394     raise ValueError("table_or_uri must be a str, Path or DeltaTable")
    396 if isinstance(table_or_uri, (str, Path)):
--> 397     table = try_get_deltatable(table_or_uri, storage_options)
    398     table_uri = str(table_or_uri)
    399 else:

in try_get_deltatable(table_uri, storage_options)
    406 def try_get_deltatable(
    407     table_uri: Union[str, Path], storage_options: Optional[Dict[str, str]]
    408 ) -> Optional[DeltaTable]:
    409     try:
--> 410         return DeltaTable(table_uri, storage_options=storage_options)
    411     except TableNotFoundError:
    412         return None

in DeltaTable.__init__(self, table_uri, version, storage_options, without_files, log_buffer_size)
    228 """
    229 Create the Delta Table from a path with an optional version.
    230 Multiple StorageBackends are currently supported: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage (GCS) and local URI.
   (...)
    244 
    245 """
    246 self._storage_options = storage_options
--> 247 self._table = RawDeltaTable(
    248     str(table_uri),
    249     version=version,
    250     storage_options=storage_options,
    251     without_files=without_files,
    252     log_buffer_size=log_buffer_size,
    253 )
    254 self._metadata = Metadata(self._table)

OSError: Generic S3 error: response error "<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>WRONG_TOKEN</AWSAccessKeyId><RequestId>XXX</RequestId><HostId>XXX</HostId></Error>", after 0 retries: HTTP status client error (403 Forbidden) for url (XXX)

Delta tries to connect using the environment variables instead of the storage options.

emanueledomingo avatar Oct 04 '23 13:10 emanueledomingo

I haven't deep-dived too much into rust code as I'm still a beginner rustacean, but it seems to me that the issue arises from the following lines: https://github.com/delta-io/delta-rs/blob/4da7d66d06985d386e61d3fb124cad6680594dcc/rust/src/storage/config.rs#L163-L176

The comparison between the env key and the storage_options key is done on the lowercased version of the env key, so uppercase keys passed in storage_options never match and the environment value is merged in anyway, resulting in bugs like the one @emanueledomingo reported.
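To illustrate, here is a minimal Python sketch of the flawed merge logic (an assumption about what the linked Rust code does, not the actual implementation):

```python
def merge_env_into_options(options: dict, env: dict) -> dict:
    """Merge environment variables into user options (hypothetical sketch)."""
    merged = dict(options)
    for key, value in env.items():
        lowered = key.lower()       # env key is lowercased before the check...
        if lowered not in merged:   # ...but user keys are UPPERCASE, so no match
            merged[lowered] = value # env value sneaks in under a lowercase key
    return merged

opts = {"AWS_ACCESS_KEY_ID": "from_storage_options"}
env = {"AWS_ACCESS_KEY_ID": "WRONG_TOKEN"}
merged = merge_env_into_options(opts, env)
# merged now holds BOTH "AWS_ACCESS_KEY_ID" and "aws_access_key_id";
# downstream, the lowercased env entry can win.
```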

giacomorebecchi avatar Oct 04 '23 15:10 giacomorebecchi

Given the above comment, a temporary workaround is to pass already-lowercased keys in storage_options.

giacomorebecchi avatar Oct 04 '23 15:10 giacomorebecchi
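For example, the workaround could look like this (a sketch; `lower_keys` is a hypothetical helper, not part of the deltalake API):

```python
def lower_keys(options: dict) -> dict:
    """Return a copy of the options dict with all keys lowercased."""
    return {key.lower(): value for key, value in options.items()}

# Lowercased keys match the lowercased env keys, so the env values
# are no longer merged in on top of them.
storage_options = lower_keys({
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    "AWS_REGION": "eu-west-1",  # placeholder value
})
```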

Looks like it was introduced here: https://github.com/delta-io/delta-rs/commit/02b3cea58505212454509828660dd534eb2ba6e6.

A possible fix would be to use a case-insensitive hash map, unless there is a scenario where key case matters (I don't see one). If a case-insensitive hash map is not a good idea, then change the logic in the `with...` methods to loop through the keys performing a case-insensitive comparison instead of using `contains_key`.

r3stl355 avatar Oct 07 '23 16:10 r3stl355
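The case-insensitive map idea can be sketched in Python (a minimal illustration of the approach, not a proposed patch; the real fix would be in Rust):

```python
class CaseInsensitiveDict(dict):
    """Minimal dict whose keys are compared by their lowercase form.

    Note: this sketch only intercepts item access; the dict() constructor
    and update() would also need overriding in a real implementation.
    """

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def __contains__(self, key):
        return super().__contains__(key.lower())

opts = CaseInsensitiveDict()
opts["AWS_ACCESS_KEY_ID"] = "from_storage_options"
# A lowercased env-style lookup now matches the uppercase user key,
# so the env value would correctly be skipped.
assert "aws_access_key_id" in opts
```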

I haven't checked the codebase fully, but did some work on that recently. Specifically, the S3 code is still quite messy. IIRC the priority issue is actually that, if multiple credentials are available, priority is based on the order in which object_store checks credentials.

Unfortunately we cannot just try to get a credential, since there are configuration-less credentials (i.e. the metadata endpoint), which means the logic to check whether there is a valid credential outside the environment is a bit more involved. I tried once, but ended up "guessing" too much at what the user's intent may be, while also not being too familiar with how AWS does things.

That said, we definitely need to do some work on the s3 config...

roeap avatar Oct 17 '23 20:10 roeap

removed, different issue, unrelated

dispanser avatar Dec 12 '23 09:12 dispanser

I stopped getting this error in deltalake 0.16.1.

NikoJ avatar Mar 19 '24 15:03 NikoJ

Yes, I confirm that on 0.16.0 the error also does not occur. This issue can be closed.

emanueledomingo avatar Mar 22 '24 16:03 emanueledomingo