
`datafusion-cli`: Use correct S3 region if it is not specified

Open alamb opened this issue 6 months ago • 5 comments

Is your feature request related to a problem or challenge?

  • Part of https://github.com/apache/datafusion/issues/13456

I would like to make it as easy as possible to use datafusion-cli to query files on S3.

For example, after https://github.com/apache/datafusion/issues/16299 is merged I would like to be able to read from the ClickBench example datasets:

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';

However, when I run this I get the following error:

> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
Object Store error: Generic S3 error: Error performing HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet in 499.73175ms - Received redirect without LOCATION, this normally indicates an incorrectly configured region

The error does hint that the region is incorrectly configured, which is good. However, it doesn't tell me which region I need.

If I provide the correct region (eu-central-1) it works great:

> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet' OPTIONS ('aws.region' 'eu-central-1');
0 row(s) fetched.
Elapsed 1.182 seconds.

> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.780 seconds.

I noticed that DuckDB and ClickHouse do not require the region to be set. For example, in DuckDB:

v1.2.2 7c039464e4
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from read_parquet('s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet');
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1000000     │
│ (1.00 million) │
└────────────────┘

Describe the solution you'd like

I would like datafusion-cli to find the region automatically as well.

I did some investigation and the correct region is returned in a response header, which you can see with curl:

curl -v -X HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
...
...
> HEAD /clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet HTTP/1.1
> Host: s3.us-east-1.amazonaws.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 301 Moved Permanently
< x-amz-bucket-region: eu-central-1
< x-amz-request-id: Q44G0APVQH5JHHC4
< x-amz-id-2: cubLiiba/Q138g5SbNNlSoGtARMxobuq7GhA+3t39il+Wj50HNPBUh4bOGVS2Bwlc6k4f0lp6r0=
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Fri, 06 Jun 2025 14:19:57 GMT
< Server: AmazonS3

Note the x-amz-bucket-region in the response:

< x-amz-bucket-region: eu-central-1

I suspect this will need a change upstream in the object_store crate, so I will work on filing an upstream ticket now.
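To make the header lookup above concrete, here is a minimal sketch of pulling the region out of the redirect response. It is only an illustration, not how object_store does it: `bucket_region_from_headers` is a hypothetical helper that works on raw header text (such as the `curl -v` output above), and a real client would read the header from its HTTP response type instead.

```rust
/// Hypothetical helper (not part of object_store or datafusion-cli):
/// find the `x-amz-bucket-region` header in raw response header text.
fn bucket_region_from_headers(raw: &str) -> Option<String> {
    raw.lines().find_map(|line| {
        // `curl -v` prefixes response headers with "< "; tolerate that prefix.
        let line = line.trim().trim_start_matches('<').trim();
        // Lines without a colon (e.g. the status line) are skipped.
        let (name, value) = line.split_once(':')?;
        // HTTP header names are case-insensitive.
        if name.trim().eq_ignore_ascii_case("x-amz-bucket-region") {
            let value = value.trim();
            if value.is_empty() {
                None
            } else {
                Some(value.to_string())
            }
        } else {
            None
        }
    })
}
```

Given the 301 response shown above, this helper would return the `eu-central-1` value from the header line.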

Describe alternatives you've considered

No response

Additional context

Upstream ticket

  • https://github.com/apache/arrow-rs-object-store/issues/402

alamb avatar Jun 06 '25 14:06 alamb

  • Filed https://github.com/apache/arrow-rs-object-store/issues/402

alamb avatar Jun 06 '25 14:06 alamb

take

liamzwbao avatar Jun 13 '25 00:06 liamzwbao

Hi @alamb, from the upstream ticket, I think we can use resolve_bucket_region to get the region if it's not specified.

However, I'm wondering what the expected behavior should be if the user explicitly specifies the region. Here are several options:

  1. Do not resolve the bucket region if one is already provided. The request would fail if the user set the wrong region (which may not be a good user experience).
  2. Always resolve the bucket to get the right region. This would make the region config useless and would probably have a performance cost (2 requests on average).
  3. Retry when the region specified by the user is wrong (1 request for a correct region, 3 requests for a wrong region).

Would appreciate your thoughts on this. Thank you!

liamzwbao avatar Jun 13 '25 01:06 liamzwbao

"Retry when the region specified by the user is wrong (1 request for a correct region, 3 requests for a wrong region)."

This would be my preferred behavior for datafusion-cli

We could also add a WARN-style message (log::warn) hinting that the configured region was incorrect and that datafusion-cli is finding the correct one automatically, but that this is slow.
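For illustration, the retry behavior preferred here could be sketched roughly as follows. Everything in this sketch is an assumption: `HeadError`, `Resolved`, and `head_with_region_retry` are hypothetical names, the `head` and `resolve` closures stand in for the real S3 HEAD request and for a region lookup such as the `resolve_bucket_region` function mentioned above, and `eprintln!` stands in for `log::warn!` to keep the example dependency-free.

```rust
/// Hypothetical error type: distinguishes a region mismatch from other failures.
#[derive(Debug, PartialEq)]
enum HeadError {
    WrongRegion,
    Other(String),
}

/// The region that finally worked and how many requests it took to get there.
#[derive(Debug, PartialEq)]
struct Resolved {
    region: String,
    requests: u32,
}

/// Try the configured region first; only on a region mismatch, look up the
/// real region and retry once. `head` and `resolve` are stand-ins (assumptions)
/// for the real network calls.
fn head_with_region_retry<H, R>(
    configured: &str,
    mut head: H,
    mut resolve: R,
) -> Result<Resolved, HeadError>
where
    H: FnMut(&str) -> Result<(), HeadError>,
    R: FnMut() -> Result<String, HeadError>,
{
    // Request 1: HEAD against the configured region.
    match head(configured) {
        Ok(()) => {
            return Ok(Resolved { region: configured.to_string(), requests: 1 });
        }
        Err(HeadError::WrongRegion) => {}
        // Unrelated errors are returned as-is, without retrying.
        Err(other) => return Err(other),
    }

    // Stand-in for log::warn!
    eprintln!(
        "WARN: region '{configured}' is incorrect; resolving the bucket region automatically (this is slower)"
    );

    // Request 2: resolve the bucket's actual region.
    let actual = resolve()?;
    // Request 3: retry the HEAD against the resolved region.
    head(&actual)?;
    Ok(Resolved { region: actual, requests: 3 })
}
```

With a correct region the function returns after a single request. With a wrong region it makes three, matching the counts given for this option earlier in the thread.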

alamb avatar Jun 13 '25 13:06 alamb

Thank you for starting to look into it @liamzwbao

alamb avatar Jun 13 '25 13:06 alamb