GET Request fails when using Alluxio S3 API, same request succeeds when AWS S3 API is used directly
Alluxio Version: enterprise-2.8.0-2.0
Describe the bug: An S3 GET request fails with a 404 response when issued through the Alluxio S3 proxy endpoint. The same request succeeds when the AWS S3 API is used directly (a minimal SDK sketch of the failing request is included after the two traces below).
OK Response from S3 API
- http-outgoing-1 >> GET /REPLACE_BUCKET_NAME/?list-type=2&delimiter=%2F&max-keys=2&prefix=REPLACE_WITH_PREFIX%2F&fetch-owner=false HTTP/1.1
- http-outgoing-1 >> Host: s3.us-west-2.amazonaws.com
- http-outgoing-1 >> amz-sdk-invocation-id: ...
- http-outgoing-1 >> amz-sdk-request: ...
- http-outgoing-1 >> amz-sdk-retry: 0/0/500
- http-outgoing-1 >> Authorization: ...
- http-outgoing-1 >> Content-Type: application/octet-stream
- http-outgoing-1 >> Content-Length: 0
- http-outgoing-1 >> Connection: Keep-Alive
- http-outgoing-1 >> [\r][\n]
- http-outgoing-1 << HTTP/1.1 200 OK
- http-outgoing-1 << x-amz-id-2: ...
- http-outgoing-1 << x-amz-request-id: ..
- http-outgoing-1 << Date: Thu, 07 Jul 2022 05:46:22 GMT
- http-outgoing-1 << x-amz-bucket-region: ...
- http-outgoing-1 << Content-Type: application/xml
- http-outgoing-1 << Transfer-Encoding: chunked
- http-outgoing-1 << Server: AmazonS3
- http-outgoing-1 << [\r][\n]
- http-outgoing-1 << *HTTP/1.1 200 OK*
404 Response from Alluxio S3 API For Same Request
- http-outgoing-0 >> GET /api/v1/s3/REPLACE_BUCKET_NAME/?list-type=2&delimiter=%2F&max-keys=2&prefix=REPLACE_WITH_PREFIX%2F&fetch-owner=false HTTP/1.1[\r][\n]
- http-outgoing-0 >> Host: api.g.....com:39999[\r][\n]
- http-outgoing-0 >> amz-sdk-invocation-id: ...
- http-outgoing-0 >> amz-sdk-request: ...
- http-outgoing-0 >> amz-sdk-retry: 0/0/500
- http-outgoing-0 >> Authorization: ...
- http-outgoing-0 >> Content-Type: application/octet-stream
- http-outgoing-0 >> Content-Length: 0
- http-outgoing-0 >> Connection: Keep-Alive
- http-outgoing-0 >> [\r][\n]
- http-outgoing-0 << HTTP/1.1 404 Not Found
- http-outgoing-0 << Date: Thu, 07 Jul 2022 05:48:00 GMT[\r][\n]
- http-outgoing-0 << Content-Type: application/xml[\r][\n]
- http-outgoing-0 << Content-Length: 196[\r][\n]
- http-outgoing-0 << Server: Jetty(9.4.43.v20210629)[\r][\n]
- http-outgoing-0 << [\r][\n]
- http-outgoing-0 << <Error><RequestId></RequestId><Code>NoSuchBucket</Code><Message>Path /REPLACE_BUCKET_NAME/REPLACE_WITH_PREFIX does not exist.</Message><Resource>REPLACE_BUCKET_NAME</Resource></Error>
- http-outgoing-0 << *HTTP/1.1 404 Not Found*
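For reference, the failing request above is an S3 ListObjectsV2 call. Below is a minimal sketch of issuing that same call against the Alluxio S3 proxy with the AWS SDK for Java v1; the endpoint host, region, credentials, bucket, and prefix are placeholders rather than values taken from this report.

```scala
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request

// Point the client at the Alluxio S3 proxy instead of AWS S3 (placeholder host and credentials).
val s3 = AmazonS3ClientBuilder.standard()
  .withEndpointConfiguration(
    new EndpointConfiguration("http://alluxio-master:39999/api/v1/s3", "us-west-2"))
  .withCredentials(
    new AWSStaticCredentialsProvider(new BasicAWSCredentials("alluxio", "dummy")))
  .withPathStyleAccessEnabled(true)
  .build()

// Mirrors the logged query string: list-type=2, delimiter=/, max-keys=2, prefix=..., fetch-owner=false.
val request = new ListObjectsV2Request()
  .withBucketName("REPLACE_BUCKET_NAME")
  .withPrefix("REPLACE_WITH_PREFIX/")
  .withDelimiter("/")
  .withMaxKeys(2)
  .withFetchOwner(false)

// Against AWS S3 this returns 200 OK; against the Alluxio proxy it fails with 404 NoSuchBucket.
val result = s3.listObjectsV2(request)
result.getObjectSummaries.forEach(s => println(s.getKey))
```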
To Reproduce (steps to reproduce the behavior, as minimally and precisely as possible):
- Create a 1:1 mapped S3 bucket mount such that s3://some_bucket in Alluxio maps to s3://some_bucket on S3 (e.g. `alluxio fs mount /some_bucket s3://some_bucket/`).
- Try to read/write that bucket from Spark through the Alluxio S3 endpoint (see Additional context below).

Expected behavior: A similar response (200 OK), as returned by the S3 API.
Urgency: Critical; this blocks creation of files from Spark.
Are you planning to fix it? TBD
Additional context: Trying to use Alluxio to read/write data from Spark. Writing data from Spark works fine when the s3a endpoint is not overridden, but fails with the 404 response above when the Alluxio S3 API endpoint is used. A sketch of the two configurations being compared follows.
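For illustration, here is a minimal sketch of the two write paths being compared. The bucket, output paths, and Alluxio master host are placeholders, not the exact values from this job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-s3-proxy-repro").getOrCreate()
val conf = spark.sparkContext.hadoopConfiguration
val df = spark.range(10).toDF("id") // stand-in for the real dataset

// Case 1 (works): s3a writes straight to AWS S3 using the default endpoint.
df.write.mode("overwrite").parquet("s3a://some_bucket/direct-output/")

// Case 2 (fails with the 404 above): s3a pointed at the Alluxio S3 proxy.
// In practice this would run as a separate job so the cached s3a FileSystem
// is not reused with the old endpoint settings.
conf.set("fs.s3a.endpoint", "http://<alluxio-master>:39999/api/v1/s3")
conf.set("fs.s3a.path.style.access", "true")
conf.set("fs.s3a.connection.ssl.enabled", "false") // the proxy endpoint is plain HTTP
df.write.mode("overwrite").parquet("s3a://some_bucket/proxy-output/")
```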
@abmo-x thanks for reporting. @ZhuTopher can you take a look?
@abmo-x I believe this issue has been resolved by this PR #15965.
I tried to replicate your setup with EMR as the Hadoop & Spark cluster. You can see my reproduction steps here.
- I used an S3 root UFS mount rather than a sub-directory mount.
@ZhuTopher Thanks for the update! Looking forward to the new release.
@abmo-x Would you be able to provide more details about the Access Key you provided and maybe the request code?
e.g. if it is a Spark job, then something like:
val hadoop_conf = spark.sparkContext.hadoopConfiguration
// Use the s3a connector, but route requests to the Alluxio S3 proxy instead of AWS S3.
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// The Alluxio proxy treats the access key as the Alluxio username; the secret key is a dummy value.
hadoop_conf.set("fs.s3a.access.key", "alluxio")
hadoop_conf.set("fs.s3a.secret.key", "dummy value")
// Path-style access keeps the bucket name in the URL path rather than in the hostname.
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.endpoint", "http://{ALLUXIO_MASTER_PRIVATE_IP}:39999/api/v1/s3")
val df = spark.read.csv("s3a://testbucket/customers.csv")
df.show()
df.rdd.saveAsTextFile("s3a://testbucket/copy.csv")
Also, please provide the logs in logs/proxy.log.
> @abmo-x I believe this issue has been resolved by this PR #15965.
> I tried to replicate your setup with EMR as the Hadoop & Spark cluster. You can see my reproduction steps here.
> - I used an S3 root UFS mount rather than a sub-directory mount.
This hasn't been resolved by the PR quoted above; instead, here is a PR which should fix it: https://github.com/Alluxio/alluxio/pull/16074
- For some reason, the bundled EMR Spark behaves differently from open-source Apache Spark, so I couldn't reproduce this in my EMR environment.