GET Request fails when using Alluxio S3 API, same request succeeds when AWS S3 API is used directly
Alluxio Version: enterprise-2.8.0-2.0
Describe the bug: An S3 GET request fails with a 404 response when issued through the Alluxio S3 proxy endpoint. The same request succeeds when the AWS S3 API is used directly (a minimal SDK sketch of the failing request is included after the two traces below).
OK Response from S3 API
- http-outgoing-1 >> GET /REPLACE_BUCKET_NAME/?list-type=2&delimiter=%2F&max-keys=2&prefix=REPLACE_WITH_PREFIX%2F&fetch-owner=false HTTP/1.1
- http-outgoing-1 >> Host: s3.us-west-2.amazonaws.com
- http-outgoing-1 >> amz-sdk-invocation-id: ...
- http-outgoing-1 >> amz-sdk-request: ...
- http-outgoing-1 >> amz-sdk-retry: 0/0/500
- http-outgoing-1 >> Authorization: ...
- http-outgoing-1 >> Content-Type: application/octet-stream
- http-outgoing-1 >> Content-Length: 0
- http-outgoing-1 >> Connection: Keep-Alive
- http-outgoing-1 >> [\r][\n]
- http-outgoing-1 << HTTP/1.1 200 OK
- http-outgoing-1 << x-amz-id-2: ...
- http-outgoing-1 << x-amz-request-id: ..
- http-outgoing-1 << Date: Thu, 07 Jul 2022 05:46:22 GMT
- http-outgoing-1 << x-amz-bucket-region: ...
- http-outgoing-1 << Content-Type: application/xml
- http-outgoing-1 << Transfer-Encoding: chunked
- http-outgoing-1 << Server: AmazonS3
- http-outgoing-1 << [\r][\n]
- http-outgoing-1 << *HTTP/1.1 200 OK*
404 Response from Alluxio S3 API For Same Request
- http-outgoing-0 >> GET /api/v1/s3/REPLACE_BUCKET_NAME/?list-type=2&delimiter=%2F&max-keys=2&prefix=REPLACE_WITH_PREFIX%2F&fetch-owner=false HTTP/1.1[\r][\n]
- http-outgoing-0 >> Host: api.g.....com:39999[\r][\n]
- http-outgoing-0 >> amz-sdk-invocation-id: ...
- http-outgoing-0 >> amz-sdk-request: ...
- http-outgoing-0 >> amz-sdk-retry: 0/0/500
- http-outgoing-0 >> Authorization: ...
- http-outgoing-0 >> Content-Type: application/octet-stream
- http-outgoing-0 >> Content-Length: 0
- http-outgoing-0 >> Connection: Keep-Alive
- http-outgoing-0 >> [\r][\n]
- http-outgoing-0 << HTTP/1.1 404 Not Found
- http-outgoing-0 << Date: Thu, 07 Jul 2022 05:48:00 GMT[\r][\n]
- http-outgoing-0 << Content-Type: application/xml[\r][\n]
- http-outgoing-0 << Content-Length: 196[\r][\n]
- http-outgoing-0 << Server: Jetty(9.4.43.v20210629)[\r][\n]
- http-outgoing-0 << [\r][\n]
- http-outgoing-0 << <Error><RequestId></RequestId><Code>NoSuchBucket</Code><Message>Path /REPLACE_BUCKET_NAME/REPLACE_WITH_PREFIX does not exist.</Message><Resource>REPLACE_BUCKET_NAME</Resource></Error>
- http-outgoing-0 << *HTTP/1.1 404 Not Found*
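For reference, the failing request above is an S3 ListObjectsV2 call. Below is a minimal sketch of issuing that same call against the Alluxio S3 proxy with the AWS SDK for Java v1; the endpoint host, region, credentials, bucket, and prefix are placeholders rather than values taken from this report.

```scala
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request

// Point the client at the Alluxio S3 proxy instead of AWS S3 (placeholder host and credentials).
val s3 = AmazonS3ClientBuilder.standard()
  .withEndpointConfiguration(
    new EndpointConfiguration("http://alluxio-master:39999/api/v1/s3", "us-west-2"))
  .withCredentials(
    new AWSStaticCredentialsProvider(new BasicAWSCredentials("alluxio", "dummy")))
  .withPathStyleAccessEnabled(true)
  .build()

// Mirrors the logged query string: list-type=2, delimiter=/, max-keys=2, prefix=..., fetch-owner=false.
val request = new ListObjectsV2Request()
  .withBucketName("REPLACE_BUCKET_NAME")
  .withPrefix("REPLACE_WITH_PREFIX/")
  .withDelimiter("/")
  .withMaxKeys(2)
  .withFetchOwner(false)

// Against AWS S3 this returns 200 OK; against the Alluxio proxy it fails with 404 NoSuchBucket.
val result = s3.listObjectsV2(request)
result.getObjectSummaries.forEach(s => println(s.getKey))
```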
To Reproduce (steps to reproduce the behavior, as minimally and precisely as possible):
- Create a 1:1 mapped S3 bucket mount such that s3://some_bucket in Alluxio maps to s3://some_bucket on S3 (e.g. `alluxio fs mount /some_bucket s3://some_bucket/`).
- Try to read/write that bucket from Spark through the Alluxio S3 endpoint (see Additional context below).

Expected behavior: A similar response (200 OK), as returned by the S3 API.
Urgency: Critical; this blocks creation of files from Spark.
Are you planning to fix it? TBD
Additional context: Trying to use Alluxio to read/write data from Spark. Writing data from Spark works fine when the s3a endpoint is not overridden, but fails with the 404 response above when the Alluxio S3 API endpoint is used. A sketch of the two configurations being compared follows.
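For illustration, here is a minimal sketch of the two write paths being compared. The bucket, output paths, and Alluxio master host are placeholders, not the exact values from this job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-s3-proxy-repro").getOrCreate()
val conf = spark.sparkContext.hadoopConfiguration
val df = spark.range(10).toDF("id") // stand-in for the real dataset

// Case 1 (works): s3a writes straight to AWS S3 using the default endpoint.
df.write.mode("overwrite").parquet("s3a://some_bucket/direct-output/")

// Case 2 (fails with the 404 above): s3a pointed at the Alluxio S3 proxy.
// In practice this would run as a separate job so the cached s3a FileSystem
// is not reused with the old endpoint settings.
conf.set("fs.s3a.endpoint", "http://<alluxio-master>:39999/api/v1/s3")
conf.set("fs.s3a.path.style.access", "true")
conf.set("fs.s3a.connection.ssl.enabled", "false") // the proxy endpoint is plain HTTP
df.write.mode("overwrite").parquet("s3a://some_bucket/proxy-output/")
```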
@abmo-x thanks for reporting. @ZhuTopher can you take a look?
@abmo-x I believe this issue has been resolved by this PR #15965.
I tried to replicate your setup with EMR as the Hadoop & Spark cluster. You can see my reproduction steps here.
- I used an S3 root UFS mount rather than a sub-directory mount.
@ZhuTopher Thanks for the update! Looking forward to the new release.
@abmo-x Would you be able to provide more details about the Access Key you provided and maybe the request code?
e.g. if it is a Spark job, then something like:
val hadoop_conf = spark.sparkContext.hadoopConfiguration
// Use the s3a connector, but route requests to the Alluxio S3 proxy instead of AWS S3.
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// The Alluxio proxy treats the access key as the Alluxio username; the secret key is a dummy value.
hadoop_conf.set("fs.s3a.access.key", "alluxio")
hadoop_conf.set("fs.s3a.secret.key", "dummy value")
// Path-style access keeps the bucket name in the URL path rather than in the hostname.
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.endpoint", "http://{ALLUXIO_MASTER_PRIVATE_IP}:39999/api/v1/s3")
val df = spark.read.csv("s3a://testbucket/customers.csv")
df.show()
df.rdd.saveAsTextFile("s3a://testbucket/copy.csv")
Also, please provide the logs in logs/proxy.log.
> @abmo-x I believe this issue has been resolved by this PR #15965.
> I tried to replicate your setup with EMR as the Hadoop & Spark cluster. You can see my reproduction steps here.
> - I used an S3 root UFS mount rather than a sub-directory mount.
This hasn't been resolved by the PR quoted above; instead, here is a PR which should fix it: https://github.com/Alluxio/alluxio/pull/16074
- For some reason, the bundled EMR Spark behaves differently from open-source Apache Spark, so I couldn't reproduce this in my EMR environment.