
Debug logs for S3 driver, log requests and responses

Open minotru opened this issue 11 months ago • 9 comments

Hello tensorstore team,

I am trying to use the S3 driver with an S3-compatible storage implementation (i.e. I have a custom endpoint_url and aws_region). Unfortunately, it did not work for me out of the box: I am getting a 405 HTTP error code.

From the error message, I can't figure out whether tensorstore goes to the right endpoint or what request is actually being performed.

Is there a way to enable logging of all HTTP requests and responses for S3 driver?

I am using Python package tensorstore==0.1.52.

Thanks!


Minimal example

Here I took an example from the docs but replaced the file driver with the s3 driver (endpoint, aws_region, and bucket are masked in the snippet):

import tensorstore as ts
import numpy as np
dataset = ts.open({
    'driver': 'n5',
    'kvstore': {
        'driver': 's3',
        'bucket': '<<my bucket>>',
        'path': 'home/sbkarasik/junk/ts-test-v2',
        'endpoint': '<<my endpoint>>',
        'aws_region': '<<my region>>',
    },
    'metadata': {
        'compression': {
            'type': 'gzip'
        },
        'dataType': 'uint32',
        'dimensions': [1000, 20000],
        'blockSize': [100, 100],
    },
    'create': True,
    # 'delete_existing': True,
}).result()

And the output is:

ValueError: UNKNOWN: Error opening "n5" driver: Error writing "home/sbkarasik/junk/ts-test-v2/attributes.json": HTTP response code: 405 with body:  [tensorstore_spec='{"context":{"aws_credentials":{"filename":"","metadata_endpoint":"","profile":""},"cache_pool":{},"data_copy_concurrency":{},"s3_request_concurrency":{},"s3_request_retries":{}},"create":true,"driver":"n5","dtype":"uint32","kvstore":{"aws_region":"<<my region>>","bucket":"<<my bucket>>","driver":"s3","endpoint":"<<my endpoint url>>","path":"home/sbkarasik/junk/ts-test-v2/"},"metadata":{"blockSize":[100,100],"compression":{"level":-1,"type":"gzip","useZlib":false},"dataType":"uint32","dimensions":[1000,20000]},"transform":{"input_exclusive_max":[[1000],[20000]],"input_inclusive_min":[0,0]}}'] [source locations='tensorstore/kvstore/s3/s3_key_value_store.cc:779
tensorstore/kvstore/kvstore.cc:378
tensorstore/driver/driver.cc:114']

minotru avatar Mar 03 '24 12:03 minotru

Well, I missed the obvious thing to try, and only now found that there is an env flag, TENSORSTORE_CURL_VERBOSE.
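
For reference, this is roughly how I enabled it from Python (I am assuming the variable needs to be set before tensorstore makes any requests; exporting it in the shell works too):

import os

# Enable verbose libcurl logging of every HTTP request/response.
os.environ["TENSORSTORE_CURL_VERBOSE"] = "1"

import tensorstore as ts  # imported after setting the env var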

Now I see that the s3 driver seems to ignore the provided endpoint at some point: it sends requests with Host: s3.<<my region>>.amazonaws.com instead of my endpoint's host.

Here is the full log for the example above, with TENSORSTORE_CURL_VERBOSE enabled and some sensitive parts masked:

< HTTP/2 404 
< server: nginx
< date: Sun, 03 Mar 2024 12:57:13 GMT
< content-type: text/html; charset=utf-8
< x-amz-request-id: 6dd64559c7556f83
< content-encoding: gzip
< 
* Connection #1 to host <<my endpoint>> left intact
* Found bundle for host: 0x7f7ed00813d0 [can multiplex]
* Re-using existing connection with host <<my endpoint>>
* [HTTP/2] [3] OPENED stream for https://<<my endpoint>>/<<my bucket>>/home/sbkarasik/junk/ts-test-v2/attributes.json
* [HTTP/2] [3] [:method: HEAD]
* [HTTP/2] [3] [:scheme: https]
* [HTTP/2] [3] [:authority: s3.<<my region>>.amazonaws.com]
* [HTTP/2] [3] [:path: /<<my bucket>>/home/sbkarasik/junk/ts-test-v2/attributes.json]
* [HTTP/2] [3] [user-agent: tensorstore/0.1 libcurl/8.4.0 BoringSSL zlib/1.2.13 brotli/1.0.9 nghttp2/1.55.0]
* [HTTP/2] [3] [accept: */*]
* [HTTP/2] [3] [if-match: ""]
* [HTTP/2] [3] [x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855]
* [HTTP/2] [3] [x-amz-date: 20240303T125713Z]
* [HTTP/2] [3] [authorization: AWS4-HMAC-SHA256 Credential=<<my aws_access_key_id>>/20240303/<<my region>>/s3/aws4_request,SignedHeaders=host;if-match;x-amz-content-sha256;x-amz-date,Signature=<<my computed signature>>]
> HEAD /<<my bucket>>/home/sbkarasik/junk/ts-test-v2/attributes.json HTTP/2
Host: s3.<<my region>>.amazonaws.com
User-Agent: tensorstore/0.1 libcurl/8.4.0 BoringSSL zlib/1.2.13 brotli/1.0.9 nghttp2/1.55.0
Accept: */*
if-match: ""
x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date: 20240303T125713Z
Authorization: AWS4-HMAC-SHA256 Credential=<<my aws_access_key_id>>/20240303/<<my region>>/s3/aws4_request,SignedHeaders=host;if-match;x-amz-content-sha256;x-amz-date,Signature=<<my computed signature>>

< HTTP/2 404 
< server: nginx
< date: Sun, 03 Mar 2024 12:57:13 GMT
< content-type: text/html; charset=utf-8
< content-length: 231
< x-amz-request-id: caf8681c3e346fc0
< 
* Connection #1 to host <<my endpoint>> left intact
* Found bundle for host: 0x7f7ed00813d0 [can multiplex]
* Re-using existing connection with host <<my endpoint>>
* [HTTP/2] [5] OPENED stream for https://<<my endpoint>>/<<my bucket>>/home/sbkarasik/junk/ts-test-v2/attributes.json
* [HTTP/2] [5] [:method: PUT]
* [HTTP/2] [5] [:scheme: https]
* [HTTP/2] [5] [:authority: s3.<<my region>>.amazonaws.com]
* [HTTP/2] [5] [:path: /<<my bucket>>/home/sbkarasik/junk/ts-test-v2/attributes.json]
* [HTTP/2] [5] [user-agent: tensorstore/0.1 libcurl/8.4.0 BoringSSL zlib/1.2.13 brotli/1.0.9 nghttp2/1.55.0]
* [HTTP/2] [5] [accept: */*]
* [HTTP/2] [5] [content-type: application/octet-stream]
* [HTTP/2] [5] [content-length: 126]
* [HTTP/2] [5] [x-amz-content-sha256: 4703d59e791f75f12dd6daf6bce5907cd682f8294957c79eb7c2e409054bb9b6]
* [HTTP/2] [5] [x-amz-date: 20240303T125713Z]
* [HTTP/2] [5] [authorization: AWS4-HMAC-SHA256 Credential=<<my aws_access_key_id>>/20240303/<<my region>>/s3/aws4_request,SignedHeaders=content-length;content-type;host;x-amz-content-sha256;x-amz-date,Signature=<<my computed signature>>]
> PUT /<<my bucket>>/home/sbkarasik/junk/ts-test-v2/attributes.json HTTP/2
Host: s3.<<my region>>.amazonaws.com
User-Agent: tensorstore/0.1 libcurl/8.4.0 BoringSSL zlib/1.2.13 brotli/1.0.9 nghttp2/1.55.0
Accept: */*
Content-Type: application/octet-stream
Content-Length: 126
x-amz-content-sha256: 4703d59e791f75f12dd6daf6bce5907cd682f8294957c79eb7c2e409054bb9b6
x-amz-date: 20240303T125713Z
Authorization: AWS4-HMAC-SHA256 Credential=<<my aws_access_key_id>>/20240303/<<my region>>/s3/aws4_request,SignedHeaders=content-length;content-type;host;x-amz-content-sha256;x-amz-date,Signature=<<my computed signature>>

* We are completely uploaded and fine
< HTTP/2 405 
< server: nginx
< date: Sun, 03 Mar 2024 12:57:13 GMT
< content-type: application/json
< content-length: 0
< 
* Connection #1 to host <<my endpoint>> left intact

minotru avatar Mar 03 '24 13:03 minotru

I checked the tensorstore source code and found that Host: s3.<<my region>>.amazonaws.com comes from code in s3_endpoint.cc.

I see that there is a way to set a custom host_header, so I asked my team what the proper header value is for our storage. Will let you know if it worked with the proper host_header.

Still, it would be great if tensorstore could automatically determine the proper host_header from the given endpoint.

I see that awscli needs only endpoint_url and region in ~/.aws/config, so there should be a way to determine the proper host_header automatically.
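
Just to illustrate the idea (a hypothetical Python sketch, not tensorstore code): the host could be taken directly from the endpoint URL:

from urllib.parse import urlsplit

def host_from_endpoint(endpoint_url: str) -> str:
    """Hypothetical helper: derive the Host header value from an endpoint URL."""
    # e.g. "https://storage.example.com:8443/some/prefix" -> "storage.example.com:8443"
    return urlsplit(endpoint_url).netloc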


Sorry for opening an issue without sufficient prior debugging (I was trying to read the source code to find the root cause and a way to enable more logs, and did not see this env flag in the docs) and creating all that fuss in the comments. At least I hope this debug history can help someone.

minotru avatar Mar 03 '24 13:03 minotru

Yes, I think we should probably have host_header default to the host given by endpoint.

jbms avatar Mar 04 '24 00:03 jbms

From my testing, just assuming that the endpoint works as the Host isn't a good answer. Instead we issue a HEAD request https://github.com/google/tensorstore/blob/a063e7ab1f61c9db80006f2e615b5a7ebe8da896/tensorstore/kvstore/s3/s3_endpoint.cc#L249 and trust the Host: header from the server's response. This works on AWS and many AWS-compatible services such as MinIO and localstack. If the service doesn't respond correctly, the host_header can be specified directly.

You can see in the log that the Host: response of the initial request is:

Host: s3.<<my region>>.amazonaws.com

Which s3-compatible storage mechanism do you use, and why doesn't it respond with an appropriate host header?

...

I think that the awscli uses this code to generate the host header: https://github.com/boto/botocore/blob/359db3f85909396e6cabc1ddee4868a7b1efb1ce/botocore/auth.py#L70

laramiel avatar Mar 04 '24 07:03 laramiel

Thank you for such a fast response!

Which s3-compatible storage mechanism do you use

I am using storage by Nebius.ai (https://nebius.ai/docs/storage/). As far as I can see, it is fully compatible with the S3 API; awscli works fine with a custom endpoint_url and region in ~/.aws/config.

My team recommended using host_header = "storage.nemax.nebius.cloud", and it worked like this!
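
For reference, the minimal example from above works for me once host_header is added to the kvstore spec (same masking as before):

import tensorstore as ts

dataset = ts.open({
    'driver': 'n5',
    'kvstore': {
        'driver': 's3',
        'bucket': '<<my bucket>>',
        'path': 'home/sbkarasik/junk/ts-test-v2',
        'endpoint': '<<my endpoint>>',
        'aws_region': '<<my region>>',
        # Explicit Host header for the S3-compatible storage:
        'host_header': 'storage.nemax.nebius.cloud',
    },
    'metadata': {
        'compression': {'type': 'gzip'},
        'dataType': 'uint32',
        'dimensions': [1000, 20000],
        'blockSize': [100, 100],
    },
    'create': True,
}).result()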

This seems to be aligned with _host_from_url from the awscli source code that @laramiel mentioned. It would be good to support the same logic in tensorstore.

About the HEAD request in s3_endpoint.cc: as far as I can see, it creates an S3CustomFormatter here: https://github.com/google/tensorstore/blob/a063e7ab1f61c9db80006f2e615b5a7ebe8da896/tensorstore/kvstore/s3/s3_endpoint.cc#L255

This generates a host_header according to the template "s3.%s.amazonaws.com" (given that host_header is empty by default): https://github.com/google/tensorstore/blob/a063e7ab1f61c9db80006f2e615b5a7ebe8da896/tensorstore/kvstore/s3/s3_endpoint.cc#L99

And then, I assume, it makes a request with this s3.%s.amazonaws.com-style host_header, the Nebius.ai server responds with 405 (which seems legitimate), and the error is propagated up to the Python SDK.

minotru avatar Mar 04 '24 13:03 minotru

@laramiel, is it right that the current s3 driver implementation performs each upload as a single HTTP request?

I am trying to save a 10+ GB checkpoint via the S3 driver in OCDBT format. OCDBT packs most files into one big blob. It seems the S3 driver tries to upload this blob in one request, and the server responds with 413 Request Entity Too Large.

It seems that multi-part upload is crucial for S3 driver usability.

Overall, I suppose that using the aws-sdk for the S3 integration (for both auth and IO) might be beneficial, because it brings a robust implementation and simplifies the S3 driver code.

minotru avatar Mar 04 '24 16:03 minotru

Yeah, we don't have multi-part uploads yet.

laramiel avatar Mar 04 '24 20:03 laramiel

I've pushed a change to improve the Host: header handling; with the next release you should no longer need the "host_header" field.

laramiel avatar Mar 06 '24 00:03 laramiel

@laramiel , thank you so much!

Also, I have just noticed a new commit https://github.com/google/tensorstore/commit/b30c91b1337ac4c2c74330f14f6872c1e4f3a040 about target_data_file_size for ocdbt. That's also great! I am not sure about the performance difference between a real multi-part upload and a set of smaller files, but I will try to check.
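
If I understand the commit correctly, the idea would be something like this (only a sketch; I am assuming target_data_file_size is a member of the ocdbt kvstore spec and have not verified the right value to use):

# Sketch: cap OCDBT data file size so each underlying S3 PUT stays
# well below the server's request size limit (the 413 errors above).
ocdbt_kvstore_spec = {
    'driver': 'ocdbt',
    'base': {
        'driver': 's3',
        'bucket': '<<my bucket>>',
        'endpoint': '<<my endpoint>>',
        'aws_region': '<<my region>>',
        'host_header': 'storage.nemax.nebius.cloud',
    },
    # Hypothetical value: roughly 512 MB per data file.
    'target_data_file_size': 512 * 1024 * 1024,
}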

minotru avatar Mar 12 '24 14:03 minotru

Note: there are debug logs for "s3", "http", and "curl", which can be enabled via the flag --tensorstore_verbose_logging=s3,http=1,curl or the environment variable TENSORSTORE_VERBOSE_LOGGING=s3,http=1,curl.

And the curl logging is also available.
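
For example, from Python it can be set like this (assuming the variable is set before tensorstore issues any requests):

import os

# Enable verbose logging for the s3, http, and curl debug categories.
os.environ["TENSORSTORE_VERBOSE_LOGGING"] = "s3,http=1,curl"

import tensorstore as ts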

laramiel avatar Jun 11 '24 17:06 laramiel