S3 mixer doesn't start

Open marcopasqua opened this issue 3 months ago • 2 comments

Hello everyone, I am using SageMaker connected to an S3 bucket. I downloaded the data and obtained tagging results with Dolma without any issues. However, the final mixing step fails with the following error:

[2024-03-26T11:38:37Z INFO dolma::s3_util] Listing objects in bucket=h-datasets, prefix=pretraining/wikipedia/v0/documents/
thread '<unnamed>' panicked at src/s3_util.rs:249:18: called 'Result::unwrap()' on an 'Err' value: ServiceError(ServiceError { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), extras: Some({"aws_request_id": "M0KWX3F3J6VJX7FS", "s3_extended_request_id": "dHPizkgVIoHo4PH0GgJQVt+LvOXiFOKdN7JKqIP6pDWGqvGLmAxrY4Ct7JSCA3geVAkkJxOCpwo="}) }, meta: ErrorMetadata { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), extras: Some({"aws_request_id": "M0KWX3F3J6VJX7FS", "s3_extended_request_id": "dHPizkgVIoHo4PH0GgJQVt+LvOXiFOKdN7JKqIP6pDWGqvGLmAxrY4Ct7JSCA3geVAkkJxOCpwo="}) } }), raw: Response { inner: Response { status: 301, version: HTTP/1.1, headers: {"x-amz-bucket-region": "us-west-2", "x-amz-request-id": "M0KWX3F3J6VJX7FS", "x-amz-id-2": "dHPizkgVIoHo4PH0GgJQVt+LvOXiFOKdN7JKqIP6pDWGqvGLmAxrY4Ct7JSCA3geVAkkJxOCpwo=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Tue, 26 Mar 2024 11:38:36 GMT", "server": "AmazonS3"}, body: SdkBody { inner: Once(Some(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>h-datasets.s3-us-west-2.amazonaws.com</Endpoint><Bucket>h-datasets</Bucket><RequestId>M0KWX3F3J6VJX7FS</RequestId><HostId>dHPizkgVIoHo4PH0GgJQVt+LvOXiFOKdN7JKqIP6pDWGqvGLmAxrY4Ct7JSCA3geVAkkJxOCpwo=</HostId></Error>")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag { contents: ["aws_smithy_http::connection::CaptureSmithyConnection", "aws_http::user_agent::AwsUserAgent", "aws_sdk_s3::endpoint::Params", "aws_credential_types::credentials_impl::Credentials", "aws_types::region::SigningRegion", "aws_types::region::Region", "aws_credential_types::cache::SharedCredentialsCache", "aws_smithy_types::endpoint::Endpoint", "aws_sig_auth::middleware::Signature", "aws_sig_auth::signer::OperationSigningConfig", "aws_types::SigningService", "alloc::vec::Vec<http::version::Version>", "aws_smithy_http::operation::Metadata"] }, poisoned: false, .. }) } })
note: run with 'RUST_BACKTRACE=1' environment variable to display a backtrace
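
One detail in the response above may be worth checking: the 301 carries an `x-amz-bucket-region` header of `us-west-2`, and a `PermanentRedirect` from S3 usually means the request went to an endpoint in a different region than the bucket's. Below is a minimal Python sketch, not part of Dolma, that pulls that header out of a pasted log and compares it with the region configured in the environment. The helper names are hypothetical, and whether the mixer actually takes its region from `AWS_REGION` or `AWS_DEFAULT_REGION` is an assumption.

```python
import os
import re

# Excerpt of the response headers from the log above.
SAMPLE_LOG = (
    'raw: Response { inner: Response { status: 301, version: HTTP/1.1, '
    'headers: {"x-amz-bucket-region": "us-west-2", '
    '"x-amz-request-id": "M0KWX3F3J6VJX7FS"} } }'
)


def bucket_region_from_log(log_text):
    """Return the region named in an x-amz-bucket-region header, or None."""
    match = re.search(r'"x-amz-bucket-region":\s*"([a-z0-9-]+)"', log_text)
    return match.group(1) if match else None


def configured_region(env=None):
    """Return the region from AWS_REGION or AWS_DEFAULT_REGION, or None.

    Assumption: these are the variables the S3 client consults.
    """
    env = os.environ if env is None else env
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")


def region_mismatch(log_text, env=None):
    """True when the log names a bucket region that differs from the configured one.

    An unset region also counts as a mismatch, since the client would then
    fall back to some default that may not be the bucket's region.
    """
    bucket = bucket_region_from_log(log_text)
    return bucket is not None and bucket != configured_region(env)


if __name__ == "__main__":
    print("bucket region:", bucket_region_from_log(SAMPLE_LOG))
    print("configured region:", configured_region())
    print("mismatch:", region_mismatch(SAMPLE_LOG))
```

If the two regions differ, setting the region variable to the bucket's region before running the mixer would be a cheap thing to try, though this is a guess from the log rather than a confirmed fix.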

Permissions appear to be set up correctly: I can list the objects using my credentials, which are stored in ~/.aws/credentials and also exported as environment variables.

Does anyone have any suggestions? Thank you for your time.

marcopasqua, Mar 26 '24 11:03