
Wrong version of aws-java-sdk-bundle in sagemaker-spark 1.4.5

Open jobvisser03 opened this issue 1 year ago • 1 comment

System Information

  • Spark or PySpark: PySpark
  • SDK Version: 1.4.5
  • Spark Version: 3.3.0

Describe the problem

I just spent 3 days trying to fix this, to no avail. My setup on an AWS notebook instance uses the following jars:

  • aws-java-sdk-bundle-1.11.901.jar
  • aws-java-sdk-core-1.12.262.jar
  • aws-java-sdk-kms-1.12.262.jar
  • aws-java-sdk-s3-1.12.262.jar
  • aws-java-sdk-sagemaker-1.12.262.jar
  • aws-java-sdk-sagemakerruntime-1.12.262.jar
  • aws-java-sdk-sts-1.12.262.jar
  • hadoop-aws-3.3.1.jar
  • sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar
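For reference, a minimal sketch of how a session with these jars might be created; the jar directory and app name are illustrative, not from the report:

    # Sketch only: assumes the jars listed above sit in a local directory.
    from pyspark.sql import SparkSession

    JAR_DIR = "/home/ec2-user/jars"  # hypothetical location
    JARS = ",".join(JAR_DIR + "/" + j for j in [
        "aws-java-sdk-bundle-1.11.901.jar",
        "hadoop-aws-3.3.1.jar",
        "sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar",
        # ...remaining aws-java-sdk-*-1.12.262.jar files added the same way
    ])

    spark = (
        SparkSession.builder
        .appName("s3a-dotted-bucket-repro")  # illustrative name
        .config("spark.jars", JARS)  # ships the jars to driver and executors
        .getOrCreate()
    )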

Problem:

  • Upon reading a file from S3, the error below is thrown. It is caused by a bug in the httpclient jar that PySpark depends on, reported here: https://issues.apache.org/jira/browse/HADOOP-18159?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17554677#comment-17554677

Based on the workarounds suggested in the issue above, I tried four things:

  1. upgraded aws-java-sdk-bundle to version 1.12.262 to match the other jars → didn't work
  2. downgraded httpclient to version 4.5.10 → didn't work
  3. set "-Dcom.amazonaws.sdk.disableCertChecking=true" to disable SSL certificate checking in the AWS SDK (https://github.com/aws/aws-sdk-java-v2/issues/1786), sketched after this list → didn't work
  4. read from a bucket whose name doesn't contain dots (.) → works
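Rough shape of attempt 3; the Spark configuration keys are standard, but treating this as the exact invocation used is an assumption:

    # Sketch: pass the SDK system property to driver and executor JVMs.
    # Per the report above, this did not resolve the error.
    # Note: in client mode the driver JVM is already running, so the driver
    # option has to be supplied at launch (e.g. via spark-submit --conf).
    from pyspark.sql import SparkSession

    OPT = "-Dcom.amazonaws.sdk.disableCertChecking=true"
    spark = (
        SparkSession.builder
        .config("spark.driver.extraJavaOptions", OPT)
        .config("spark.executor.extraJavaOptions", OPT)
        .getOrCreate()
    )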

Minimal repro / logs

22/08/30 11:00:22 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://comp.data.sci.data.tst/some/folder/export_date=20220822.
org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://comp.data.sci.data.tst/some/folder/export_date=20220822: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)

  • Exact command to reproduce:

    Works:
    df = spark.read.parquet("s3a://aws-bucket-with-dashes/file_0_1_0.snappy.parquet")

    Doesn't work:
    df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")

It's not possible to rename the buckets because many data consumers depend on them.
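One related S3A setting that is sometimes suggested for dotted bucket names, though it is not among the four attempts above and is untested here: path-style access keeps the bucket name out of the TLS hostname.

    # Untested assumption in this thread: force path-style requests so S3A
    # calls s3.<region>.amazonaws.com/<bucket>/... and the wildcard
    # certificate matches.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # spark.hadoop.* keys are copied into the Hadoop configuration
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )
    df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")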

jobvisser03 commented Aug 30 '22

  1. you shouldn't be duplicating the individual AWS SDK jars (including the SageMaker ones) alongside aws-java-sdk-bundle: the bundle contains everything and is meant to be shaded precisely to avoid transitive dependency issues.
  2. it's probably a problem with other things on your classpath (a quick check is sketched below).
  3. S3A connector support for buckets with dots in their names is incomplete and won't be fixed.

steveloughran commented Oct 05 '22
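On point 2 above, a minimal sketch (assuming a live PySpark session named spark, with driver and Python on the same host) for spotting conflicting httpclient and AWS SDK jars on the driver classpath:

    # Sketch: scan the driver JVM classpath for jars that could conflict.
    import os

    cp = spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path")
    for entry in cp.split(os.pathsep):
        name = os.path.basename(entry)
        if "httpclient" in name or "aws-java-sdk" in name:
            print(entry)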