sagemaker-spark
Wrong version of aws-java-sdk-bundle in sagemaker-spark 1.4.5
System Information
- Spark or PySpark: 3.3.0
- SDK Version: 1.4.5
- Spark Version: 3.3.0
Describe the problem
I just spent three days trying to fix this, to no avail. My setup on an AWS notebook instance uses the following jars (a sketch of how they end up on the Spark classpath follows below):
- aws-java-sdk-bundle-1.11.901.jar
- aws-java-sdk-core-1.12.262.jar
- aws-java-sdk-kms-1.12.262.jar
- aws-java-sdk-s3-1.12.262.jar
- aws-java-sdk-sagemaker-1.12.262.jar
- aws-java-sdk-sagemakerruntime-1.12.262.jar
- aws-java-sdk-sts-1.12.262.jar
- hadoop-aws-3.3.1.jar
- sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar
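For reference, a minimal sketch of how a jar list like this might be wired into a PySpark session on the notebook instance (the jar directory and the use of spark.jars are assumptions, not details from the report):

from pyspark.sql import SparkSession

# Illustrative only: jar locations and the spark.jars mechanism are assumed.
jar_dir = "/home/ec2-user/jars"
jars = ",".join([
    f"{jar_dir}/aws-java-sdk-bundle-1.11.901.jar",
    f"{jar_dir}/aws-java-sdk-core-1.12.262.jar",
    f"{jar_dir}/aws-java-sdk-kms-1.12.262.jar",
    f"{jar_dir}/aws-java-sdk-s3-1.12.262.jar",
    f"{jar_dir}/aws-java-sdk-sagemaker-1.12.262.jar",
    f"{jar_dir}/aws-java-sdk-sagemakerruntime-1.12.262.jar",
    f"{jar_dir}/aws-java-sdk-sts-1.12.262.jar",
    f"{jar_dir}/hadoop-aws-3.3.1.jar",
    f"{jar_dir}/sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar",
])

spark = (
    SparkSession.builder
    .appName("sagemaker-spark-s3a")
    .config("spark.jars", jars)
    .getOrCreate()
)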
Problem:
- Upon reading a file from S3, the error below is thrown. It is caused by a bug in the httpclient jar dependency of PySpark and is reported here: https://issues.apache.org/jira/browse/HADOOP-18159?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17554677#comment-17554677
Based on the suggested workarounds in the thread above, I tried four things:
- upgrade aws-java-sdk-bundle to version 1.12.262 to match the other jars → didn't work
- downgrade httpclient to version 4.5.10 → didn't work
- configure the aws-java-sdk to disable SSL certificate checking with "-Dcom.amazonaws.sdk.disableCertChecking=true" (https://github.com/aws/aws-sdk-java-v2/issues/1786) → didn't work (see the sketch after this list for how the flag was passed)
- read from a bucket whose name doesn't contain dots (.) → works
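For the third item, this is roughly how the flag was passed to the driver and executor JVMs (a sketch; the exact configuration keys used are an assumption):

from pyspark.sql import SparkSession

# Attempted workaround: disable SDK certificate checking (did not help here).
opts = "-Dcom.amazonaws.sdk.disableCertChecking=true"
spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", opts)
    .config("spark.executor.extraJavaOptions", opts)
    .getOrCreate()
)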
Minimal repro / logs
22/08/30 11:00:22 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://comp.data.sci.data.tst/some/folder/export_date=20220822.
org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://comp.data.sci.data.tst/some/folder/export_date=20220822: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
  at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208)
  at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277)
  at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
  at scala.Option.getOrElse(Option.scala:189)
Exact command to reproduce:
Works:
df = spark.read.parquet("s3a://aws-bucket-with-dashes/file_0_1_0.snappy.parquet")
Doesn't work:
df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")
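For completeness, the same repro as a self-contained script (the SparkSession setup here is an assumption; only the two read calls come from the report):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-dotted-bucket-repro").getOrCreate()

# Succeeds: bucket name uses dashes only
df_ok = spark.read.parquet("s3a://aws-bucket-with-dashes/file_0_1_0.snappy.parquet")

# Fails with the certificate error shown above: bucket name contains dots
df_bad = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")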
It's not possible to rename the bucket due to the many data consumers that depend on it.
- you shouldn't be duplicating the SageMaker SDK jars alongside the sdk bundle, as the bundle contains everything and is shaded precisely to avoid transitive dependency issues (a trimmed classpath sketch follows after this list)
- it's probably a problem with other things on your classpath
- s3a connector support for buckets with dots is incomplete and won't be fixed
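A sketch of the trimmed classpath implied by the first point above (an interpretation, not a verified fix; jar paths are placeholders, and aws-java-sdk-bundle 1.11.901 is the version hadoop-aws 3.3.1 is normally paired with):

from pyspark.sql import SparkSession

# Keep only the shaded SDK bundle plus the non-SDK jars; drop the individual
# aws-java-sdk-*-1.12.262.jar files so two SDK versions are not mixed.
jar_dir = "/home/ec2-user/jars"
jars = ",".join([
    f"{jar_dir}/aws-java-sdk-bundle-1.11.901.jar",
    f"{jar_dir}/hadoop-aws-3.3.1.jar",
    f"{jar_dir}/sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar",
])

spark = SparkSession.builder.config("spark.jars", jars).getOrCreate()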