
Wrong version of aws-java-sdk-bundle in sagemaker-spark 1.4.5

Open jobvisser03 opened this issue 1 year ago • 1 comment

System Information

  • Spark or PySpark: PySpark
  • SDK Version: 1.4.5
  • Spark Version: 3.3.0

Describe the problem

I just spent 3 days trying to fix this, to no avail. My setup on an AWS notebook instance uses the following jars:

  • aws-java-sdk-bundle-1.11.901.jar
  • aws-java-sdk-core-1.12.262.jar
  • aws-java-sdk-kms-1.12.262.jar
  • aws-java-sdk-s3-1.12.262.jar
  • aws-java-sdk-sagemaker-1.12.262.jar
  • aws-java-sdk-sagemakerruntime-1.12.262.jar
  • aws-java-sdk-sts-1.12.262.jar
  • hadoop-aws-3.3.1.jar
  • sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar
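For reference, a minimal sketch of how a session with these jars might be created; the jar directory and app name are illustrative, not from the report:

    # Sketch only: assumes the jars listed above sit in a local directory.
    from pyspark.sql import SparkSession

    JAR_DIR = "/home/ec2-user/jars"  # hypothetical location
    JARS = ",".join(JAR_DIR + "/" + j for j in [
        "aws-java-sdk-bundle-1.11.901.jar",
        "hadoop-aws-3.3.1.jar",
        "sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar",
        # ...remaining aws-java-sdk-*-1.12.262.jar files added the same way
    ])

    spark = (
        SparkSession.builder
        .appName("s3a-dotted-bucket-repro")  # illustrative name
        .config("spark.jars", JARS)  # ships the jars to driver and executors
        .getOrCreate()
    )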

Problem:

  • Upon reading a file from S3, the error below is thrown. It is caused by a bug in the httpclient jar that PySpark depends on, reported here: https://issues.apache.org/jira/browse/HADOOP-18159?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17554677#comment-17554677

Based on the workarounds suggested in the issue above, I tried four things:

  1. upgraded aws-java-sdk-bundle to version 1.12.262 to match the other jars → didn't work
  2. downgraded httpclient to version 4.5.10 → didn't work
  3. set "-Dcom.amazonaws.sdk.disableCertChecking=true" to disable SSL certificate checking in the AWS SDK (https://github.com/aws/aws-sdk-java-v2/issues/1786), sketched after this list → didn't work
  4. read from a bucket whose name doesn't contain dots (.) → works
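Rough shape of attempt 3; the Spark configuration keys are standard, but treating this as the exact invocation used is an assumption:

    # Sketch: pass the SDK system property to driver and executor JVMs.
    # Per the report above, this did not resolve the error.
    # Note: in client mode the driver JVM is already running, so the driver
    # option has to be supplied at launch (e.g. via spark-submit --conf).
    from pyspark.sql import SparkSession

    OPT = "-Dcom.amazonaws.sdk.disableCertChecking=true"
    spark = (
        SparkSession.builder
        .config("spark.driver.extraJavaOptions", OPT)
        .config("spark.executor.extraJavaOptions", OPT)
        .getOrCreate()
    )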

Minimal repro / logs

22/08/30 11:00:22 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://comp.data.sci.data.tst/some/folder/export_date=20220822.
org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://comp.data.sci.data.tst/some/folder/export_date=20220822: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
    at scala.Option.getOrElse(Option.scala:189)

  • Exact command to reproduce:

    Works:
    df = spark.read.parquet("s3a://aws-bucket-with-dashes/file_0_1_0.snappy.parquet")

    Doesn't work:
    df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")

It's not possible to rename the buckets because many data consumers depend on them.
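One related S3A setting that is sometimes suggested for dotted bucket names, though it is not among the four attempts above and is untested here: path-style access keeps the bucket name out of the TLS hostname.

    # Untested assumption in this thread: force path-style requests so S3A
    # calls s3.<region>.amazonaws.com/<bucket>/... and the wildcard
    # certificate matches.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # spark.hadoop.* keys are copied into the Hadoop configuration
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )
    df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")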

jobvisser03 commented Aug 30 '22

  1. you shouldn't be duplicating the individual AWS SDK jars (including the SageMaker ones) alongside aws-java-sdk-bundle: the bundle contains everything and is meant to be shaded precisely to avoid transitive dependency issues.
  2. it's probably a problem with other things on your classpath (a quick check is sketched below).
  3. S3A connector support for buckets with dots in their names is incomplete and won't be fixed.

steveloughran commented Oct 05 '22
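On point 2 above, a minimal sketch (assuming a live PySpark session named spark, with driver and Python on the same host) for spotting conflicting httpclient and AWS SDK jars on the driver classpath:

    # Sketch: scan the driver JVM classpath for jars that could conflict.
    import os

    cp = spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path")
    for entry in cp.split(os.pathsep):
        name = os.path.basename(entry)
        if "httpclient" in name or "aws-java-sdk" in name:
            print(entry)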