
WARN Utils: An error occurred while trying to read the S3 bucket lifecycle configuration java.lang.NullPointerException

Open pedromb opened this issue 7 years ago • 18 comments

Hello guys, I am getting this warning:

WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException
        at java.lang.String.startsWith(String.java:1385)
        at java.lang.String.startsWith(String.java:1414)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:102)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:98)
        at scala.collection.Iterator$class.exists(Iterator.scala:753)
        at scala.collection.AbstractIterator.exists(Iterator.scala:1157)
        at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
        at scala.collection.AbstractIterable.exists(Iterable.scala:54)
        at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:98)
        at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:361)
        at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

I have seen this issue here before, but it still occurs for me.

I do have a lifecycle configuration for my bucket. I've traced this warning to this piece of code:

def checkThatBucketHasObjectLifecycleConfiguration(
      tempDir: String,
      s3Client: AmazonS3Client): Unit = {
    try {
      val s3URI = createS3URI(Utils.fixS3Url(tempDir))
      val bucket = s3URI.getBucket
      assert(bucket != null, "Could not get bucket from S3 URI")
      val key = Option(s3URI.getKey).getOrElse("")
      val hasMatchingBucketLifecycleRule: Boolean = {
        val rules = Option(s3Client.getBucketLifecycleConfiguration(bucket))
          .map(_.getRules.asScala)
          .getOrElse(Seq.empty)
        rules.exists { rule =>
          // Note: this only checks that there is an active rule which matches the temp directory;
          // it does not actually check that the rule will delete the files. This check is still
          // better than nothing, though, and we can always improve it later.
          rule.getStatus == BucketLifecycleConfiguration.ENABLED && key.startsWith(rule.getPrefix)
        }
      }
      if (!hasMatchingBucketLifecycleRule) {
        log.warn(s"The S3 bucket $bucket does not have an object lifecycle configuration to " +
          "ensure cleanup of temporary files. Consider configuring `tempdir` to point to a " +
          "bucket with an object lifecycle policy that automatically deletes files after an " +
          "expiration period. For more information, see " +
          "https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html")
      }
    } catch {
      case NonFatal(e) =>
        log.warn("An error occurred while trying to read the S3 bucket lifecycle configuration", e)
    }
  }

I believe the exception is thrown by this call: key.startsWith(rule.getPrefix)

I checked the AWS SDK documentation: getPrefix returns null if the prefix was never set via setPrefix, so for rules created without a prefix it will always return null here and key.startsWith throws a NullPointerException.

I have very limited knowledge of the Amazon SDK and Scala, so I'm not really sure about this.
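
If that diagnosis is right, a null-safe variant of the check might look like the sketch below. This is only my guess at a fix, not the project's code; since a lifecycle rule created without a prefix applies to the whole bucket, a null prefix is treated as the empty string, which every key starts with.

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration

// Hypothetical null-safe version of the lifecycle-rule check above.
def bucketHasMatchingLifecycleRule(
    s3Client: AmazonS3Client,
    bucket: String,
    key: String): Boolean = {
  val rules = Option(s3Client.getBucketLifecycleConfiguration(bucket))
    .map(_.getRules.asScala)
    .getOrElse(Seq.empty)
  rules.exists { rule =>
    // getPrefix can return null when the rule was created without a prefix;
    // treat that as "", which matches every key in the bucket.
    val prefix = Option(rule.getPrefix).getOrElse("")
    rule.getStatus == BucketLifecycleConfiguration.ENABLED && key.startsWith(prefix)
  }
}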

pedromb avatar May 10 '17 15:05 pedromb

The same here:

17/05/12 13:57:56 WARN redshift.Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException
	at java.lang.String.startsWith(String.java:1405)
	at java.lang.String.startsWith(String.java:1434)
	at com.databricks.spark.redshift.Utils$$anonfun$5.apply(Utils.scala:140)
	at com.databricks.spark.redshift.Utils$$anonfun$5.apply(Utils.scala:136)
	at scala.collection.Iterator$class.exists(Iterator.scala:919)
	at scala.collection.AbstractIterator.exists(Iterator.scala:1336)
	at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
	at scala.collection.AbstractIterable.exists(Iterable.scala:54)
	at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:136)
	at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:389)
	at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
...
...

I thought it had to do with not setting a bucket prefix when configuring the lifecycle policy, but even after setting one the warning keeps showing up (although the operation succeeds).

dmnava avatar May 12 '17 12:05 dmnava

+1

divnalam avatar Jun 13 '17 09:06 divnalam

+1

Joe29 avatar Jun 21 '17 14:06 Joe29

+1

WajdiF avatar Jul 10 '17 13:07 WajdiF

+1

markdessain avatar Jul 18 '17 16:07 markdessain

I think I have a PR that fixes this (you need to upgrade AWS SDK dependencies). See: https://github.com/databricks/spark-redshift/pull/357
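
If you build with sbt and want to try the same idea without waiting for the PR, forcing a newer AWS SDK could look roughly like the line below (the version number is only illustrative, not taken from the PR):

dependencyOverrides += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.271"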

BorePlusPlus avatar Jul 21 '17 17:07 BorePlusPlus

Disclaimer: I'm new to Scala. I am using the jars from http://repo1.maven.org/maven2/com/databricks/ and I am getting this same error when trying to write to Redshift. I am running this from the Spark shell for debugging; every time I do, I get this error and my shell hangs, and the operation never completes.

watchingant avatar Aug 24 '17 15:08 watchingant

+1 seeing this in pyspark

mylons avatar Oct 18 '17 17:10 mylons

+1

gorros avatar Dec 13 '17 13:12 gorros

It seems to me that this issue has something to do with the com.amazonaws.aws-java-sdk-s3 dependency being provided. When I run the Spark job locally from the IDE it works, but it does not work on AWS EMR. I think the deployed jar uses the libraries provided by the AWS environment, which are probably newer and conflict with spark-redshift, while locally it uses the old libraries and therefore works. As a temporary solution I suggest using the fix by @BorePlusPlus, pulled in via JitPack:

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>
...
    <dependency>
        <groupId>com.github.BorePlusPlus</groupId>
        <artifactId>spark-redshift_2.11</artifactId>
        <version>bucket-lifecycle-check-upgrade-SNAPSHOT</version>
    </dependency>
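
The sbt equivalent of that Maven snippet (my own translation, untested) would be roughly:

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.BorePlusPlus" % "spark-redshift_2.11" % "bucket-lifecycle-check-upgrade-SNAPSHOT"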

gorros avatar Dec 14 '17 10:12 gorros

I agree that this is a super annoying error, since the stack trace is so long. This solution worked for me:

spark.sparkContext.setLogLevel("ERROR")

I got the suggestion from here.

RyanZotti avatar Mar 09 '18 18:03 RyanZotti

+1

aymkhalil avatar Sep 07 '18 23:09 aymkhalil

For us it turned out that the file being read simply wasn't there yet, which produced

"An error occurred while trying to read the S3 bucket lifecycle configuration java.lang.NullPointerException"

and a subsequent

"S3ServiceException:Access Denied,Status 403,Error AccessDenied,"

It would seem we are reading before the file is available - parallel processing woes?

An object that is not found results in a 403 (Access Denied) rather than a 404 (Not Found) because different return codes would provide an attacker with useful information: a 404 leaks whether an object of a given name actually exists, and a simple dictionary-style attack could then enumerate all of the objects in someone's bucket. For a similar reason, a login page should never emit "Invalid user" and "Invalid password" for the two authentication failure scenarios; it should always emit "Invalid credentials".

A fix would then be to:

1. Check the regions. For example, in our case the region shown in the AWS console link was "us-west-2", but the contents were actually hosted in ap-southeast-1.

2. Check permissions. By default, permissions are given to the AWS user only. If you use IAM authentication with access keys, you must add permissions for "authenticated users" in S3.

"...If the object you request does not exist, the error Amazon S3 returns depends on whether you also have the s3:ListBucket permission.

If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error. If you don't have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error."

Keep your role policy as in the helloV post. Go to S3, select your bucket, click Permissions, then Bucket Policy, and try something like this:

{
    "Version": "2012-10-17",
    "Id": "Lambda access bucket policy",
    "Statement": [
        {
            "Sid": "All on objects in bucket lambda",
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::AWSACCOUNTID:root" },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::BUCKET-NAME/*"
        },
        {
            "Sid": "All on bucket by lambda",
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::AWSACCOUNTID:root" },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::BUCKET-NAME"
        }
    ]
}

Choose the right solution for your architecture. Hope this helps.

dvelle avatar Dec 14 '18 17:12 dvelle

By now I have implemented multiple Spark applications with this library, and the warning has not affected anything.

gorros avatar Dec 14 '18 18:12 gorros

I solved the problem by reordering these parameters. Before, I had:

sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_key)

After:

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

1311543 avatar Aug 19 '19 23:08 1311543

+1 Any news how to fix this properly?

kovacshuni avatar Mar 10 '21 10:03 kovacshuni

+1 Just saw this happening using databricks runs (using Spark 3.2.1).

feliperoos avatar Aug 13 '22 05:08 feliperoos

I was able to silence this by setting the logger for this class to ERROR:

import org.apache.log4j.{Level, Logger}

// insert this line after Spark session initialization
Logger.getLogger("com.databricks.spark.redshift.Utils$").setLevel(Level.ERROR)

vnktsh avatar Feb 01 '24 04:02 vnktsh