spark-redshift

java.sql.SQLException: [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist

Open lehnerm opened this issue 9 years ago • 22 comments

We are currently using your redshift driver as a sink of a spark stream that copies batches of ~5 minutes from a Kafka log directly into Redshift.
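
Roughly, each batch is written like this (a minimal sketch using the standard spark-redshift DataFrame options; the JDBC URL, table name, and bucket are placeholders, not our real values):

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Minimal sketch of the per-batch sink. Each ~5-minute Kafka batch arrives as
    // a DataFrame and is appended to Redshift; spark-redshift stages Avro files
    // under the s3n:// tempdir and then issues a COPY against the cluster.
    def writeBatchToRedshift(batch: DataFrame): Unit = {
      batch.write
        .format("com.databricks.spark.redshift")
        .option("url", "jdbc:redshift://example-cluster:5439/db?user=USER&password=PASS") // placeholder
        .option("dbtable", "events")               // placeholder table name
        .option("tempdir", "s3n://bucket/folder/") // temporary folder on s3n
        .mode(SaveMode.Append)
        .save()
    }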

After a random amount of time, mostly 3 to 6 days, the spark driver will fail with the following exception:

java.sql.SQLException: [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid X,ExtRid X,CanRetry 1
Details: 
 -----------------------------------------------
  error:  S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid X,ExtRid X,CanRetry 1
  code:      8001
  context:   S3 key being read : s3://bucket/folder/d7b2b0ad-1fdd-4777-8a9c-46e5449e7681/part-r-00004-bf98d605-595e-402b-9109-8a6cde5ea7ee.avro
  query:     321881492
  location:  table_s3_scanner.cpp:345
  process:   query6_45 [pid=16853]

The file does exist on S3 when I check it afterwards, so this seems to be a rare race condition between the execution of the COPY command against Redshift and the write of the data files to s3n. We are using the s3n:// filesystem for the temporary folder.

Any ideas on how to fix this? Thanks!!

lehnerm avatar Dec 11 '15 12:12 lehnerm

Hi @lehnerm,

Just to help me narrow down potential causes, could you let me know which Spark Redshift version you're using and which AWS region(s) are hosting your Spark driver, S3 bucket, and Redshift cluster?

The files are definitely written to S3 before we issue the Redshift COPY command, so my hunch is that any race-condition is due to S3 eventual consistency issues.

In spark-redshift 0.5.1+ (see #99), we use a manifest to tell Redshift the exact set of files which should be loaded. If we were to just pass Redshift the name of the directory containing the Avro files, without explicitly listing every file's name, then eventual consistency could mean that Redshift wouldn't see some partitions and would just silently skip them since it didn't know to expect them.
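
To make the idea concrete, here is a rough illustration (not the exact JSON that spark-redshift emits; the part file names are invented) of a manifest that enumerates every staged file, including the optional mandatory flag:

    // Rough illustration only: a Redshift COPY manifest names every staged Avro
    // file explicitly instead of pointing COPY at a directory, and the optional
    // "mandatory" flag makes COPY fail loudly if a listed file cannot be found.
    // Part file names below are invented.
    val manifestJson =
      """{
        |  "entries": [
        |    {"url": "s3://bucket/folder/d7b2b0ad-1fdd-4777-8a9c-46e5449e7681/part-r-00000.avro", "mandatory": true},
        |    {"url": "s3://bucket/folder/d7b2b0ad-1fdd-4777-8a9c-46e5449e7681/part-r-00001.avro", "mandatory": true}
        |  ]
        |}""".stripMargin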

What I suspect might be happening is that you are hitting such an eventual consistency issue and this error message means that Redshift caught the issue and failed rather than silently losing data.

There's one aspect of this that I'm slightly confused about, though. According to Redshift's guide on managing data consistency:

All regions provide read-after-write consistency for uploads of new objects with unique object keys. [...] Amazon S3 provides eventual consistency in all regions for overwrite operations. Creating new file names, or object keys, in Amazon S3 for each data load operation provides strong consistency in all regions.

We always produce new keys, so according to this I think our S3 writes should appear to be strongly consistent as long as the S3 bucket and Redshift cluster are in the same AWS region.

Do you happen to be doing a cross-region copy (i.e. is your S3 bucket in a different AWS region than your Redshift cluster)?

JoshRosen avatar Dec 11 '15 20:12 JoshRosen

Hi @JoshRosen,

thanks for your reply, here's an overview of our setup:

  • Spark 1.5.0 on YARN (Amazon EMR 4.1)
  • Spark-Redshift 0.5.2
  • Scala 2.10 / Java 7
  • Redshift JDBC 4.1 (1.1.7.1007)

The job uses the new manifest.json and generates new files every time. However, it also uses 24 partitions, so 24 files are written to S3 on every run, which might cause issues with S3? I'm a bit confused about the S3 consistency rules as well. The only thing I did notice is that the second job repartitions to the number of Redshift nodes (8) due to a shuffle operation in the middle of the job, and that significantly improves stability (though it does not ensure it completely). I'll give repartitioning to the number of Redshift nodes a try.
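
Concretely, the tweak I'm going to try looks roughly like this (a speculative sketch only; batch is the streaming batch DataFrame and jdbcUrl and the table name are placeholders):

    import org.apache.spark.sql.SaveMode

    // Speculative tweak: repartition down to the number of Redshift nodes (8)
    // before writing, so only 8 Avro files are staged per batch instead of 24.
    // Fewer objects means fewer chances for one of them to lag behind in S3
    // replication, though it does not remove the race entirely.
    batch.repartition(8)
      .write
      .format("com.databricks.spark.redshift")
      .option("url", jdbcUrl)                    // placeholder JDBC URL
      .option("dbtable", "events")               // placeholder table name
      .option("tempdir", "s3n://bucket/folder/")
      .mode(SaveMode.Append)
      .save()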

EMR, Redshift and the S3 Bucket are all in the eu-west1 region. However, the s3n:// URI that I use for the temp folder does not include the region name.

With the job running every 5 minutes from the same machine, UUID collision should be highly unlikely.

lehnerm avatar Dec 14 '15 11:12 lehnerm

I've been doing some research on S3 consistency. From the S3 docs, one would expect that the observed behavior is not possible, given that dataFrame.save() in Spark is a fully blocking operation.

I'm trying to get hold of somebody at AWS who can shed some light on the details of S3's read-after-write consistency for us.

Since #99 already lists the bucket before issuing the COPY command, I'll add some diagnostic logging to track whether at least EMR's view of S3 is always fully consistent. The unfortunate truth, however, is that even if that were the case, we can't really rely on EMR's and Redshift's views of S3 being the same.

lehnerm avatar Dec 16 '15 19:12 lehnerm

@lehnerm, I think that I may have some contacts on the S3 and Redshift teams, so I'll forward this thread to them to see if they have any insights.

Ping @aarondav, this is that weird S3 issue that I mentioned yesterday.

JoshRosen avatar Dec 16 '15 20:12 JoshRosen

Taking another look at the Amazon S3 Data Consistency Model docs (emphasis mine):

Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers. If a PUT request is successful, your data is safely stored. However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:

  • A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
  • [...]

According to the S3 FAQ:

Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.

On the AWS forum, someone asked about how to reconcile these statements, and according to ChrisP@AWS:

Read-after-write consistency is only valid for GETS of new objects - LISTS might not contain the new objects until the change is fully propagated.

So given this, if we used a LIST operation we might silently skip missing files. But since we've specified all filenames in a manifest, I would imagine that Redshift would directly issue GETs for the files listed in the manifest and hence would see the objects.

JoshRosen avatar Dec 16 '15 20:12 JoshRosen

In addition, according to the documentation for Redshift manifest files, emphasis mine:

You can explicitly specify which files to load by using a manifest file. When you use a manifest file, COPY enforces strong consistency by searching secondary servers if it does not find a listed file on the primary server. The manifest file can be configured with an optional mandatory flag. If mandatory is true and the file is not found, COPY returns an error.

JoshRosen avatar Dec 16 '15 20:12 JoshRosen

@lehnerm, a couple of other questions that I just thought of:

  • Have you configured any Hadoop or Spark OutputCommitter settings to be different than their defaults?
  • Are you using some sort of DirectOutputCommitter?
  • Is Spark's speculative execution enabled?

As of Spark 1.5.0, I'm not aware of any bugs related to these components which would explain the behavior that you saw here, but I just wanted to check.
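
For reference, here is a quick way to check those settings from a spark-shell on the driver (assuming a live SparkContext sc; these are standard Spark/Hadoop configuration keys, nothing spark-redshift-specific):

    // Inspect the settings asked about above. The second argument is the value
    // reported when the key has not been set explicitly.
    val speculation = sc.getConf.getBoolean("spark.speculation", false)
    val committerAlgorithm = sc.hadoopConfiguration
      .get("mapreduce.fileoutputcommitter.algorithm.version", "1")
    val committerClass = sc.hadoopConfiguration
      .get("mapred.output.committer.class", "<default FileOutputCommitter>")

    println(s"speculation=$speculation, " +
      s"fileoutputcommitter.algorithm.version=$committerAlgorithm, " +
      s"output.committer.class=$committerClass")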

JoshRosen avatar Dec 16 '15 21:12 JoshRosen

We know that this is clustered, and we know that we can beat a request if it hasn't propagated yet: the level of consistency is "strong", not "absolute". Using a manifest with the "mandatory" boolean allows us to know when we failed to get an object, so it looks like we are protected: when we fail to get the object, we know. This is probably working as designed. I suspect that the AWS documentation could be made clearer, but the fact is that we have a closed control loop: we're not going to trip and fall, we just have to re-take a step sometimes when we run very fast.

lankygit avatar Dec 17 '15 15:12 lankygit

Hi @JoshRosen,

sorry for my late response. To answer your questions:

Have you configured any Hadoop or Spark OutputCommitter settings to be different than their defaults? No.

Are you using some sort of DirectOutputCommitter? No; I guess that follows from the first answer.

Is Spark's speculative execution enabled? No, assuming it's not enabled by default.

Would adding a retry mechanism that catches the Redshift exception and then repeats just the COPY statement after a configurable delay be a valid solution from your perspective? It seems like we are only seeing minor consistency glitches lasting a couple of seconds, and once that happens we can wait for the file to appear.
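
Something along these lines at the level of our driver code (just a sketch of the idea, not an existing spark-redshift option; it retries the whole write rather than only the COPY, since the COPY itself is issued internally by spark-redshift; writeBatchToRedshift is the placeholder write from my original post):

    import java.sql.SQLException

    // Sketch of the proposed retry: catch the SQLException from the failed COPY,
    // wait a configurable delay, then retry the write. Retrying the whole write
    // re-stages the files under a fresh temp directory rather than repeating
    // only the COPY statement.
    def writeWithRetry(batch: org.apache.spark.sql.DataFrame,
                       retriesLeft: Int = 3,
                       delayMillis: Long = 10000L): Unit = {
      try {
        writeBatchToRedshift(batch)
      } catch {
        case e: SQLException
            if retriesLeft > 0 && e.getMessage.contains("S3ServiceException") =>
          Thread.sleep(delayMillis) // give S3 time to become consistent
          writeWithRetry(batch, retriesLeft - 1, delayMillis)
      }
    }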

lehnerm avatar Dec 21 '15 16:12 lehnerm

Based on a more careful reading of the announcement about the availability of strong read-after-write consistency in all regions (https://forums.aws.amazon.com/ann.jspa?annID=3112), it sounds like this might actually be saying that each region has an endpoint which provides these consistency guarantees, not that all endpoints in all regions have them. I guess one question is whether Redshift uses these endpoints for its writes and reads, and whether our AWS client does the same. If possible, I wonder if we can pin spark-redshift to use the s3-external-1.amazonaws.com endpoint, which definitely supports the newer, stronger consistency guarantees.
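
If anyone wants to experiment with this, the following is purely speculative: fs.s3a.endpoint is a real Hadoop (S3A) setting, but whether the s3n filesystem used in this job honours an equivalent option, and whether Redshift's own reads would go through that endpoint at all, is exactly the open question:

    // Speculative experiment only: pin the Hadoop S3A filesystem to the
    // read-after-write endpoint. This would affect Spark's writes, not Redshift's
    // reads, and s3n:// paths would need to be switched to s3a:// for it to apply.
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-external-1.amazonaws.com")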

JoshRosen avatar Jan 12 '16 01:01 JoshRosen

Is this still being looked into?

camspilly avatar May 17 '16 18:05 camspilly

Anyone looking into this issue? I'm getting the same error. @lehnerm : did you have any work-around for this?

caseyvu avatar Jun 27 '16 03:06 caseyvu

Has there been any update on this issue?

camspilly avatar Sep 01 '16 13:09 camspilly

We are hitting the same issue occasionally, using Spark 1.6.2 and "com.databricks" %% "spark-redshift" % "1.1.0". Any updates on this?

tzhang101 avatar Sep 06 '16 21:09 tzhang101

The S3 docs also say that:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.

(Emphasis mine.)

Does anybody who understands the nitty-gritty of FileOutputCommitter better know if it indeed ends up triggering a GET or HEAD prior to writing? (Assuming algorithm 2, since that's the new default.)

pnc avatar May 01 '17 12:05 pnc

Hi @JoshRosen, when I run the Redshift COPY command I get the error "Problem reading manifest file - S3ServiceException:The specified key does not exist.,Status 404". I don't know what this error means.

lchhieu avatar May 10 '17 02:05 lchhieu

@lchhieu, can you post the exact command (perhaps with any role or IAM info scrubbed) you're running that triggers this error? Also, does the error happen every time, or just on some runs?

pnc avatar May 10 '17 03:05 pnc

@pnc, thank you

lchhieu avatar May 10 '17 04:05 lchhieu

I was discussing the consistency issue with someone from Databricks at a recent Spark Summit, and they thought that using EMRFS's consistent view option would alleviate it. I'm here to report that it does not seem to... We (very occasionally) get this error for an individual partition file when trying to write a DataFrame to Redshift:

S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey

I asked AWS support to look into it and they said EMRFS would not even be used since Redshift COPY is not a Hadoop job. They suggested waiting 5-10 seconds between writing the files to S3 and calling upon Redshift to COPY, in order to avoid the error. Evidently S3 also 'negatively' caches GET and HEAD requests for up to 90 seconds, so subsequent requests must wait that long after getting the error.

Is there no way to add an option to spark-redshift that injects a little wait time? I'm using their RedshiftJDBC42-1.1.17.1017 driver - might I have better luck with the standard one?
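
For what it's worth, the only workaround I can think of (a sketch under several assumptions, since spark-redshift has no delay option that I know of) is to split the pipeline manually: stage the Avro files with Spark, sleep the 5-10 seconds AWS support suggested, then issue the COPY yourself over plain JDBC. That forgoes the manifest spark-redshift generates, and every name, credential, and IAM role below is a placeholder:

    import java.sql.DriverManager
    import org.apache.spark.sql.DataFrame

    val df: DataFrame = ???                           // the DataFrame to load (placeholder)
    val stagingPath = "s3://bucket/staging/run-0001/" // fresh key per run (placeholder)

    // Stage the Avro files ourselves (requires the spark-avro package on the
    // classpath) instead of letting spark-redshift drive the whole write.
    df.write.format("com.databricks.spark.avro").save(stagingPath)

    Thread.sleep(10000L) // crude settle time between the S3 write and the COPY

    // Issue the COPY over plain JDBC after the delay.
    val conn = DriverManager.getConnection(
      "jdbc:redshift://example-cluster:5439/db", "USER", "PASS")
    try {
      conn.createStatement().execute(
        s"""COPY events
           |FROM '$stagingPath'
           |IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
           |FORMAT AS AVRO 'auto'""".stripMargin)
    } finally {
      conn.close()
    }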

syntropo avatar May 11 '17 15:05 syntropo

Any update on fixing this issue? AWS Glue constantly throws this error, and support is referencing this issue. Any known workaround?

sphinks avatar Jan 14 '19 13:01 sphinks

+1 This is affecting our larger load runs, and we had to set up re-running the job every time.

imranece59 avatar Jan 17 '19 03:01 imranece59

Any update on the above issue?

SumitMoodys avatar Jan 10 '20 14:01 SumitMoodys