aws-sdk-java icon indicating copy to clipboard operation
aws-sdk-java copied to clipboard

TransferMonitor: CopyImpl.waitForCopyResult() never returns if there is a failure.

Open garypgithub opened this issue 5 years ago • 6 comments

I am initiating a copy by calling TransferManager.copy with a CopyObjectRequest, an AmazonS3 (client), and a TransferStateChangeListener. When the TransferManager's transferStateChange method is called, I check to see if the TransferState is "failed". If so, I call Copy.waitForCopyResult() on the Copy object returned from my call to TransferManager.copy. I do that so that I catch the exception generated by the copy request and log it. My problem is that, under certain failure circumstances, the copy is never marked as done and so the Copy.waitForCopyResult() never returns and I am hung up there forever.

Recently, we had an error in one of our Permission JSONs for a particular IAM user. This caused the s3.copyObject(..) at line 145 in method copyInOneChunk of class CopyCallable to blow up with a 403 return code from S3. The exception thrown was propagated up to the call() method of class CopyMonitor at line 132. This, in turn, was caught by the catch(Exception) at line 146 of CopyMonitor. From there, a failed state was set into the TransferStateChangeListener and my listener was called.

As stated above, my listener tried to do a Copy.waitForCopyResult(). However this never returned because the waitForCopyResult in CopyImpl tries to do a get() on the Future returned by CopyMonitor.getFuture(). When get() is called by CopyImpl at line 63, this, in turn, calls get() on the Future for CopyMonitor which hangs because CopyMonitor never completed since it is in the middle of the exception processing at line 147.

I realize that all of this is very confusing. To recreate though, it's pretty simple. Write a small program to try to use TransferManager.copy to copy to a bucket to which you do not have permission. In invokeing TransferManager.copy, supply a class that implements TransferStateChangeListener and, in the transferStateChange method, if the state is Failed, execute Copy.waitForCopyResult() and you'll see the hang. In your class that implements, TransferStateChangeListener, you'll have to have a set method that accepts the Copy object returned from your call to TransferManager.copy. Please let me know if you need me to send you a small test case that recreates this issue.

Thanks, Gary

garypgithub avatar Mar 25 '20 23:03 garypgithub

Hi @garypgithub, I appreciate you taking the time to describe the problem. Yes, a small test case would help in reproducing the issue, it speeds up and makes the investigation process easier.

debora-ito avatar Mar 27 '20 19:03 debora-ito

I'm closing this for now, but feel free to reopen when you get the chance to provide a minimal working example.

debora-ito avatar Apr 22 '20 21:04 debora-ito

Sorry for the delay. Here are two simple classes that demonstrate the issue. The user must have the access key id and access key secret set up so that she can access the source bucket and source object. The destination bucket should not exist. Also be sure to set the client region to the region where the bucket is located. Notice that the message "Finished waiting for copy result" never appears because we are hanging on the waitForCopyResult. issue2278.zip

@debora-ito I am not able to reopen this issue since you closed it and you are a Collaborator. Can you please open the issue for me? Thanks.

garypgithub avatar Apr 23 '20 05:04 garypgithub

I've looked into this a bit more because this bug is killing us at the moment. We're transferring s3 objects between buckets, sometime in different accounts, and if we don't have permissions to the destination bucket, our code just hangs at the copy.waitForCopyResult() statement. I've come up with a possible solution. Unfortunately, if the user passes in his or her own ExecutorService whose submit method returns a Future that is not a FutureTask, options are limited.

Line numbers here refer to lines in the com.amazonaws.services.s3.transfer.internal.CopyMonitor revision that is tagged with 1.11.772 which is current as of today. However, this code hasn't change in a while. The problem occurs when line 132 (CopyResult result = multipartCopyCallable.call();) throws an error. In our case, this is because the s3 client doesn't have access to the target bucket and object. This is caught in line 146. Control is given to the listener but the Copy object is not set up properly which causing the copy.waitForCopyResult() to hang. This code, when inserted between lines 147 and 148, fixes the problem for us:

Future<CopyResult> f = getFuture();
if (f != null) {
  if (f instanceof FutureTask) {
    // If it's a FutureTask, we're able to reflect the actual exception back to the caller
    ((FutureTask) f).setException(e);
  } else if (f instanceof CompletableFuture) {
    ((CompletableFuture) f).completeExceptionally(e);
  } else if (f instanceof ...) {
    ((...) f).completeExceptionally(e);  // Or other similar method
  } else {
    // It's a Future but not one we know about so all we can do is cancel
    f.cancel(true);
  }
}

You can enumerate all of the standard classes that implement Future if you'd like. We pass in an ExecutorService that happens to return a FutureTask from its submit method so this works for us. YMMV

garypgithub avatar Apr 30 '20 04:04 garypgithub

I realized that the above solution won't work because setException() (in FutureTask) is protected and therefore not accessible in this context. I think the only safe thing to do is cancel. However, by calling cancel, you can't reflect the exception back up to the caller. Since the exception is never logged anywhere, it's pretty much impossible for the user to know what the exception was.

Whatever you can do to improve this situation would be appreciated as it is causing us major problems.. Perhaps you'd consider adding a failureReason field and getFailureReason() call to the Transfer interface. This would allow you to reflect the reason for a Failed status back to the caller.

Thanks, Gary

garypgithub avatar May 05 '20 04:05 garypgithub

Can someone from the AWS team please review this issue and provide some feedback?

Thanks, Gary

garypgithub avatar Jun 02 '20 18:06 garypgithub

@garypgithub I'm sorry for losing track of this issue. Are you still being bit by this?

I can't repro the error with a more recent version of the SDK (I used 1.12.435). I used the project in zip file you provided, I have access to the source bucket, the source file has 25MB, and I don't have access to the destination bucket.

Before reaching the listener, the Copy call fails with 404 Not Found - 2023-03-28 19:05:23,528 [main] DEBUG com.amazonaws.request - Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: xxx; S3 Extended Request ID: xxx

Let me know if you still want us to investigate this.

debora-ito avatar Mar 29 '23 02:03 debora-ito

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions[bot] avatar Apr 03 '23 03:04 github-actions[bot]