aws-c-s3 icon indicating copy to clipboard operation
aws-c-s3 copied to clipboard

[auto-ranged PUT]: CompleteMultipartUpload called while 1 multipart upload is missing

Open grrtrr opened this issue 1 year ago • 6 comments

CompleteMultipartUpload in auto-ranged-PUT failed due to a missing second (and final) UploadPart.

We ran into this problem with aws-c-s3 0.1.51 and aws-sdk-cpp 1.10.54 on Linux

Problem description

Our API issued an S3CrtClient->PutObject, and it resulted in the following error:

Expected: 'x is ok', with x := 'output_blob_->close()' [av::status::Status]
  x = PutObject() failed
  where: cloud/aws/s3/s3_streambuf.cc:93
  extra: s3://perception-prod-training-data/opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats: HTTP response code: 400
Resolved remote host IP address: 
Request ID: 2BAQ8WRZGGH3PPG1
Exception name: InvalidPart
Error message: Unable to parse ExceptionName: InvalidPart Message: One or more of the specified parts could not be found.  The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.
7 response headers:
connection : close
content-type : application/xml
date : Wed, 15 Feb 2023 16:39:26 GMT
server : AmazonS3
transfer-encoding : chunked
x-amz-id-2 : oFLatf5OZVV1Ny34iYIEhqA3Ft/+QRXSLzz6/K/c36nd4grUqR6kmhKJS/U32GftV1/GdmVEbOY=
x-amz-request-id : 2BAQ8WRZGGH3PPG1

Backtrace (most recent call first)
#10	 at 0x560a5b826f76 in av::cloud::aws::s3::S3Ostreambuf::PutObject()
#9	 at 0x560a5b82b63d in av::cloud::aws::s3::S3Ostreambuf::close()
#8	 at 0x560a5b822170 in av::cloud::aws::s3::S3Ostream::close()
#7	 at 0x560a5af329b8 in av::perception::s2a::DataExtractionModule::on_shutdown()
#6	 at 0x560a5bf40593 in av::framework::Module::on_executor_shutdown()
#5	 at 0x560a5ac5ed91 in av::detail::FuncImpl<>::invoke()
#4	 at 0x560a5beacc8b in av::BaseThreadPool::do_next()
#3	 at 0x560a5beacecd in std::thread::_State_impl<>::_M_run()
#2	 at 0x7fc1887c76df in <?>
#1	 at 0x7fc188ca26db in start_thread

The PutObject in frame 10 invokes the S3CrtClient->PutObject call.

Further investigation showed that the first UploadPart succeeded (visible in list-parts, there was no evidence (neither list-parts nor API logs) that the second UploadPart completed.

In our logs, there was no further aws-c-s3 error indicating a failed operation.

Open Question

The CompleteMultipartUpload request uses the ETags of the completed requests, so how could the CompleteMultiPartUpload have been sent with the ETag of the second UploadPart?

Perhaps it was sent with only 1 ETag (that of the first, successfully completed UploadPart).

grrtrr avatar Mar 16 '23 18:03 grrtrr

Hi,

Thank you for reporting the issue. Could you please attach the full CRT logs and provide some reproduction steps? Thank you!

waahm7 avatar Mar 22 '23 15:03 waahm7

@waahm7 - the problem is not easily reproducible, it seems to be a timing-related problem. So far, after 10s of thousands of batch jobs, we have only data for this instance. (It is possible that this has happened more frequently in production, since batch jobs are retried on failure.).

With regard to the CRT logs, we log at WARN level to stderr; the only other log entry other than the above error trace was this one:

[ERROR] 2023-02-15 16:39:26.623 S3MetaRequest [140468667021056] id=0x7fc125951d00 Meta request cannot recover from error 14343 (Invalid response status from request). (request=0x7fc15e00a600, response status=400)
terminate called after throwing an instance of 'av::CheckException'
  what():  Check failure at perception/s2a/dataset_extraction/data_extraction_module.cc:574:
  Expected: 'x is ok', with x := 'output_blob_->close()' [av::status::Status]
  x = PutObject() failed
  where: cloud/aws/s3/s3_streambuf.cc:93
  extra: s3://perception-prod-training-data/opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats: HTTP response code: 400
Resolved remote host IP address: 
Request ID: 2BAQ8WRZGGH3PPG1
[... rest as above]

The following is the summary of a multi-week-long effort to narrow down the cause of the problem:

  1. The CompleteMultipartUpload that failed with above error had the following data:
opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats
Object Size: -1 Bytes
Part Count: 1-2
Multipart Upload ID: bSMw15PyZS8eNZgoSxZ.qdZw4Awfm4beQGf9ERvdc8izMN5yFd6quULmunaMpJ8gUh2ZAjjidDlVr97lOBlSwhssgFeb_Zsb5CHN5fPYX6qHLjyRm7uxJZtnPerZGhbwsBU5xNStVDsjlr.Qnhe18s387J.7VybyeA2H9yrUnfk-
  1. As per the Part Count field, it seemed to expect 2 parts. However, only one part was uploaded: the CompleteMultipartUpload with ID 2BAQ8WRZGGH3PPG1 was preceded by 1 UploadPart with ID 2BAR2P5XFDM71V14 on the same TCP connection. There was no trace of another UploadPart call.
  2. The s3 key of the failed multipart upload did not show up in list-multipart-uploads or in list-parts.

Since the CompleteMultipartUpload request requires adding the ETag values returned by the individual UploadPart responses into the XML request body, the cause may be that either

  • both UploadPart calls completed, but the second one failed after the API response returned the ETag (seems unlikely); or
  • due some kind of race condition or unsuitable timing, the CompleteMultipartUpload "thought" there was only 1 instead of 2 ETags/UploadParts present, and thus tried to complete the multipart upload with a single UploadPart.

grrtrr avatar Mar 22 '23 17:03 grrtrr

Thank you for the details. Is the request getting paused and resumed later?

waahm7 avatar Mar 23 '23 20:03 waahm7

No, it does not use aws_s3_meta_request_pause.

grrtrr avatar Mar 23 '23 20:03 grrtrr

@grrtrr there have been some update since this issue was opened. Are you still running into this with the latest version of this repo?

jmklix avatar Nov 15 '23 19:11 jmklix

@jmklix what are the updates and how do they fix the issue described here? In particular, have test cases been added to ensure the condition does not happen?

grrtrr avatar Nov 15 '23 20:11 grrtrr