aws-c-s3
aws-c-s3 copied to clipboard
[auto-ranged PUT]: CompleteMultipartUpload called while 1 multipart upload is missing
CompleteMultipartUpload
in auto-ranged-PUT failed due to a missing second (and final) UploadPart
.
We ran into this problem with aws-c-s3
0.1.51 and aws-sdk-cpp
1.10.54 on Linux
Problem description
Our API issued an S3CrtClient->PutObject
, and it resulted in the following error:
Expected: 'x is ok', with x := 'output_blob_->close()' [av::status::Status]
x = PutObject() failed
where: cloud/aws/s3/s3_streambuf.cc:93
extra: s3://perception-prod-training-data/opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats: HTTP response code: 400
Resolved remote host IP address:
Request ID: 2BAQ8WRZGGH3PPG1
Exception name: InvalidPart
Error message: Unable to parse ExceptionName: InvalidPart Message: One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.
7 response headers:
connection : close
content-type : application/xml
date : Wed, 15 Feb 2023 16:39:26 GMT
server : AmazonS3
transfer-encoding : chunked
x-amz-id-2 : oFLatf5OZVV1Ny34iYIEhqA3Ft/+QRXSLzz6/K/c36nd4grUqR6kmhKJS/U32GftV1/GdmVEbOY=
x-amz-request-id : 2BAQ8WRZGGH3PPG1
Backtrace (most recent call first)
#10 at 0x560a5b826f76 in av::cloud::aws::s3::S3Ostreambuf::PutObject()
#9 at 0x560a5b82b63d in av::cloud::aws::s3::S3Ostreambuf::close()
#8 at 0x560a5b822170 in av::cloud::aws::s3::S3Ostream::close()
#7 at 0x560a5af329b8 in av::perception::s2a::DataExtractionModule::on_shutdown()
#6 at 0x560a5bf40593 in av::framework::Module::on_executor_shutdown()
#5 at 0x560a5ac5ed91 in av::detail::FuncImpl<>::invoke()
#4 at 0x560a5beacc8b in av::BaseThreadPool::do_next()
#3 at 0x560a5beacecd in std::thread::_State_impl<>::_M_run()
#2 at 0x7fc1887c76df in <?>
#1 at 0x7fc188ca26db in start_thread
The PutObject
in frame 10 invokes the S3CrtClient->PutObject
call.
Further investigation showed that the first UploadPart succeeded (visible in list-parts
, there was no evidence (neither list-parts
nor API logs) that the second UploadPart
completed.
In our logs, there was no further aws-c-s3
error indicating a failed operation.
Open Question
The CompleteMultipartUpload
request uses the ETags of the completed requests, so how could the CompleteMultiPartUpload
have been sent with the ETag of the second UploadPart?
Perhaps it was sent with only 1 ETag (that of the first, successfully completed UploadPart).
Hi,
Thank you for reporting the issue. Could you please attach the full CRT logs and provide some reproduction steps? Thank you!
@waahm7 - the problem is not easily reproducible, it seems to be a timing-related problem. So far, after 10s of thousands of batch jobs, we have only data for this instance. (It is possible that this has happened more frequently in production, since batch jobs are retried on failure.).
With regard to the CRT logs, we log at WARN
level to stderr
; the only other log entry other than the above error trace was this one:
[ERROR] 2023-02-15 16:39:26.623 S3MetaRequest [140468667021056] id=0x7fc125951d00 Meta request cannot recover from error 14343 (Invalid response status from request). (request=0x7fc15e00a600, response status=400)
terminate called after throwing an instance of 'av::CheckException'
what(): Check failure at perception/s2a/dataset_extraction/data_extraction_module.cc:574:
Expected: 'x is ok', with x := 'output_blob_->close()' [av::status::Status]
x = PutObject() failed
where: cloud/aws/s3/s3_streambuf.cc:93
extra: s3://perception-prod-training-data/opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats: HTTP response code: 400
Resolved remote host IP address:
Request ID: 2BAQ8WRZGGH3PPG1
[... rest as above]
The following is the summary of a multi-week-long effort to narrow down the cause of the problem:
- The
CompleteMultipartUpload
that failed with above error had the following data:
opt/a831200c/s2a/2023-02-15-bless-collect_dking_updateOverlapFeb10_latestIssues/test/36c6f8923fe514d6b5a28ac5dbdea034.rats
Object Size: -1 Bytes
Part Count: 1-2
Multipart Upload ID: bSMw15PyZS8eNZgoSxZ.qdZw4Awfm4beQGf9ERvdc8izMN5yFd6quULmunaMpJ8gUh2ZAjjidDlVr97lOBlSwhssgFeb_Zsb5CHN5fPYX6qHLjyRm7uxJZtnPerZGhbwsBU5xNStVDsjlr.Qnhe18s387J.7VybyeA2H9yrUnfk-
- As per the
Part Count
field, it seemed to expect 2 parts. However, only one part was uploaded: theCompleteMultipartUpload
with ID 2BAQ8WRZGGH3PPG1 was preceded by 1UploadPart
with ID 2BAR2P5XFDM71V14 on the same TCP connection. There was no trace of anotherUploadPart
call. - The s3 key of the failed multipart upload did not show up in
list-multipart-uploads
or inlist-parts
.
Since the CompleteMultipartUpload
request requires adding the ETag
values returned by the individual UploadPart
responses into the XML request body, the cause may be that either
- both
UploadPart
calls completed, but the second one failed after the API response returned theETag
(seems unlikely); or - due some kind of race condition or unsuitable timing, the
CompleteMultipartUpload
"thought" there was only 1 instead of 2ETags
/UploadParts
present, and thus tried to complete the multipart upload with a singleUploadPart
.
Thank you for the details. Is the request getting paused and resumed later?
No, it does not use aws_s3_meta_request_pause
.
@grrtrr there have been some update since this issue was opened. Are you still running into this with the latest version of this repo?
@jmklix what are the updates and how do they fix the issue described here? In particular, have test cases been added to ensure the condition does not happen?