[Bug]: streamingnode: AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: v2.6.3 (helm chart 5.0.4)
- Deployment mode: cluster
- MQ type: pulsar
Current Behavior
We have a fresh Milvus installation, and the streamingnode component keeps crashing with an AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload:
milvus-cluster-streamingnode-6695d4cd6d-2gf5p 0/1 CrashLoopBackOff 193 (4m2s ago) 21h
milvus-cluster-streamingnode-6695d4cd6d-m6v4c 0/1 Running 192 (6m36s ago) 21h
It was working fine against the same S3 server with Milvus 2.4.x; the problems started when we tried 2.6.x.
Expected Behavior
No response
Steps To Reproduce
log:
  level: debug
cluster:
  enabled: true
streaming:
  enabled: true
image:
  all:
    repository: harbor/proxy-public-hub-docker-com/milvusdb/milvus
  tools:
    repository: harbo/proxy-public-hub-docker-com/milvusdb/milvus-config-tool
etcd:
  global:
    imageRegistry: harbor/proxy-public-hub-docker-com
  replicaCount: 3
  persistence:
    storageClass: pcz-ha-zone-1-pv-latebinding
pulsarv3:
  enabled: true
  images:
    zookeeper:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    bookie:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    autorecovery:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    broker:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    toolset:
      repository: harbo/proxy-public-hub-docker-com/apachepulsar/pulsar
    proxy:
      repository: harbo/proxy-public-hub-docker-com/apachepulsar/pulsar
    functions:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    pulsar_manager:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar-manager
  bookkeeper:
    volumes:
      persistence: true
      journal:
        size: 10Gi
        storageClassName: pcz-ha-zone-1-pv-latebinding
      ledgers:
        size: 50Gi
        storageClassName: pcz-ha-zone-1-pv-latebinding
  zookeeper:
    volumes:
      persistence: true
      data:
        size: 20Gi
        storageClassName: pcz-ha-zone-1-pv-latebinding
minio:
  enabled: false
  image:
    repository: harbor/proxy-public-hub-docker-com/minio/minio
  mode: distributed
  persistence:
    storageClass: pcz-ha-zone-1-pv-latebinding
externalS3:
  enabled: true
  host: ""
  port: "8082"
  accessKey: ""
  secretKey: ""
  useSSL: true
  bucketName: "abc"
  rootPath: ""
  useIAM: false
  cloudProvider: "aws"
  iamEndpoint: ""
  region: ""
  useVirtualHost: false
standalone:
  replicas: 3
  persistence:
    persistentVolumeClaim:
      storageClass: pcz-ha-zone-1-pv-latebinding
proxy:
  replicas: 3
rootCoordinator:
  enabled: false
  replicas: 3
  activeStandby:
    enabled: true
queryCoordinator:
  enabled: false
  replicas: 3
  activeStandby:
    enabled: true
queryNode:
  replicas: 3
indexNode:
  replicas: 3
  enabled: false
dataCoordinator:
  enabled: false
  replicas: 3
  activeStandby:
    enabled: true
dataNode:
  replicas: 3
  resources:
    limits:
      cpu: 2000m
      memory: 2Gi
    requests:
      cpu: 1000m
      memory: 1Gi
streamingNode:
  replicas: 3
  resources:
    limits:
      cpu: 2000m
      memory: 2Gi
    requests:
      cpu: 1000m
      memory: 1Gi
mixCoordinator:
  replicas: 3
  enabled: true
  activeStandby:
    enabled: true
indexCoordinator:
  enabled: false
  replicas: 3
  activeStandby:
    enabled: true
metrics:
  serviceMonitor:
    enabled: true
service:
  type: LoadBalancer
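For reference, the externalS3 block above maps roughly to the following minio section of the rendered milvus.yaml (field names taken from the standard Milvus config; the exact rendering depends on the chart version, and the empty host/credentials are kept as placeholders):

minio:
  address: ""            # from externalS3.host
  port: 8082             # from externalS3.port
  accessKeyID: ""        # from externalS3.accessKey
  secretAccessKey: ""    # from externalS3.secretKey
  useSSL: true
  bucketName: "abc"
  rootPath: ""
  useIAM: false
  cloudProvider: "aws"
  iamEndpoint: ""
  region: ""
  useVirtualHost: false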
Milvus Log
E20251014 21:46:35.823302 24 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461478028016095115/461478028016095116/461478028017295145/0/461478028018094347' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
W20251014 21:46:35.823572 24 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
terminate called after throwing an instance of 'std::runtime_error'
what(): When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461478028016095115/461478028016095116/461478028017295145/0/461478028018094347' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
SIGABRT: abort
PC=0x7f0584fca9fc m=8 sigcode=18446744073709551610
signal arrived during cgo execution
Anything else?
We tested a multipart S3 upload from the same pod against the same S3 server, and it works:
bash s3-multipart-test-v1.sh
🔧 Generating 10MB x 5 test file...
🪄 Starting multipart upload...
✅ UploadId: KzjbiVjXXjFGbGluzHGC4m8txmFUJC0SmJs2JtJDdWiFPHyTu9Lz6mFO6Q
⬆️ Uploading part 1 (part-aa)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 2 (part-ab)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 3 (part-ac)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 4 (part-ad)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 5 (part-ae)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
🧾 Completing multipart upload...
{
  "Location": "https://xxxtest/multipart-test.bin",
  "Bucket": "xxx",
  "Key": "test/multipart-test.bin",
  "ETag": "\"b112a68f6cb4e22726d733bdaf03535a-5\""
}
✅ Multipart upload completed successfully!
🧹 Cleaning up temp files...
🎉 Done.
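For reference, the flow exercised by that script is the standard CreateMultipartUpload → UploadPart → CompleteMultipartUpload sequence; a minimal boto3 sketch of the same test (the endpoint URL, credentials, bucket, and key below are placeholders, not values from this deployment):

# Minimal multipart-upload smoke test against an S3-compatible endpoint.
# Endpoint, credentials, bucket, and key are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.internal:8082",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

bucket, key = "abc", "test/multipart-test.bin"
part_size = 10 * 1024 * 1024        # 10 MiB parts (S3 minimum is 5 MiB except the last part)
data = b"\0" * (part_size * 5)

# 1. Start the multipart upload and remember the UploadId.
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# 2. Upload each part and collect its ETag.
parts = []
for number, offset in enumerate(range(0, len(data), part_size), start=1):
    resp = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=upload_id,
        PartNumber=number, Body=data[offset:offset + part_size],
    )
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

# 3. Complete the upload; this is the call that fails with NoSuchUpload in the report.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
print("multipart upload completed")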
/assign @LoveEachDay /unassign
/assign @shaoting-huang
Hi @dmitryzykov, thanks for the report!
Starting from v2.6.x, Milvus introduced multipart upload in the storage layer. This error message usually indicates that the multipart upload ID is invalid — it might have been aborted, completed, or cleaned up by the server.
Here are a few things you can double-check:
- Check if any network timeouts or retries happened during upload.
- Look in the logs for any message like: "When aborting multiple part upload for key".
If possible, please share the full Milvus logs or any reproducible materials with us. If the logs are too large, you can contact us directly at [email protected].
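For context, NoSuchUpload is what an S3-compatible server returns when CompleteMultipartUpload (or UploadPart) references an upload ID the server no longer knows about. A minimal boto3 sketch that provokes the same error by aborting an upload before completing it (endpoint, credentials, bucket, and key are placeholders):

# Demonstrates the NoSuchUpload error: complete a multipart upload whose
# UploadId has already been aborted. Endpoint/credentials/bucket are placeholders.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.internal:8082",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

bucket, key = "abc", "test/no-such-upload.bin"
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
part = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                      PartNumber=1, Body=b"x" * (5 * 1024 * 1024))

# Abort the upload: the server discards the UploadId and all uploaded parts.
s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)

try:
    # Completing with the now-invalid UploadId returns HTTP 404 NoSuchUpload,
    # the same error seen in the streamingnode logs.
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": [{"PartNumber": 1, "ETag": part["ETag"]}]},
    )
except ClientError as e:
    print(e.response["Error"]["Code"])  # NoSuchUpload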
Hi @shaoting-huang, I have sent the logs to your email.
@shaoting-huang any progress on this?
Hi @dmitryzykov
I attempted to reproduce the issue but was unable to observe the same behavior.
Could you please confirm whether the issue is consistently reproducible? If so, kindly provide the SDK scripts you used so that we can reproduce the issue on our end.
Additionally, please check whether there are any potential network issues that might be contributing to the problem.
Hi @shaoting-huang, we tested again with the latest 2.6.4. It works with MinIO and AWS S3 as external storage, but it does not work with NetApp StorageGRID 11.9.0.9.
Our steps to reproduce: on a completely fresh installation of Milvus we try to restore a backup, and now we can see the same error in the data logs:
[2025/11/05 23:19:46.367 +00:00] [INFO] [importv2/task_import.go:277] ["start to sync import data"] [taskID=461997475248416881] [jobID=461997475248214984] [collectionID=461997475248214953] [type=ImportTask]
[2025/11/05 23:19:46.367 +00:00] [INFO] [syncmgr/sync_manager.go:159] ["sync mgr sumbit task with key"] [key=461997475248416882]
[2025/11/05 23:19:46.367 +00:00] [INFO] [syncmgr/task.go:244] ["sync new split columns"] [segmentID=461997475248416882] [columnGroups="[\"[GroupID: 0, ColumnIndices: [0 3 4], Fields: [100 0 1]]\",\"[GroupID: 1, ColumnIndices: [2], Fields: [102]]\",\"[GroupID: 101, ColumnIndices: [1], Fields: [101]]\"]"]
[2025/11/05 23:19:46.486 +00:00] [INFO] [importv2/task_import.go:277] ["start to sync import data"] [taskID=461997475248416881] [jobID=461997475248214984] [collectionID=461997475248214953] [type=ImportTask]
[2025/11/05 23:19:46.486 +00:00] [INFO] [syncmgr/sync_manager.go:159] ["sync mgr sumbit task with key"] [key=461997475248416882]
[ERROR] 2025-11-05 23:19:46.673 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:46.673 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:46.673 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:47.675 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:47.675 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:47.675 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/api/token
[ERROR] 2025-11-05 23:19:48.676 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:48.676 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:48.676 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:49.677 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:49.678 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:49.678 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/meta-data/iam/security-credentials
[ERROR] 2025-11-05 23:19:50.679 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:50.680 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:50.680 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:51.681 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:51.681 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:51.681 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/meta-data/placement/availability-zone
[ERROR] 2025-11-05 23:19:52.683 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:52.683 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:52.683 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:53.685 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:53.685 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:53.685 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/meta-data/placement/availability-zone
I20251105 23:19:53.715765 33 scope_metric.cpp:49] [SERVER][~FuncScopeMetric][milvus][][CGO Call] slow function NewPackedWriterWithStorageConfig done with duration 7.997663294s
I20251105 23:19:53.716043 20 scope_metric.cpp:49] [SERVER][~FuncScopeMetric][milvus][][CGO Call] slow function NewPackedWriterWithStorageConfig done with duration 7.31485271s
I20251105 23:19:53.717676 34 scope_metric.cpp:49] [SERVER][~FuncScopeMetric][milvus][][CGO Call] slow function NewPackedWriterWithStorageConfig done with duration 8.059913749s
[ERROR] 2025-11-05 23:19:54.021 CurlHttpClient [140709638149696] Curl returned error code 56 - Failure when receiving data from the peer
[ERROR] 2025-11-05 23:19:54.022 AWSXmlClient [140709638149696] HTTP response code: -1
Resolved remote host IP address: 10.240.96.17
Request ID:
Exception name:
Error message: curlCode: 56, Failure when receiving data from the peer
7 response headers:
connection : CLOSE
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:53 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12953738
x-amz-request-id : 1762384738467820
x-ntap-sg-trace-id : ba4df73e38005d04
[WARN] 2025-11-05 23:19:54.022 AWSClient [140709638149696] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[WARN] 2025-11-05 23:19:54.022 AWSClient [140709638149696] Request failed, now waiting 200 ms before attempting again.
[WARN] 2025-11-05 23:19:54.237 AWSErrorMarshaller [140709638149696] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[WARN] 2025-11-05 23:19:54.237 AWSErrorMarshaller [140709638149696] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] 2025-11-05 23:19:54.237 AWSXmlClient [140709638149696] HTTP response code: 404
Resolved remote host IP address: 10.240.96.17
Request ID: 1762384737420907
Exception name: NoSuchUpload
Error message: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
8 response headers:
connection : close
content-length : 486
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12844467
x-amz-request-id : 1762384737420907
x-ntap-sg-trace-id : a7b5b7b3bc907437
[WARN] 2025-11-05 23:19:54.237 AWSClient [140709638149696] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
E20251105 23:19:54.237784 33 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214953/461997475248214954/461997475248416869/0/461997475267908837' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
terminate called after throwing an instance of 'std::runtime_error'
what(): When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214953/461997475248214954/461997475248416869/0/461997475267908837' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
W20251105 23:19:54.238021 33 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
SIGABRT: abort
PC=0x7ff9f9a4b9fc m=15 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 460 gp=0xc000c06700 m=15 mp=0xc000fb6008 [syscall]:
[ERROR] 2025-11-05 23:19:54.242 CurlHttpClient [140709743027776] Curl returned error code 56 - Failure when receiving data from the peer
[ERROR] 2025-11-05 23:19:54.243 AWSXmlClient [140709743027776] HTTP response code: -1
Resolved remote host IP address: 10.240.96.17
Request ID:
Exception name:
Error message: curlCode: 56, Failure when receiving data from the peer
7 response headers:
connection : CLOSE
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12675774
x-amz-request-id : 1762384738477376
x-ntap-sg-trace-id : 17c8fc89dd48d427
[WARN] 2025-11-05 23:19:54.243 AWSClient [140709743027776] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[WARN] 2025-11-05 23:19:54.243 AWSClient [140709743027776] Request failed, now waiting 200 ms before attempting again.
[ERROR] 2025-11-05 23:19:54.418 CurlHttpClient [140709629756992] Curl returned error code 56 - Failure when receiving data from the peer
[ERROR] 2025-11-05 23:19:54.419 AWSXmlClient [140709629756992] HTTP response code: -1
Resolved remote host IP address: 10.240.96.17
Request ID:
Exception name:
Error message: curlCode: 56, Failure when receiving data from the peer
7 response headers:
connection : CLOSE
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:53 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12048421
x-amz-request-id : 1762384738444984
x-ntap-sg-trace-id : b339144a776e0936
[WARN] 2025-11-05 23:19:54.419 AWSClient [140709629756992] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[WARN] 2025-11-05 23:19:54.419 AWSClient [140709629756992] Request failed, now waiting 200 ms before attempting again.
[WARN] 2025-11-05 23:19:54.459 AWSErrorMarshaller [140709743027776] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[WARN] 2025-11-05 23:19:54.459 AWSErrorMarshaller [140709743027776] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] 2025-11-05 23:19:54.460 AWSXmlClient [140709743027776] HTTP response code: 404
Resolved remote host IP address: 10.240.96.17
Request ID: 1762384737434846
Exception name: NoSuchUpload
Error message: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
8 response headers:
connection : close
content-length : 486
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12748897
x-amz-request-id : 1762384737434846
x-ntap-sg-trace-id : cc307563793829e5
[WARN] 2025-11-05 23:19:54.460 AWSClient [140709743027776] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
E20251105 23:19:54.460337 20 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214953/461997475248214954/461997475248416882/0/461997475267912412' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
W20251105 23:19:54.460465 20 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
terminate called recursively
[WARN] 2025-11-05 23:19:54.627 AWSErrorMarshaller [140709629756992] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[WARN] 2025-11-05 23:19:54.628 AWSErrorMarshaller [140709629756992] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] 2025-11-05 23:19:54.628 AWSXmlClient [140709629756992] HTTP response code: 404
Resolved remote host IP address: 10.240.96.17
Request ID: 1762384737466228
Exception name: NoSuchUpload
Error message: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
8 response headers:
connection : close
content-length : 486
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12715199
x-amz-request-id : 1762384737466228
x-ntap-sg-trace-id : e09f67f5323eac6e
[WARN] 2025-11-05 23:19:54.628 AWSClient [140709629756992] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
E20251105 23:19:54.628978 34 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214963/461997475248214964/461997475248416034/0/461997475267908048' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
terminate called recursively
W20251105 23:19:54.629158 34 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
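A side note on the EC2MetadataClient errors above: even with useIAM: false, the AWS SDK's default credential/region resolution probes the EC2 instance metadata service (169.254.169.254), and each failed probe costs roughly a second here, which appears to account for the ~8 s NewPackedWriterWithStorageConfig calls logged right after. In environments without IMDS this probing can usually be suppressed by setting AWS_EC2_METADATA_DISABLED=true in the pod environment; a sketch of how that might look in the values file (the extraEnv key is an assumption, check your chart version):

streamingNode:
  extraEnv:
    - name: AWS_EC2_METADATA_DISABLED   # assumption: honored by the AWS SDK default credential chain
      value: "true"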
@dmitryzykov It seems to be a NetApp compatibility issue. Unfortunately, we don't have a NetApp device and are not familiar with StorageGRID. Could you contact the NetApp support team to help improve the compatibility, or give us some advice?
It would be good if we could fix the issue with some config settings.
We switched to NetApp ONTAP and it works without issues.
We've opened a NetApp support ticket related to StorageGRID and referenced this issue.