
[Bug]: streamingnode: AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload

Open dmitryzykov opened this issue 1 month ago • 9 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: v2.6.3 (helm chart 5.0.4)
- Deployment mode: cluster
- MQ type: pulsar

Current Behavior

We have a fresh Milvus installation, and the streamingnode component keeps crashing with an AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload error:

milvus-cluster-streamingnode-6695d4cd6d-2gf5p     0/1     CrashLoopBackOff   193 (4m2s ago)    21h
milvus-cluster-streamingnode-6695d4cd6d-m6v4c     0/1     Running            192 (6m36s ago)   21h

It was working fine with the same S3 server on Milvus 2.4.x; the problems started when we tried 2.6.x.

Expected Behavior

No response

Steps To Reproduce

Deploy Milvus v2.6.3 (Helm chart 5.0.4) with the following Helm values:

log:
  level: debug
cluster:
  enabled: true
streaming:
  enabled: true  
image:
  all:
    repository: harbor/proxy-public-hub-docker-com/milvusdb/milvus  
  tools:
    repository: harbor/proxy-public-hub-docker-com/milvusdb/milvus-config-tool
etcd:
  global:
    imageRegistry: harbor/proxy-public-hub-docker-com
  replicaCount: 3
  persistence:
    storageClass: pcz-ha-zone-1-pv-latebinding
pulsarv3:
  enabled: true
  images:
    zookeeper:
      repository:  harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    bookie:
      repository:  harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    autorecovery:
      repository:  harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    broker:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    toolset:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    proxy:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    functions:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar
    pulsar_manager:
      repository: harbor/proxy-public-hub-docker-com/apachepulsar/pulsar-manager
  bookkeeper:
    volumes:
      persistence: true
      journal:
        size: 10Gi
        storageClassName: pcz-ha-zone-1-pv-latebinding
      ledgers:
        size: 50Gi
        storageClassName: pcz-ha-zone-1-pv-latebinding        
  zookeeper:
    volumes:
      persistence: true
      data:
        size: 20Gi
        storageClassName: pcz-ha-zone-1-pv-latebinding
minio:
  enabled: false
  image:
    repository: harbor/proxy-public-hub-docker-com/minio/minio
  mode: distributed
  persistence:
    storageClass: pcz-ha-zone-1-pv-latebinding
externalS3:
  enabled: true
  host: ""
  port: "8082"
  accessKey: ""
  secretKey: ""
  useSSL: true
  bucketName: "abc"
  rootPath: ""
  useIAM: false
  cloudProvider: "aws"
  iamEndpoint: ""
  region: ""
  useVirtualHost: false     
standalone:
  replicas: 3
  persistence:
    persistentVolumeClaim:
      storageClass: pcz-ha-zone-1-pv-latebinding      
proxy:
  replicas: 3   
rootCoordinator:
  enabled: false  
  replicas: 3  
  activeStandby:
    enabled: true  
queryCoordinator:
  enabled: false  
  replicas: 3
  activeStandby:
    enabled: true  
queryNode:
  replicas: 3  
indexNode:
  replicas: 3    
  enabled: false
dataCoordinator:
  enabled: false  
  replicas: 3  
  activeStandby:
    enabled: true  
dataNode:
  replicas: 3   
  resources:
    limits:
      cpu: 2000m
      memory: 2Gi
    requests:
      cpu: 1000m
      memory: 1Gi  
streamingNode:
  replicas: 3   
  resources:
    limits:
      cpu: 2000m
      memory: 2Gi
    requests:
      cpu: 1000m
      memory: 1Gi           
mixCoordinator:
  replicas: 3   
  enabled: true
  activeStandby:
    enabled: true  
indexCoordinator:
  enabled: false  
  replicas: 3
  activeStandby:
    enabled: true

metrics:
  serviceMonitor:
    enabled: true      
service:
  type: LoadBalancer
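
With those values saved as values.yaml, the deployment can be installed with something like the following (a sketch; the release name is taken from the pod names above, and the repo alias is an assumption):

# Add the Milvus Helm repo and install chart version 5.0.4 with the values above.
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update
helm upgrade --install milvus-cluster milvus/milvus --version 5.0.4 -f values.yaml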

Milvus Log

E20251014 21:46:35.823302    24 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461478028016095115/461478028016095116/461478028017295145/0/461478028018094347' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
W20251014 21:46:35.823572    24 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
terminate called after throwing an instance of 'std::runtime_error'
  what():  When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461478028016095115/461478028016095116/461478028017295145/0/461478028018094347' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
SIGABRT: abort
PC=0x7f0584fca9fc m=8 sigcode=18446744073709551610
signal arrived during cgo execution

Anything else?

We tested a multipart S3 upload from the same pod to the same S3 server, and it works:

bash s3-multipart-test-v1.sh
🔧 Generating 10MB x 5 test file...
🪄 Starting multipart upload...
✅ UploadId: KzjbiVjXXjFGbGluzHGC4m8txmFUJC0SmJs2JtJDdWiFPHyTu9Lz6mFO6Q
⬆️ Uploading part 1 (part-aa)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 2 (part-ab)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 3 (part-ac)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 4 (part-ad)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
⬆️ Uploading part 5 (part-ae)... ETag: "f1c9645dbc14efddc7d8a322685f26eb"
🧾 Completing multipart upload...
{
    "Location": "https://xxxtest/multipart-test.bin",
    "Bucket": "xxx",
    "Key": "test/multipart-test.bin",
    "ETag": "\"b112a68f6cb4e22726d733bdaf03535a-5\""
}
✅ Multipart upload completed successfully!
🧹 Cleaning up temp files...
🎉 Done.
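
For reference, a sketch of that kind of manual multipart test (the endpoint, bucket, and key below are placeholders; the actual s3-multipart-test-v1.sh script differs in details):

#!/usr/bin/env bash
# Minimal multipart-upload smoke test against an S3-compatible endpoint.
# ENDPOINT, BUCKET and KEY are placeholders; substitute your own values.
set -euo pipefail

ENDPOINT="https://s3.example.internal:8082"
BUCKET="my-bucket"
KEY="test/multipart-test.bin"

# Generate a 50MB test file and split it into 10MB parts (part-aa ... part-ae).
dd if=/dev/urandom of=/tmp/multipart-test.bin bs=1M count=50
split -b 10M /tmp/multipart-test.bin /tmp/part-

# Start the multipart upload and remember its UploadId.
UPLOAD_ID=$(aws s3api create-multipart-upload \
  --endpoint-url "$ENDPOINT" --bucket "$BUCKET" --key "$KEY" \
  --query UploadId --output text)
echo "UploadId: $UPLOAD_ID"

# Upload every part and collect ETag/PartNumber pairs for the completion call.
PARTS=""
PART_NUM=1
for f in /tmp/part-*; do
  ETAG=$(aws s3api upload-part \
    --endpoint-url "$ENDPOINT" --bucket "$BUCKET" --key "$KEY" \
    --upload-id "$UPLOAD_ID" --part-number "$PART_NUM" --body "$f" \
    --query ETag --output text | tr -d '"')
  PARTS+="{\"ETag\":\"$ETAG\",\"PartNumber\":$PART_NUM},"
  PART_NUM=$((PART_NUM + 1))
done

# Complete the upload; this is the call that fails with NO_SUCH_UPLOAD in Milvus.
aws s3api complete-multipart-upload \
  --endpoint-url "$ENDPOINT" --bucket "$BUCKET" --key "$KEY" \
  --upload-id "$UPLOAD_ID" \
  --multipart-upload "{\"Parts\":[${PARTS%,}]}"

# Clean up temp files.
rm -f /tmp/multipart-test.bin /tmp/part-*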

dmitryzykov · Oct 14 '25 22:10

/assign @LoveEachDay
/unassign

yanliang567 · Oct 15 '25 01:10

/assign @shaoting-huang

yanliang567 · Oct 15 '25 03:10

Hi @dmitryzykov, thanks for the report!

Starting from v2.6.x, Milvus introduced multipart upload in the storage layer. This error message usually indicates that the multipart upload ID is invalid — it might have been aborted, completed, or cleaned up by the server.

Here are a few things you can double-check:

  1. Check if any network timeouts or retries happened during upload.
  2. Look in the logs for any message like: When aborting multiple part upload for key (see the sketch below).
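
A minimal sketch of such a log check, assuming kubectl access and using the crashing pod name from the listing above (--previous shows the logs of the crashed container):

# Scan the crashed streamingnode container's logs for multipart abort/complete messages.
kubectl logs milvus-cluster-streamingnode-6695d4cd6d-2gf5p --previous \
  | grep -iE "aborting multiple part upload|completing multiple part upload|NoSuchUpload"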

If possible, please share the full Milvus logs or any reproducible materials with us. If the logs are too large, you can contact us directly at [email protected].

shaoting-huang · Oct 21 '25 02:10

Hi @shaoting-huang, I provided the logs to your email.

dmitryzykov · Oct 21 '25 22:10

@shaoting-huang any progress on this?

xiaofan-luan · Oct 27 '25 11:10

Hi @dmitryzykov

I attempted to reproduce the issue but was unable to observe the same behavior.

Could you please confirm whether the issue is consistently reproducible? If so, kindly provide the SDK scripts you used so that we can reproduce the issue on our end.

Additionally, please check whether there are any potential network issues that might be contributing to the problem.
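
A simple probe for such network issues from inside the affected pod could look like this (a sketch; the endpoint below is a placeholder for the externalS3 host and port configured earlier):

# Repeatedly probe the external S3 endpoint to spot dropped connections or slow responses.
S3_ENDPOINT="https://s3.example.internal:8082"   # placeholder; substitute externalS3.host and port
for i in $(seq 1 20); do
  curl -sk -o /dev/null \
       -w "attempt $i: http=%{http_code} connect=%{time_connect}s total=%{time_total}s\n" \
       "$S3_ENDPOINT"
  sleep 1
done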

shaoting-huang · Oct 28 '25 03:10

Hi @shaoting-huang, we tested again with the latest 2.6.4. It works with MinIO and AWS S3 as external storage, but it does not work with NetApp StorageGRID 11.9.0.9.

Our steps to reproduce: on a completely fresh installation of Milvus we try to restore a backup, and the data node logs now show the same error:

[2025/11/05 23:19:46.367 +00:00] [INFO] [importv2/task_import.go:277] ["start to sync import data"] [taskID=461997475248416881] [jobID=461997475248214984] [collectionID=461997475248214953] [type=ImportTask]
[2025/11/05 23:19:46.367 +00:00] [INFO] [syncmgr/sync_manager.go:159] ["sync mgr sumbit task with key"] [key=461997475248416882]
[2025/11/05 23:19:46.367 +00:00] [INFO] [syncmgr/task.go:244] ["sync new split columns"] [segmentID=461997475248416882] [columnGroups="[\"[GroupID: 0, ColumnIndices: [0 3 4], Fields: [100 0 1]]\",\"[GroupID: 1, ColumnIndices: [2], Fields: [102]]\",\"[GroupID: 101, ColumnIndices: [1], Fields: [101]]\"]"]
[2025/11/05 23:19:46.486 +00:00] [INFO] [importv2/task_import.go:277] ["start to sync import data"] [taskID=461997475248416881] [jobID=461997475248214984] [collectionID=461997475248214953] [type=ImportTask]
[2025/11/05 23:19:46.486 +00:00] [INFO] [syncmgr/sync_manager.go:159] ["sync mgr sumbit task with key"] [key=461997475248416882]
[ERROR] 2025-11-05 23:19:46.673 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:46.673 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:46.673 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:47.675 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:47.675 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:47.675 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/api/token
[ERROR] 2025-11-05 23:19:48.676 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:48.676 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:48.676 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:49.677 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:49.678 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:49.678 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/meta-data/iam/security-credentials
[ERROR] 2025-11-05 23:19:50.679 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:50.680 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:50.680 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:51.681 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:51.681 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:51.681 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/meta-data/placement/availability-zone
[ERROR] 2025-11-05 23:19:52.683 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:52.683 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[WARN] 2025-11-05 23:19:52.683 EC2MetadataClient [140709629756992] Request failed, now waiting 0 ms before attempting again.
[ERROR] 2025-11-05 23:19:53.685 CurlHttpClient [140709629756992] Curl returned error code 28 - Timeout was reached
[ERROR] 2025-11-05 23:19:53.685 EC2MetadataClient [140709629756992] Http request to retrieve credentials failed
[ERROR] 2025-11-05 23:19:53.685 EC2MetadataClient [140709629756992] Can not retrieve resource from http://169.254.169.254/latest/meta-data/placement/availability-zone
I20251105 23:19:53.715765    33 scope_metric.cpp:49] [SERVER][~FuncScopeMetric][milvus][][CGO Call] slow function NewPackedWriterWithStorageConfig done with duration 7.997663294s
I20251105 23:19:53.716043    20 scope_metric.cpp:49] [SERVER][~FuncScopeMetric][milvus][][CGO Call] slow function NewPackedWriterWithStorageConfig done with duration 7.31485271s
I20251105 23:19:53.717676    34 scope_metric.cpp:49] [SERVER][~FuncScopeMetric][milvus][][CGO Call] slow function NewPackedWriterWithStorageConfig done with duration 8.059913749s
[ERROR] 2025-11-05 23:19:54.021 CurlHttpClient [140709638149696] Curl returned error code 56 - Failure when receiving data from the peer
[ERROR] 2025-11-05 23:19:54.022 AWSXmlClient [140709638149696] HTTP response code: -1
Resolved remote host IP address: 10.240.96.17
Request ID: 
Exception name: 
Error message: curlCode: 56, Failure when receiving data from the peer
7 response headers:
connection : CLOSE
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:53 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12953738
x-amz-request-id : 1762384738467820
x-ntap-sg-trace-id : ba4df73e38005d04
[WARN] 2025-11-05 23:19:54.022 AWSClient [140709638149696] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[WARN] 2025-11-05 23:19:54.022 AWSClient [140709638149696] Request failed, now waiting 200 ms before attempting again.
[WARN] 2025-11-05 23:19:54.237 AWSErrorMarshaller [140709638149696] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[WARN] 2025-11-05 23:19:54.237 AWSErrorMarshaller [140709638149696] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] 2025-11-05 23:19:54.237 AWSXmlClient [140709638149696] HTTP response code: 404
Resolved remote host IP address: 10.240.96.17
Request ID: 1762384737420907
Exception name: NoSuchUpload
Error message: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
8 response headers:
connection : close
content-length : 486
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12844467
x-amz-request-id : 1762384737420907
x-ntap-sg-trace-id : a7b5b7b3bc907437
[WARN] 2025-11-05 23:19:54.237 AWSClient [140709638149696] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
E20251105 23:19:54.237784    33 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214953/461997475248214954/461997475248416869/0/461997475267908837' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214953/461997475248214954/461997475248416869/0/461997475267908837' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
W20251105 23:19:54.238021    33 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
SIGABRT: abort
PC=0x7ff9f9a4b9fc m=15 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 460 gp=0xc000c06700 m=15 mp=0xc000fb6008 [syscall]:
[ERROR] 2025-11-05 23:19:54.242 CurlHttpClient [140709743027776] Curl returned error code 56 - Failure when receiving data from the peer
[ERROR] 2025-11-05 23:19:54.243 AWSXmlClient [140709743027776] HTTP response code: -1
Resolved remote host IP address: 10.240.96.17
Request ID: 
Exception name: 
Error message: curlCode: 56, Failure when receiving data from the peer
7 response headers:
connection : CLOSE
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12675774
x-amz-request-id : 1762384738477376
x-ntap-sg-trace-id : 17c8fc89dd48d427
[WARN] 2025-11-05 23:19:54.243 AWSClient [140709743027776] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[WARN] 2025-11-05 23:19:54.243 AWSClient [140709743027776] Request failed, now waiting 200 ms before attempting again.
[ERROR] 2025-11-05 23:19:54.418 CurlHttpClient [140709629756992] Curl returned error code 56 - Failure when receiving data from the peer
[ERROR] 2025-11-05 23:19:54.419 AWSXmlClient [140709629756992] HTTP response code: -1
Resolved remote host IP address: 10.240.96.17
Request ID: 
Exception name: 
Error message: curlCode: 56, Failure when receiving data from the peer
7 response headers:
connection : CLOSE
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:53 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12048421
x-amz-request-id : 1762384738444984
x-ntap-sg-trace-id : b339144a776e0936
[WARN] 2025-11-05 23:19:54.419 AWSClient [140709629756992] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[WARN] 2025-11-05 23:19:54.419 AWSClient [140709629756992] Request failed, now waiting 200 ms before attempting again.
[WARN] 2025-11-05 23:19:54.459 AWSErrorMarshaller [140709743027776] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[WARN] 2025-11-05 23:19:54.459 AWSErrorMarshaller [140709743027776] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] 2025-11-05 23:19:54.460 AWSXmlClient [140709743027776] HTTP response code: 404
Resolved remote host IP address: 10.240.96.17
Request ID: 1762384737434846
Exception name: NoSuchUpload
Error message: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
8 response headers:
connection : close
content-length : 486
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12748897
x-amz-request-id : 1762384737434846
x-ntap-sg-trace-id : cc307563793829e5
[WARN] 2025-11-05 23:19:54.460 AWSClient [140709743027776] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
E20251105 23:19:54.460337    20 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214953/461997475248214954/461997475248416882/0/461997475267912412' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
W20251105 23:19:54.460465    20 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
terminate called recursively
[WARN] 2025-11-05 23:19:54.627 AWSErrorMarshaller [140709629756992] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[WARN] 2025-11-05 23:19:54.628 AWSErrorMarshaller [140709629756992] Encountered AWSError 'NoSuchUpload': The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] 2025-11-05 23:19:54.628 AWSXmlClient [140709629756992] HTTP response code: 404
Resolved remote host IP address: 10.240.96.17
Request ID: 1762384737466228
Exception name: NoSuchUpload
Error message: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
8 response headers:
connection : close
content-length : 486
content-type : application/xml
date : Wed, 05 Nov 2025 23:19:54 GMT
server : StorageGRID/11.9.0.9
x-amz-id-2 : 12715199
x-amz-request-id : 1762384737466228
x-ntap-sg-trace-id : e09f67f5323eac6e
[WARN] 2025-11-05 23:19:54.628 AWSClient [140709629756992] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
E20251105 23:19:54.628978    34 io_util.h:23] [STORAGE][CloseFromDestructor][milvus] IOError: When destroying file of type N14milvus_storage18CustomOutputStreamE: When completing multiple part upload for key 'insert_log/461997475248214963/461997475248214964/461997475248416034/0/461997475267908048' in bucket 'pcz-dev-elevate-milvus-cluster': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed.
terminate called recursively
W20251105 23:19:54.629158    34 ExceptionTracer.cpp:187] Invalid trace stack for exception of type: std::runtime_error
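
One way to check whether the S3 server still knows about the in-progress uploads when CompleteMultipartUpload fails is to list them directly (a sketch; the endpoint is a placeholder, the bucket name is taken from the log above):

# List multipart uploads the server still considers in progress under the Milvus insert_log prefix.
ENDPOINT="https://s3.example.internal:8082"   # placeholder for the StorageGRID S3 endpoint
BUCKET="pcz-dev-elevate-milvus-cluster"

aws s3api list-multipart-uploads \
  --endpoint-url "$ENDPOINT" \
  --bucket "$BUCKET" \
  --prefix "insert_log/" \
  --query "Uploads[].{Key:Key,UploadId:UploadId,Initiated:Initiated}"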

dmitryzykov · Nov 06 '25 00:11

@dmitryzykov It seems to be a NetApp compatibility issue. Unfortunately we don't have a NetApp device and are not familiar with StorageGRID. Can you contact the NetApp support team to help improve the compatibility, or give us some advice?

It would be good if we could fix the issue with some config settings.

xiaofan-luan · Nov 06 '25 08:11

We switched to NetApp ONTAP and it works without issues.

We've opened a NetApp support ticket related to StorageGRID and referenced this issue.

dmitryzykov · Nov 15 '25 01:11