
[Bug]: [streaming] After modifying the pulsar configuration and restarting the pulsar broker and streamingnode, some data was lost

Open ThreadDao opened this issue 6 months ago • 4 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: master-20250515-1c794be1-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server

  • mq: pulsar
  • streamingNode: 2 * 4c16g
  • config
    common:
      enabledJSONKeyStats: true
    dataCoord:
      enableActiveStandby: true
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    mixCoord:
      enableActiveStandby: true
    queryCoord:
      enableActiveStandby: true
    queryNode:
      mmap:
        scalarField: true
        scalarIndex: true
        vectorField: true
        vectorIndex: true
    rootCoord:
      enableActiveStandby: true

client

  1. Create a collection with a partition key (a hedged pymilvus sketch reproducing this setup follows the step list below); the schema is:
{'auto_id': False,
 'description': '',
 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}},
            {'name': 'int64_1', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'json_1', 'description': '', 'type': <DataType.JSON: 23>}],
 'enable_dynamic_field': False} (base.py:329)
  2. Planned to insert 30 million entities, but the insertion got stuck because the backlog quota was full.
  3. Changed the pulsar config broker.backlogQuotaDefaultLimitGB from 8 to -1, then restarted the pulsar broker and the streamingNode.
  4. Insertion recovered successfully; continued inserting data until all 30M entities were inserted (see attached image).
  5. Flush -> index -> load. According to the loaded segment information, the data inserted before restarting the streamingNode was lost (see attached image).
  6. Concurrent upsert and search:
  • Upsert data is generated starting from pk 0
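For reference, a minimal pymilvus sketch reproducing the collection setup from step 1 (the uri, num_partitions, and insert batch contents are assumptions; the field definitions and the collection name come from this report):

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(uri="http://localhost:19530")  # placeholder uri

# Four fields, matching the schema dump in step 1: INT64 primary key,
# 128-dim float vector, an INT64 partition-key field, and a JSON field.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
    FieldSchema("int64_1", DataType.INT64, is_partition_key=True),
    FieldSchema("json_1", DataType.JSON),
]
schema = CollectionSchema(fields, enable_dynamic_field=False)

# num_partitions is an assumption; the report does not state it.
c = Collection("fouram_ENQJCF98", schema, num_partitions=16)

# Row-based insert in batches of 200, mirroring the batch size seen later in
# the client log; repeat until the 30M target is reached (flush/index/load
# happen afterwards, as in step 5).
batch = [{"id": i, "float_vector": [0.0] * 128, "int64_1": i % 64,
          "json_1": {"id": i}} for i in range(200)]
c.insert(batch)
```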

results

  • Total 217056 entities were lost
c.name
'fouram_ENQJCF98'
c.query('id >= 14450000', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 15550000}"]
c.query('id < 14450000', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 14232944}"]
c.query('', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 29782944}"]
c.query('14400000 == id ', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 0}"]

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/e286881d-159f-4fb8-9b05-36e4344f4d1a?nodeId=zong-sn-par-key-4k5vd-3640764632

Milvus Log

pods:

sn-parkey-pulsar-op-25-1939-etcd-0                                1/1     Running     0               24h     10.104.18.230   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-etcd-1                                1/1     Running     0               24h     10.104.19.38    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-etcd-2                                1/1     Running     0               24h     10.104.24.254   4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-78gdb      1/1     Running     0               24h     10.104.30.80    4am-node38   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-8z5gj      1/1     Running     0               24h     10.104.18.248   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-zf5lx      1/1     Running     0               24h     10.104.21.239   4am-node24   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-mixcoord-59675c8c6-m69fj       1/1     Running     0               24h     10.104.18.247   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-proxy-845fdc5fff-6jzbn         1/1     Running     0               24h     10.104.18.249   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-querynode-0-8b9c944fb-2crl4    1/1     Running     0               24h     10.104.18.251   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-querynode-0-8b9c944fb-kw97f    1/1     Running     0               24h     10.104.21.241   4am-node24   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-streamingnode-79b98f9f6nlcbv   1/1     Running     0               3h26m   10.104.13.80    4am-node16   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-streamingnode-79b98f9f6xvqmx   1/1     Running     0               3h25m   10.104.6.59     4am-node13   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-0                               1/1     Running     0               24h     10.104.18.234   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-1                               1/1     Running     0               24h     10.104.24.6     4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-2                               1/1     Running     0               24h     10.104.19.40    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-3                               1/1     Running     0               24h     10.104.17.64    4am-node23   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-0                       1/1     Running     0               24h     10.104.19.44    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-1                       1/1     Running     0               24h     10.104.18.238   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-2                       1/1     Running     0               24h     10.104.17.68    4am-node23   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-broker-0                       1/1     Running     0               3h27m   10.104.19.231   4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-broker-1                       1/1     Running     0               3h28m   10.104.13.79    4am-node16   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-proxy-0                        1/1     Running     0               24h     10.104.18.223   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-proxy-1                        1/1     Running     0               24h     10.104.19.31    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-recovery-0                     1/1     Running     0               24h     10.104.24.252   4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-0                    1/1     Running     0               24h     10.104.19.41    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-1                    1/1     Running     0               24h     10.104.24.10    4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-2                    1/1     Running     0               24h     10.104.18.237   4am-node25   <none>           <none>

Anything else?

client debug log path: /test/fouram/log/2025_05_15/zong-sn-par-key-4k5vd-2_40094/fouram_log.debug

  • last successful upsert request log:
566550 [2025-05-16 09:36:13,695 - DEBUG - fouram]: (api_request)  : [Collection.upsert] args: <Collection.upsert fields: 4, length: 200, content: [ [ `type<class 'int'>, dtype<>` 2903723 ... ], [ `type<class 'list'>, dtype<>` [0.9613462011029956, 0.6 ... ], [ `type<class 'int'>, dtype<>` 2903723 ... ], [ `type<class 'dict'>, dtype<>` {'id': 2903723} ... ] ]>, [None, 120], kwargs: {'client_request_id': '56f9ad0be57f4bbaad3bf92f63ca78e5'}, [requestId: 56f9ad0be57f4bbaad3bf92f63ca78e5] (api_request.py:83)
...
567588 [2025-05-16 09:37:07,005 - DEBUG - fouram]: (api_response) : [Collection.upsert] (insert count: 200, delete count: 200, upsert count: 200, timestamp: 458067325378887707, success count: 200, err count: 0, [requestId: 56f9ad0be57f4bbaad3bf92f63ca78e5] (api_request.py:45)

ThreadDao · May 16 '25 10:05

Checked the logs from recovery storage and the flowgraph: all data has already been persisted into the WAL.

> grep 'insert entity' pulsar_lost.log | awk '{match($0, "rows=([0-9]+)", a); b+=a[1]}END{print b}'

30000000
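
For reference, a Python equivalent of that grep | gawk pipeline (it assumes the same log file and the "insert entity" / `rows=` format shown above):

```python
import re
from pathlib import Path

# Sum the rows=N field of every "insert entity" line, mirroring the
# grep | awk one-liner above.
total = 0
for line in Path("pulsar_lost.log").read_text().splitlines():
    if "insert entity" in line:
        m = re.search(r"rows=(\d+)", line)
        if m:
            total += int(m.group(1))

print(total)  # 30000000 here, i.e. every inserted row reached the WAL
```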

So the data may have been lost during flusher recovery or compaction?

chyezh · May 19 '25 06:05

Our Pulsar retention duration is 12h. The tail of the WAL written before the Pulsar restart was deleted by Pulsar, so after restarting the streamingnode and the pulsar broker, that data is lost forever.

We need to change some Pulsar configuration and use the truncate API to make a strong guarantee about Pulsar data retention. Still working on it.
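
For context, Pulsar exposes explicit topic truncation through its admin tooling (`pulsar-admin topics truncate`). The sketch below only illustrates that mechanism; the topic name is a placeholder, and in Milvus the WAL truncation is driven internally by the streaming service rather than by hand:

```python
import subprocess

# Placeholder topic name; the real WAL topics are managed by Milvus itself.
topic = "persistent://public/default/example-wal-channel"

# Truncating a topic roughly moves all cursors to the end and deletes the old
# ledgers, so the writer decides when WAL data may be dropped instead of
# relying only on time- or size-based retention policies.
subprocess.run(["pulsar-admin", "topics", "truncate", topic], check=True)
```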

chyezh · May 19 '25 09:05

PR #42653 provides two solutions for users; see the description of backlogAutoClearBytes. However, it only protects the data path of Milvus and does not make a guarantee on the query path. We will implement the strong guarantee on the query path via the query view, so this issue can be closed now.

/assign @ThreadDao /unassign

chyezh · Jun 16 '25 07:06

Verify the milvus protection without pulsar retention

pulsar:
backlogQuotaDefaultLimitBytes=-1
[backlogQuotaDefaultLimitGB](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=backlogquotadefaultlimitgb)=-1
[subscriptionExpirationTimeMinutes](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=subscriptionexpirationtimeminutes)=0
[defaultRetentionTimeInMinutes](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=defaultretentiontimeinminutes)=30
milvus:
pulsar.backlogAutoClearBytes=0
streaming.walTruncate.sampleInterval=1m
streaming.walTruncate.retentionInterval=10m

Write some test data into the WAL without flushing, stop the milvus cluster for 1h, then restart it; the data should not be lost on the data side.
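
A hedged pymilvus sketch of that verification (collection name, uri, dimension, and row count are placeholders; the cluster stop/restart happens outside the script, so in practice the two phases run as separate scripts):

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(uri="http://localhost:19530")  # placeholder uri

# Phase 1: write data that is deliberately NOT flushed, so before the restart
# it exists only in the WAL / growing segments.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=8),
]
c = Collection("wal_truncate_check", CollectionSchema(fields))
c.insert([{"id": i, "vec": [0.1] * 8} for i in range(10_000)])

# --- stop the whole Milvus cluster for > 1h here, then start it again ---

# Phase 2 (after recovery): the unflushed rows must still be countable.
c.create_index("vec", {"index_type": "FLAT", "metric_type": "L2"})
c.load()
res = c.query("", output_fields=["count(*)"], consistency_level="Strong")
assert res[0]["count(*)"] == 10_000, res
```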

chyezh · Jun 16 '25 08:06

verified and passed

  • image: master-20250630-ecb24e72-amd64
  • pulsar config (screenshot attached)
  • milvus config (screenshot attached). The streamingNode was stopped for more than one hour, and the data was not lost after recovery (see metrics screenshot).

ThreadDao · Jul 02 '25 03:07