[Bug]: [streaming] After modifying the pulsar configuration and restarting the pulsar broker and streamingnode, some data was lost
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: master-20250515-1c794be1-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
server
- mq: pulsar
- streamingNode: 2 * 4c16g
- config
common:
  enabledJSONKeyStats: true
dataCoord:
  enableActiveStandby: true
indexCoord:
  enableActiveStandby: true
log:
  level: debug
mixCoord:
  enableActiveStandby: true
queryCoord:
  enableActiveStandby: true
queryNode:
  mmap:
    scalarField: true
    scalarIndex: true
    vectorField: true
    vectorIndex: true
rootCoord:
  enableActiveStandby: true
client
- create a collection with a partition key (a pymilvus sketch follows this list); the schema is:
{'auto_id': False,
'description': '',
'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}},
{'name': 'int64_1', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'json_1', 'description': '', 'type': <DataType.JSON: 23>}],
'enable_dynamic_field': False} (base.py:329)
- Planned to insert 30 million entities, but the insertion got stuck because the pulsar backlog quota was full.
- Updated the pulsar config broker.backlogQuotaDefaultLimitGB from 8 to -1, then restarted the pulsar broker and streamingNode
- Insertion recovered successfully; continued inserting data until 30M entities were inserted
- Flush -> index -> load. According to the loaded segment information, the data inserted before restarting the streamingNode was lost
- concurrent upsert and search
- Generate data upserts starting from pk 0
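For reference, a minimal pymilvus sketch that recreates this schema (host, port, and collection name are placeholders, not the values used in the test):

```python
# Sketch only: assumes the pymilvus 2.x ORM API; connection details and collection name are placeholders.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="127.0.0.1", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=128),
    FieldSchema(name="int64_1", dtype=DataType.INT64, is_partition_key=True),  # partition-key field
    FieldSchema(name="json_1", dtype=DataType.JSON),
]
schema = CollectionSchema(fields=fields, enable_dynamic_field=False)
collection = Collection(name="partition_key_demo", schema=schema)
```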
results
- A total of 217,056 entities were lost (30,000,000 inserted minus the 29,782,944 counted)
c.name
'fouram_ENQJCF98'
c.query('id >= 14450000', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 15550000}"]
c.query('id < 14450000', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 14232944}"]
c.query('', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 29782944}"]
c.query('14400000 == id ', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 0}"]
Expected Behavior
No response
Steps To Reproduce
https://argo-workflows.zilliz.cc/archived-workflows/qa/e286881d-159f-4fb8-9b05-36e4344f4d1a?nodeId=zong-sn-par-key-4k5vd-3640764632
Milvus Log
pods:
sn-parkey-pulsar-op-25-1939-etcd-0 1/1 Running 0 24h 10.104.18.230 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-etcd-1 1/1 Running 0 24h 10.104.19.38 4am-node28 <none> <none>
sn-parkey-pulsar-op-25-1939-etcd-2 1/1 Running 0 24h 10.104.24.254 4am-node29 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-78gdb 1/1 Running 0 24h 10.104.30.80 4am-node38 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-8z5gj 1/1 Running 0 24h 10.104.18.248 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-zf5lx 1/1 Running 0 24h 10.104.21.239 4am-node24 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-mixcoord-59675c8c6-m69fj 1/1 Running 0 24h 10.104.18.247 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-proxy-845fdc5fff-6jzbn 1/1 Running 0 24h 10.104.18.249 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-querynode-0-8b9c944fb-2crl4 1/1 Running 0 24h 10.104.18.251 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-querynode-0-8b9c944fb-kw97f 1/1 Running 0 24h 10.104.21.241 4am-node24 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-streamingnode-79b98f9f6nlcbv 1/1 Running 0 3h26m 10.104.13.80 4am-node16 <none> <none>
sn-parkey-pulsar-op-25-1939-milvus-streamingnode-79b98f9f6xvqmx 1/1 Running 0 3h25m 10.104.6.59 4am-node13 <none> <none>
sn-parkey-pulsar-op-25-1939-minio-0 1/1 Running 0 24h 10.104.18.234 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-minio-1 1/1 Running 0 24h 10.104.24.6 4am-node29 <none> <none>
sn-parkey-pulsar-op-25-1939-minio-2 1/1 Running 0 24h 10.104.19.40 4am-node28 <none> <none>
sn-parkey-pulsar-op-25-1939-minio-3 1/1 Running 0 24h 10.104.17.64 4am-node23 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-0 1/1 Running 0 24h 10.104.19.44 4am-node28 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-1 1/1 Running 0 24h 10.104.18.238 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-2 1/1 Running 0 24h 10.104.17.68 4am-node23 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-broker-0 1/1 Running 0 3h27m 10.104.19.231 4am-node28 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-broker-1 1/1 Running 0 3h28m 10.104.13.79 4am-node16 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-proxy-0 1/1 Running 0 24h 10.104.18.223 4am-node25 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-proxy-1 1/1 Running 0 24h 10.104.19.31 4am-node28 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-recovery-0 1/1 Running 0 24h 10.104.24.252 4am-node29 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-0 1/1 Running 0 24h 10.104.19.41 4am-node28 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-1 1/1 Running 0 24h 10.104.24.10 4am-node29 <none> <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-2 1/1 Running 0 24h 10.104.18.237 4am-node25 <none> <none>
Anything else?
client debug log path: /test/fouram/log/2025_05_15/zong-sn-par-key-4k5vd-2_40094/fouram_log.debug
- last successful upsert request log:
566550 [2025-05-16 09:36:13,695 - DEBUG - fouram]: (api_request) : [Collection.upsert] args: <Collection.upsert fields: 4, length: 200, content: [ [ `type<class 'int'>, dtype<>` 2903723 ... ], [ `type<class 'list'>, dtype<>` [0.9613462011029956, 0.6 ... ], [ `type<class 'int'>, dtype<>` 2903723 ... ], [ `type<class 'dict'>, dtype<>` {'id': 2903723} ... ] ]>, [None, 120], kwargs: {'client_request_id': '56f9ad0be57f4bbaad3bf92f63ca78e5'}, [requestId: 56f9ad0be57f4bbaad3bf92f63ca78e5] (api_request.py:83)
...
567588 [2025-05-16 09:37:07,005 - DEBUG - fouram]: (api_response) : [Collection.upsert] (insert count: 200, delete count: 200, upsert count: 200, timestamp: 458067325378887707, success count: 200, err count: 0, [requestId: 56f9ad0be57f4bbaad3bf92f63ca78e5] (api_request.py:45)
Checked the logs from recovery storage and the flowgraph: all data had already been persisted into the WAL.
> grep 'insert entity' pulsar_lost.log | awk '{match($0, "rows=([0-9]+)", a); b+=a[1]}END{print b}'
30000000
The data may have been lost during flusher recovery or compaction?
Our pulsar retention duration is 12h. The tailing data of the WAL from before the restart was deleted by pulsar, so after restarting the streamingnode and pulsar broker, that data is lost forever.
We need to change some pulsar configuration and use the truncate API to make a strong guarantee for pulsar data retention. Still working on it.
PR #42653 provides two solutions for users; see the description of backlogAutoClearBytes.
However, it only protects the data path of Milvus and does not make a guarantee on the query path.
We will implement the strong guarantee on the query path via the query view.
So the issue can be closed now.
/assign @ThreadDao /unassign
Verify the Milvus protection without pulsar retention, using the configs below (a combined sketch of the Milvus-side keys follows the list):
pulsar:
backlogQuotaDefaultLimitBytes=-1
[backlogQuotaDefaultLimitGB](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=backlogquotadefaultlimitgb)=-1
[subscriptionExpirationTimeMinutes](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=subscriptionexpirationtimeminutes)=0
[defaultRetentionTimeInMinutes](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=defaultretentiontimeinminutes)=30
milvus:
pulsar.backlogAutoClearBytes=0
streaming.walTruncate.sampleInterval=1m
streaming.walTruncate.retentionInterval=10m
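For clarity, a sketch of how the Milvus-side keys above could be written in milvus.yaml form; the nesting is inferred from the dotted key names, and the four pulsar keys remain broker-side settings (broker.conf / broker helm values), not milvus.yaml entries:

```yaml
# Sketch only: nesting inferred from the dotted keys listed above.
pulsar:
  backlogAutoClearBytes: 0        # value used in this verification run (see PR #42653)
streaming:
  walTruncate:
    sampleInterval: 1m            # value used in this verification run
    retentionInterval: 10m        # value used in this verification run
```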
Test case: write some data into the WAL without flushing, stop the Milvus cluster for 1h, then restart it; the data should not be lost on the data side. A sketch of this check follows.
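A minimal sketch of such a check (collection name, schema, and connection details are placeholders; stopping the cluster for ~1h and restarting it happens out of band between the two phases):

```python
# Sketch: write into the WAL without flushing, then verify the count after a restart.
# Assumes a collection with an int64 primary key "id" and a 128-dim "float_vector" field.
import random
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
c = Collection("wal_truncate_check")

BATCH, BATCHES = 1000, 100
for i in range(BATCHES):
    ids = list(range(i * BATCH, (i + 1) * BATCH))
    vecs = [[random.random() for _ in range(128)] for _ in ids]
    c.insert([ids, vecs])
# Intentionally no c.flush(): the data lives only in the WAL / growing segments.

# ---- stop the Milvus cluster here, wait ~1h, restart it, then reconnect and continue ----

c.create_index("float_vector", {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}})
c.load()
res = c.query("", output_fields=["count(*)"], consistency_level="Strong")
print(res)  # expected: count(*) == BATCH * BATCHES, i.e. nothing lost on the data side
```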
verified and passed
- image: master-20250630-ecb24e72-amd64
- pulsar config
- milvus config
The streamingNode was stopped for more than one hour, and the data was not lost after recovery (see metrics).