
[Bug]: [streaming] After modifying the pulsar configuration and restarting the pulsar broker and streamingnode, some data was lost

Open ThreadDao opened this issue 6 months ago • 4 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: master-20250515-1c794be1-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server

  • mq: pulsar
  • streamingNode: 2 * 4c16g
  • config
    common:
      enabledJSONKeyStats: true
    dataCoord:
      enableActiveStandby: true
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    mixCoord:
      enableActiveStandby: true
    queryCoord:
      enableActiveStandby: true
    queryNode:
      mmap:
        scalarField: true
        scalarIndex: true
        vectorField: true
        vectorIndex: true
    rootCoord:
      enableActiveStandby: true

client

  1. Create a collection with a partition key (a hedged pymilvus sketch reproducing this setup follows the step list below); the schema is:
{'auto_id': False,
 'description': '',
 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}},
            {'name': 'int64_1', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'json_1', 'description': '', 'type': <DataType.JSON: 23>}],
 'enable_dynamic_field': False} (base.py:329)
  2. Planned to insert 30 million entities, but the insertion got stuck because the backlog quota was full.
  3. Changed the pulsar config broker.backlogQuotaDefaultLimitGB from 8 to -1, then restarted the pulsar broker and the streamingNode.
  4. Insertion recovered successfully; continued inserting data until all 30M entities were inserted (see attached image).
  5. Flush -> index -> load. According to the loaded segment information, the data inserted before restarting the streamingNode was lost (see attached image).
  6. Concurrent upsert and search:
  • Upsert data is generated starting from pk 0
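For reference, a minimal pymilvus sketch reproducing the collection setup from step 1 (the uri, num_partitions, and insert batch contents are assumptions; the field definitions and the collection name come from this report):

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(uri="http://localhost:19530")  # placeholder uri

# Four fields, matching the schema dump in step 1: INT64 primary key,
# 128-dim float vector, an INT64 partition-key field, and a JSON field.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
    FieldSchema("int64_1", DataType.INT64, is_partition_key=True),
    FieldSchema("json_1", DataType.JSON),
]
schema = CollectionSchema(fields, enable_dynamic_field=False)

# num_partitions is an assumption; the report does not state it.
c = Collection("fouram_ENQJCF98", schema, num_partitions=16)

# Row-based insert in batches of 200, mirroring the batch size seen later in
# the client log; repeat until the 30M target is reached (flush/index/load
# happen afterwards, as in step 5).
batch = [{"id": i, "float_vector": [0.0] * 128, "int64_1": i % 64,
          "json_1": {"id": i}} for i in range(200)]
c.insert(batch)
```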

results

  • Total 217056 entities were lost
c.name
'fouram_ENQJCF98'
c.query('id >= 14450000', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 15550000}"]
c.query('id < 14450000', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 14232944}"]
c.query('', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 29782944}"]
c.query('14400000 == id ', output_fields=["count(*)"], consistency_level="Strong")
data: ["{'count(*)': 0}"]

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/e286881d-159f-4fb8-9b05-36e4344f4d1a?nodeId=zong-sn-par-key-4k5vd-3640764632

Milvus Log

pods:

sn-parkey-pulsar-op-25-1939-etcd-0                                1/1     Running     0               24h     10.104.18.230   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-etcd-1                                1/1     Running     0               24h     10.104.19.38    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-etcd-2                                1/1     Running     0               24h     10.104.24.254   4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-78gdb      1/1     Running     0               24h     10.104.30.80    4am-node38   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-8z5gj      1/1     Running     0               24h     10.104.18.248   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-datanode-6888674d45-zf5lx      1/1     Running     0               24h     10.104.21.239   4am-node24   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-mixcoord-59675c8c6-m69fj       1/1     Running     0               24h     10.104.18.247   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-proxy-845fdc5fff-6jzbn         1/1     Running     0               24h     10.104.18.249   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-querynode-0-8b9c944fb-2crl4    1/1     Running     0               24h     10.104.18.251   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-querynode-0-8b9c944fb-kw97f    1/1     Running     0               24h     10.104.21.241   4am-node24   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-streamingnode-79b98f9f6nlcbv   1/1     Running     0               3h26m   10.104.13.80    4am-node16   <none>           <none>
sn-parkey-pulsar-op-25-1939-milvus-streamingnode-79b98f9f6xvqmx   1/1     Running     0               3h25m   10.104.6.59     4am-node13   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-0                               1/1     Running     0               24h     10.104.18.234   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-1                               1/1     Running     0               24h     10.104.24.6     4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-2                               1/1     Running     0               24h     10.104.19.40    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-minio-3                               1/1     Running     0               24h     10.104.17.64    4am-node23   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-0                       1/1     Running     0               24h     10.104.19.44    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-1                       1/1     Running     0               24h     10.104.18.238   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-bookie-2                       1/1     Running     0               24h     10.104.17.68    4am-node23   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-broker-0                       1/1     Running     0               3h27m   10.104.19.231   4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-broker-1                       1/1     Running     0               3h28m   10.104.13.79    4am-node16   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-proxy-0                        1/1     Running     0               24h     10.104.18.223   4am-node25   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-proxy-1                        1/1     Running     0               24h     10.104.19.31    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-recovery-0                     1/1     Running     0               24h     10.104.24.252   4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-0                    1/1     Running     0               24h     10.104.19.41    4am-node28   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-1                    1/1     Running     0               24h     10.104.24.10    4am-node29   <none>           <none>
sn-parkey-pulsar-op-25-1939-pulsar-zookeeper-2                    1/1     Running     0               24h     10.104.18.237   4am-node25   <none>           <none>

Anything else?

client debug log path: /test/fouram/log/2025_05_15/zong-sn-par-key-4k5vd-2_40094/fouram_log.debug

  • last successful upsert request log:
566550 [2025-05-16 09:36:13,695 - DEBUG - fouram]: (api_request)  : [Collection.upsert] args: <Collection.upsert fields: 4, length: 200, content: [ [ `type<class 'int'>, dtype<>` 2903723 ... ], [ `type<class 'list'>, dtype<>` [0.9613462011029956, 0.6 ... ], [ `type<class 'int'>, dtype<>` 2903723 ... ], [ `type<class 'dict'>, dtype<>` {'id': 2903723} ... ] ]>, [None, 120], kwargs: {'client_request_id': '56f9ad0be57f4bbaad3bf92f63ca78e5'}, [requestId: 56f9ad0be57f4bbaad3bf92f63ca78e5] (api_request.py:83)
...
567588 [2025-05-16 09:37:07,005 - DEBUG - fouram]: (api_response) : [Collection.upsert] (insert count: 200, delete count: 200, upsert count: 200, timestamp: 458067325378887707, success count: 200, err count: 0, [requestId: 56f9ad0be57f4bbaad3bf92f63ca78e5] (api_request.py:45)

ThreadDao · May 16 '25 10:05

Checked the logs from recovery storage and the flowgraph: all data has already been persisted into the WAL.

> grep 'insert entity' pulsar_lost.log | awk '{match($0, "rows=([0-9]+)", a); b+=a[1]}END{print b}'

30000000
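
For reference, a Python equivalent of that grep | gawk pipeline (it assumes the same log file and the "insert entity" / `rows=` format shown above):

```python
import re
from pathlib import Path

# Sum the rows=N field of every "insert entity" line, mirroring the
# grep | awk one-liner above.
total = 0
for line in Path("pulsar_lost.log").read_text().splitlines():
    if "insert entity" in line:
        m = re.search(r"rows=(\d+)", line)
        if m:
            total += int(m.group(1))

print(total)  # 30000000 here, i.e. every inserted row reached the WAL
```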

So the data may have been lost during flusher recovery or compaction?

chyezh · May 19 '25 06:05

Our Pulsar retention duration is 12h. The tail of the WAL written before the Pulsar restart was deleted by Pulsar, so after restarting the streamingnode and the pulsar broker, that data is lost forever.

We need to change some Pulsar configuration and use the truncate API to make a strong guarantee about Pulsar data retention. Still working on it.
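
For context, Pulsar exposes explicit topic truncation through its admin tooling (`pulsar-admin topics truncate`). The sketch below only illustrates that mechanism; the topic name is a placeholder, and in Milvus the WAL truncation is driven internally by the streaming service rather than by hand:

```python
import subprocess

# Placeholder topic name; the real WAL topics are managed by Milvus itself.
topic = "persistent://public/default/example-wal-channel"

# Truncating a topic roughly moves all cursors to the end and deletes the old
# ledgers, so the writer decides when WAL data may be dropped instead of
# relying only on time- or size-based retention policies.
subprocess.run(["pulsar-admin", "topics", "truncate", topic], check=True)
```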

chyezh · May 19 '25 09:05

PR #42653 provides two solutions for users; see the description of backlogAutoClearBytes. However, it only protects the data path of Milvus and does not make a guarantee on the query path. We will implement the strong guarantee on the query path via the query view, so this issue can be closed now.

/assign @ThreadDao /unassign

chyezh · Jun 16 '25 07:06

Verify the milvus protection without pulsar retention

pulsar:
backlogQuotaDefaultLimitBytes=-1
[backlogQuotaDefaultLimitGB](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=backlogquotadefaultlimitgb)=-1
[subscriptionExpirationTimeMinutes](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=subscriptionexpirationtimeminutes)=0
[defaultRetentionTimeInMinutes](https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=defaultretentiontimeinminutes)=30
milvus:
pulsar.backlogAutoClearBytes=0
streaming.walTruncate.sampleInterval=1m
streaming.walTruncate.retentionInterval=10m

Write some test data into the WAL without flushing, stop the milvus cluster for 1h, then restart it; the data should not be lost on the data side.
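
A hedged pymilvus sketch of that verification (collection name, uri, dimension, and row count are placeholders; the cluster stop/restart happens outside the script, so in practice the two phases run as separate scripts):

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(uri="http://localhost:19530")  # placeholder uri

# Phase 1: write data that is deliberately NOT flushed, so before the restart
# it exists only in the WAL / growing segments.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=8),
]
c = Collection("wal_truncate_check", CollectionSchema(fields))
c.insert([{"id": i, "vec": [0.1] * 8} for i in range(10_000)])

# --- stop the whole Milvus cluster for > 1h here, then start it again ---

# Phase 2 (after recovery): the unflushed rows must still be countable.
c.create_index("vec", {"index_type": "FLAT", "metric_type": "L2"})
c.load()
res = c.query("", output_fields=["count(*)"], consistency_level="Strong")
assert res[0]["count(*)"] == 10_000, res
```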

chyezh · Jun 16 '25 08:06

verified and passed

  • image: master-20250630-ecb24e72-amd64
  • pulsar config (screenshot attached)
  • milvus config (screenshot attached). The streamingNode was stopped for more than one hour, and the data was not lost after recovery (see metrics screenshot).

ThreadDao · Jul 02 '25 03:07