[Bug]: DataNode keeps restarting due to error: failed to serialize merged stats log: shall not serialize zero length statslog list
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: cardinal-milvus-io-2.4-9a07c1bca9-20240929
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
server
A Milvus cluster that has been running for a long time, with 2 dataNodes and 4 queryNodes
test steps
- collection: laion_stable_9 has 100M 768-dim entities (a pymilvus sketch of this schema follows the list below); schema:
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}], 'enable_dynamic_field': True}
- In the previous test, many L0 segments (1.8k) were left behind and there was no time to perform L0 compaction. In addition, queryNode memory was tight.
- concurrent requests: Flush + load + search + query + upsert (a rough sketch of this workload follows the links below)
- dataNode keeps restarting with the error above
- L0 compaction no longer seems to be triggered
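For reference, the schema dump above corresponds roughly to the following pymilvus definition. This is a minimal sketch: the connection endpoint, shard/partition counts, and index parameters are assumptions, since they are not recorded in this issue.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=768),
    FieldSchema("int64_pk_5b", DataType.INT64, is_partition_key=True),
    FieldSchema("varchar_caption", DataType.VARCHAR, max_length=8192),
    FieldSchema("varchar_NSFW", DataType.VARCHAR, max_length=8192),
    FieldSchema("float64_similarity", DataType.FLOAT),
    FieldSchema("int64_width", DataType.INT64),
    FieldSchema("int64_height", DataType.INT64),
    FieldSchema("int64_original_width", DataType.INT64),
    FieldSchema("int64_original_height", DataType.INT64),
    FieldSchema("varchar_md5", DataType.VARCHAR, max_length=8192),
]
schema = CollectionSchema(fields, description="", enable_dynamic_field=True)

# Shard and partition-key bucket counts are not in the issue; defaults assumed.
collection = Collection("laion_stable_9", schema)
```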
links
- argo: workflow
- metrics: metrics
- Loki logs: Loki logs
- pod restart reason: container restart reason
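Roughly, the concurrent workload from the test steps looks like the sketch below. The thread layout, search/query parameters, and upsert payloads are illustrative assumptions; the actual harness is the argo workflow linked above.

```python
import random
import threading

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint
collection = Collection("laion_stable_9")
DIM = 768


def random_rows(n):
    # Rows carry every non-auto field of the schema above (values are dummies).
    return [{
        "id": random.randint(0, 100_000_000),
        "float_vector": [random.random() for _ in range(DIM)],
        "int64_pk_5b": random.randint(0, 63),
        "varchar_caption": "caption",
        "varchar_NSFW": "UNLIKELY",
        "float64_similarity": random.random(),
        "int64_width": 512,
        "int64_height": 512,
        "int64_original_width": 512,
        "int64_original_height": 512,
        "varchar_md5": "0" * 32,
    } for _ in range(n)]


def do_flush():
    collection.flush()


def do_load():
    collection.load()


def do_search():
    collection.search(
        data=[[random.random() for _ in range(DIM)]],
        anns_field="float_vector",
        param={"metric_type": "L2", "params": {}},  # metric/index params assumed
        limit=10,
    )


def do_query():
    collection.query(expr="id in [1, 2, 3]", output_fields=["id"])


def do_upsert():
    collection.upsert(random_rows(100))


# Fire the five request types concurrently, as in the test.
threads = [threading.Thread(target=op)
           for op in (do_flush, do_load, do_search, do_query, do_upsert)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```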
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
pods:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
laion1b-test-2-etcd-0 1/1 Running 1 (4d23h ago) 38d 10.104.25.207 4am-node30 <none> <none>
laion1b-test-2-etcd-1 1/1 Running 0 97d 10.104.30.186 4am-node38 <none> <none>
laion1b-test-2-etcd-2 1/1 Running 0 299d 10.104.34.225 4am-node37 <none> <none>
laion1b-test-2-milvus-datanode-7b8f94796b-9wb45 1/1 Running 95 (11h ago) 10d 10.104.1.226 4am-node10 <none> <none>
laion1b-test-2-milvus-datanode-7b8f94796b-hrvq2 1/1 Running 54 (4d12h ago) 10d 10.104.20.103 4am-node22 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-cfgkh 1/1 Running 2 (8d ago) 10d 10.104.19.46 4am-node28 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-m7xcf 1/1 Running 0 10d 10.104.30.115 4am-node38 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-mf42d 1/1 Running 0 10d 10.104.32.231 4am-node39 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-q88hf 1/1 Running 0 10d 10.104.16.70 4am-node21 <none> <none>
laion1b-test-2-milvus-mixcoord-b484b7777-ggxcn 1/1 Running 1 (8d ago) 10d 10.104.30.114 4am-node38 <none> <none>
laion1b-test-2-milvus-proxy-787965c494-kzlrp 1/1 Running 0 10d 10.104.32.230 4am-node39 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-24wr8 1/1 Running 3 (5d15h ago) 10d 10.104.26.61 4am-node32 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-5wp66 1/1 Running 2 (4d12h ago) 10d 10.104.24.214 4am-node29 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-8gztc 1/1 Running 2 (4d23h ago) 10d 10.104.15.164 4am-node20 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-g8lsb 1/1 Running 3 (4d23h ago) 10d 10.104.27.130 4am-node31 <none> <none>
laion1b-test-2-pulsar-bookie-0 1/1 Running 0 299d 10.104.33.107 4am-node36 <none> <none>
laion1b-test-2-pulsar-bookie-1 1/1 Running 0 102d 10.104.18.97 4am-node25 <none> <none>
laion1b-test-2-pulsar-bookie-2 1/1 Running 0 38d 10.104.25.206 4am-node30 <none> <none>
laion1b-test-2-pulsar-broker-0 1/1 Running 1 (171d ago) 180d 10.104.1.147 4am-node10 <none> <none>
laion1b-test-2-pulsar-proxy-0 1/1 Running 0 168d 10.104.32.209 4am-node39 <none> <none>
laion1b-test-2-pulsar-recovery-0 1/1 Running 1 (168d ago) 200d 10.104.31.87 4am-node34 <none> <none>
laion1b-test-2-pulsar-zookeeper-0 1/1 Running 0 299d 10.104.29.87 4am-node35 <none> <none>
laion1b-test-2-pulsar-zookeeper-1 1/1 Running 0 180d 10.104.21.196 4am-node24 <none> <none>
laion1b-test-2-pulsar-zookeeper-2 1/1 Running 0 299d 10.104.34.229 4am-node37 <none> <none>
Anything else?
No response
/assign @XuanYang-cn /unassign
Some segments stay in the "Sealed" state, but the checkpoint has already been advanced past them, and DataNode fails to flush those sealed segments and keeps panicking.
After rebooting DataCoord, those sealed segments became flushed, and there has been no panic since.
It is hard to find the root cause.
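For the record, a minimal client-side check of the workaround (endpoint and timeout values are assumptions; this only observes flush and compaction from the SDK, not the internal L0 compaction path):

```python
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint
collection = Collection("laion_stable_9")

# While DataNode is stuck in the panic loop this flush stalls; after the
# DataCoord reboot it should return once the sealed segments are flushed.
collection.flush(timeout=600)

# Manually trigger a compaction and poll its state, since automatic
# L0 compaction no longer seemed to be scheduled.
collection.compact()
print(collection.get_compaction_state())
```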