[Bug]: DataNode keeps restarting due to error: failed to serialize merged stats log: shall not serialize zero length statslog list
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: cardinal-milvus-io-2.4-9a07c1bca9-20240929
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
server
A Milvus cluster that has been running for a long time, with 2 dataNodes and 4 queryNodes
test steps
- collection: laion_stable_9 has 100M 768-dim entities (a pymilvus sketch of this schema follows the list below); schema:
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}], 'enable_dynamic_field': True}
- In the previous test, many L0 segments (1.8k) were left behind and there was no time to perform L0 compaction. In addition, queryNode memory was tight.
- concurrent requests: Flush + load + search + query + upsert (a rough sketch of this workload follows the links below)
- dataNode keeps restarting with the error above
- L0 compaction no longer seems to be triggered
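For reference, the schema dump above corresponds roughly to the following pymilvus definition. This is a minimal sketch: the connection endpoint, shard/partition counts, and index parameters are assumptions, since they are not recorded in this issue.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=768),
    FieldSchema("int64_pk_5b", DataType.INT64, is_partition_key=True),
    FieldSchema("varchar_caption", DataType.VARCHAR, max_length=8192),
    FieldSchema("varchar_NSFW", DataType.VARCHAR, max_length=8192),
    FieldSchema("float64_similarity", DataType.FLOAT),
    FieldSchema("int64_width", DataType.INT64),
    FieldSchema("int64_height", DataType.INT64),
    FieldSchema("int64_original_width", DataType.INT64),
    FieldSchema("int64_original_height", DataType.INT64),
    FieldSchema("varchar_md5", DataType.VARCHAR, max_length=8192),
]
schema = CollectionSchema(fields, description="", enable_dynamic_field=True)

# Shard and partition-key bucket counts are not in the issue; defaults assumed.
collection = Collection("laion_stable_9", schema)
```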
links
- argo: workflow
- metrics: metrics
- Loki logs: Loki logs
- pod restart reason: container restart reason
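Roughly, the concurrent workload from the test steps looks like the sketch below. The thread layout, search/query parameters, and upsert payloads are illustrative assumptions; the actual harness is the argo workflow linked above.

```python
import random
import threading

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint
collection = Collection("laion_stable_9")
DIM = 768


def random_rows(n):
    # Rows carry every non-auto field of the schema above (values are dummies).
    return [{
        "id": random.randint(0, 100_000_000),
        "float_vector": [random.random() for _ in range(DIM)],
        "int64_pk_5b": random.randint(0, 63),
        "varchar_caption": "caption",
        "varchar_NSFW": "UNLIKELY",
        "float64_similarity": random.random(),
        "int64_width": 512,
        "int64_height": 512,
        "int64_original_width": 512,
        "int64_original_height": 512,
        "varchar_md5": "0" * 32,
    } for _ in range(n)]


def do_flush():
    collection.flush()


def do_load():
    collection.load()


def do_search():
    collection.search(
        data=[[random.random() for _ in range(DIM)]],
        anns_field="float_vector",
        param={"metric_type": "L2", "params": {}},  # metric/index params assumed
        limit=10,
    )


def do_query():
    collection.query(expr="id in [1, 2, 3]", output_fields=["id"])


def do_upsert():
    collection.upsert(random_rows(100))


# Fire the five request types concurrently, as in the test.
threads = [threading.Thread(target=op)
           for op in (do_flush, do_load, do_search, do_query, do_upsert)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```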
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
pods:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
laion1b-test-2-etcd-0 1/1 Running 1 (4d23h ago) 38d 10.104.25.207 4am-node30 <none> <none>
laion1b-test-2-etcd-1 1/1 Running 0 97d 10.104.30.186 4am-node38 <none> <none>
laion1b-test-2-etcd-2 1/1 Running 0 299d 10.104.34.225 4am-node37 <none> <none>
laion1b-test-2-milvus-datanode-7b8f94796b-9wb45 1/1 Running 95 (11h ago) 10d 10.104.1.226 4am-node10 <none> <none>
laion1b-test-2-milvus-datanode-7b8f94796b-hrvq2 1/1 Running 54 (4d12h ago) 10d 10.104.20.103 4am-node22 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-cfgkh 1/1 Running 2 (8d ago) 10d 10.104.19.46 4am-node28 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-m7xcf 1/1 Running 0 10d 10.104.30.115 4am-node38 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-mf42d 1/1 Running 0 10d 10.104.32.231 4am-node39 <none> <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-q88hf 1/1 Running 0 10d 10.104.16.70 4am-node21 <none> <none>
laion1b-test-2-milvus-mixcoord-b484b7777-ggxcn 1/1 Running 1 (8d ago) 10d 10.104.30.114 4am-node38 <none> <none>
laion1b-test-2-milvus-proxy-787965c494-kzlrp 1/1 Running 0 10d 10.104.32.230 4am-node39 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-24wr8 1/1 Running 3 (5d15h ago) 10d 10.104.26.61 4am-node32 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-5wp66 1/1 Running 2 (4d12h ago) 10d 10.104.24.214 4am-node29 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-8gztc 1/1 Running 2 (4d23h ago) 10d 10.104.15.164 4am-node20 <none> <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-g8lsb 1/1 Running 3 (4d23h ago) 10d 10.104.27.130 4am-node31 <none> <none>
laion1b-test-2-pulsar-bookie-0 1/1 Running 0 299d 10.104.33.107 4am-node36 <none> <none>
laion1b-test-2-pulsar-bookie-1 1/1 Running 0 102d 10.104.18.97 4am-node25 <none> <none>
laion1b-test-2-pulsar-bookie-2 1/1 Running 0 38d 10.104.25.206 4am-node30 <none> <none>
laion1b-test-2-pulsar-broker-0 1/1 Running 1 (171d ago) 180d 10.104.1.147 4am-node10 <none> <none>
laion1b-test-2-pulsar-proxy-0 1/1 Running 0 168d 10.104.32.209 4am-node39 <none> <none>
laion1b-test-2-pulsar-recovery-0 1/1 Running 1 (168d ago) 200d 10.104.31.87 4am-node34 <none> <none>
laion1b-test-2-pulsar-zookeeper-0 1/1 Running 0 299d 10.104.29.87 4am-node35 <none> <none>
laion1b-test-2-pulsar-zookeeper-1 1/1 Running 0 180d 10.104.21.196 4am-node24 <none> <none>
laion1b-test-2-pulsar-zookeeper-2 1/1 Running 0 299d 10.104.34.229 4am-node37 <none> <none>
Anything else?
No response
/assign @XuanYang-cn /unassign
Some segments stay in the "Sealed" state, but the checkpoint has already been advanced past them, and DataNode fails to flush those sealed segments and keeps panicking.
After rebooting DataCoord, those sealed segments became flushed, and there has been no panic since.
It is hard to find the root cause.
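For the record, a minimal client-side check of the workaround (endpoint and timeout values are assumptions; this only observes flush and compaction from the SDK, not the internal L0 compaction path):

```python
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint
collection = Collection("laion_stable_9")

# While DataNode is stuck in the panic loop this flush stalls; after the
# DataCoord reboot it should return once the sealed segments are flushed.
collection.flush(timeout=600)

# Manually trigger a compaction and poll its state, since automatic
# L0 compaction no longer seemed to be scheduled.
collection.compact()
print(collection.get_compaction_state())
```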