milvus
[Bug]: [benchmark][cluster] Pulsar proxy restarted after becoming unhealthy, with exit code 137
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:2.4-20240330-bc4a9a1ab-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.0rc66
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
argo task: inverted-corn-1711900800
test case name: test_inverted_locust_hnsw_ivf_sq8_dml_dql_cluster
server:
[2024-03-31 19:50:36,265 - INFO - fouram]: [Base] Deploy initial state:
I0331 16:11:19.281651 410 request.go:665] Waited for 1.151318555s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/authentication.k8s.io/v1?timeout=32s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
inverted-corn-100800-7-14-1665-etcd-0 1/1 Running 0 6m32s 10.104.30.17 4am-node38 <none> <none>
inverted-corn-100800-7-14-1665-etcd-1 1/1 Running 0 6m32s 10.104.32.70 4am-node39 <none> <none>
inverted-corn-100800-7-14-1665-etcd-2 1/1 Running 0 6m32s 10.104.19.249 4am-node28 <none> <none>
inverted-corn-100800-7-14-1665-milvus-datacoord-d446f5dff-t7v78 1/1 Running 0 6m32s 10.104.24.40 4am-node29 <none> <none>
inverted-corn-100800-7-14-1665-milvus-datanode-7f89fb8776-28qfh 1/1 Running 1 (2m2s ago) 6m32s 10.104.24.41 4am-node29 <none> <none>
inverted-corn-100800-7-14-1665-milvus-indexcoord-9995f7d8d22q9p 1/1 Running 0 6m32s 10.104.28.164 4am-node33 <none> <none>
inverted-corn-100800-7-14-1665-milvus-indexnode-5544bc78b9hq787 1/1 Running 0 6m32s 10.104.33.142 4am-node36 <none> <none>
inverted-corn-100800-7-14-1665-milvus-indexnode-5544bc78b9pgx4d 1/1 Running 0 6m32s 10.104.9.254 4am-node14 <none> <none>
inverted-corn-100800-7-14-1665-milvus-proxy-678748dcd6-5dp2r 1/1 Running 1 (2m2s ago) 6m32s 10.104.20.133 4am-node22 <none> <none>
inverted-corn-100800-7-14-1665-milvus-querycoord-c74ccb6c955pzv 1/1 Running 1 (2m2s ago) 6m32s 10.104.33.141 4am-node36 <none> <none>
inverted-corn-100800-7-14-1665-milvus-querynode-5967f4c479ncpgw 1/1 Running 0 6m32s 10.104.27.154 4am-node31 <none> <none>
inverted-corn-100800-7-14-1665-milvus-rootcoord-6c85c66dc-j2kdr 1/1 Running 1 (2m1s ago) 6m32s 10.104.20.134 4am-node22 <none> <none>
inverted-corn-100800-7-14-1665-minio-0 1/1 Running 0 6m32s 10.104.30.9 4am-node38 <none> <none>
inverted-corn-100800-7-14-1665-minio-1 1/1 Running 0 6m32s 10.104.31.184 4am-node34 <none> <none>
inverted-corn-100800-7-14-1665-minio-2 1/1 Running 0 6m32s 10.104.19.253 4am-node28 <none> <none>
inverted-corn-100800-7-14-1665-minio-3 1/1 Running 0 6m32s 10.104.32.75 4am-node39 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-0 1/1 Running 0 6m32s 10.104.31.185 4am-node34 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-1 1/1 Running 0 6m32s 10.104.19.254 4am-node28 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-2 1/1 Running 0 6m31s 10.104.17.174 4am-node23 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-init-m2cbk 0/1 Completed 0 6m32s 10.104.4.135 4am-node11 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-broker-0 1/1 Running 0 6m32s 10.104.14.84 4am-node18 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-proxy-0 1/1 Running 0 6m32s 10.104.4.136 4am-node11 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-pulsar-init-q7v9l 0/1 Completed 0 6m32s 10.104.6.223 4am-node13 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-recovery-0 1/1 Running 0 6m32s 10.104.6.224 4am-node13 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-zookeeper-0 1/1 Running 0 6m32s 10.104.30.19 4am-node38 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-zookeeper-1 1/1 Running 0 4m33s 10.104.32.83 4am-node39 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-zookeeper-2 1/1 Running 0 3m56s 10.104.31.196 4am-node34 <none> <none> (base.py:257)
[2024-03-31 19:50:36,266 - INFO - fouram]: [Cmd Exe] kubectl get pods -n qa-milvus -o wide | grep -E 'NAME|inverted-corn-100800-7-14-1665-milvus|inverted-corn-100800-7-14-1665-minio|inverted-corn-100800-7-14-1665-etcd|inverted-corn-100800-7-14-1665-pulsar|inverted-corn-100800-7-14-1665-zookeeper|inverted-corn-100800-7-14-1665-kafka|inverted-corn-100800-7-14-1665-log|inverted-corn-100800-7-14-1665-tikv' (util_cmd.py:14)
[2024-03-31 19:50:46,383 - INFO - fouram]: [CliClient] pod details of release(inverted-corn-100800-7-14-1665):
I0331 19:50:37.536147 530 request.go:665] Waited for 1.134281501s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/discovery.k8s.io/v1beta1?timeout=32s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
inverted-corn-100800-7-14-1665-etcd-0 1/1 Running 0 3h45m 10.104.30.17 4am-node38 <none> <none>
inverted-corn-100800-7-14-1665-etcd-1 1/1 Running 0 3h45m 10.104.32.70 4am-node39 <none> <none>
inverted-corn-100800-7-14-1665-etcd-2 1/1 Running 0 3h45m 10.104.19.249 4am-node28 <none> <none>
inverted-corn-100800-7-14-1665-milvus-datacoord-d446f5dff-t7v78 1/1 Running 0 3h45m 10.104.24.40 4am-node29 <none> <none>
inverted-corn-100800-7-14-1665-milvus-datanode-7f89fb8776-28qfh 1/1 Running 1 (3h41m ago) 3h45m 10.104.24.41 4am-node29 <none> <none>
inverted-corn-100800-7-14-1665-milvus-indexcoord-9995f7d8d22q9p 1/1 Running 0 3h45m 10.104.28.164 4am-node33 <none> <none>
inverted-corn-100800-7-14-1665-milvus-indexnode-5544bc78b9hq787 1/1 Running 0 3h45m 10.104.33.142 4am-node36 <none> <none>
inverted-corn-100800-7-14-1665-milvus-indexnode-5544bc78b9pgx4d 1/1 Running 0 3h45m 10.104.9.254 4am-node14 <none> <none>
inverted-corn-100800-7-14-1665-milvus-proxy-678748dcd6-5dp2r 1/1 Running 1 (3h41m ago) 3h45m 10.104.20.133 4am-node22 <none> <none>
inverted-corn-100800-7-14-1665-milvus-querycoord-c74ccb6c955pzv 1/1 Running 1 (3h41m ago) 3h45m 10.104.33.141 4am-node36 <none> <none>
inverted-corn-100800-7-14-1665-milvus-querynode-5967f4c479ncpgw 1/1 Running 0 3h45m 10.104.27.154 4am-node31 <none> <none>
inverted-corn-100800-7-14-1665-milvus-rootcoord-6c85c66dc-j2kdr 1/1 Running 1 (3h41m ago) 3h45m 10.104.20.134 4am-node22 <none> <none>
inverted-corn-100800-7-14-1665-minio-0 1/1 Running 0 3h45m 10.104.30.9 4am-node38 <none> <none>
inverted-corn-100800-7-14-1665-minio-1 1/1 Running 0 3h45m 10.104.31.184 4am-node34 <none> <none>
inverted-corn-100800-7-14-1665-minio-2 1/1 Running 0 3h45m 10.104.19.253 4am-node28 <none> <none>
inverted-corn-100800-7-14-1665-minio-3 1/1 Running 0 3h45m 10.104.32.75 4am-node39 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-0 1/1 Running 0 3h45m 10.104.31.185 4am-node34 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-1 1/1 Running 0 3h45m 10.104.19.254 4am-node28 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-2 1/1 Running 0 3h45m 10.104.17.174 4am-node23 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-bookie-init-m2cbk 0/1 Completed 0 3h45m 10.104.4.135 4am-node11 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-broker-0 1/1 Running 0 3h45m 10.104.14.84 4am-node18 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-proxy-0 1/1 Running 1 (16m ago) 3h45m 10.104.4.136 4am-node11 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-pulsar-init-q7v9l 0/1 Completed 0 3h45m 10.104.6.223 4am-node13 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-recovery-0 1/1 Running 0 3h45m 10.104.6.224 4am-node13 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-zookeeper-0 1/1 Running 0 3h45m 10.104.30.19 4am-node38 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-zookeeper-1 1/1 Running 0 3h43m 10.104.32.83 4am-node39 <none> <none>
inverted-corn-100800-7-14-1665-pulsar-zookeeper-2 1/1 Running 0 3h43m 10.104.31.196 4am-node34 <none> <none>
{app="k8s-event-logger",cluster="4am"} |="inverted-corn-100800-7-14-1665-pulsar-proxy-0"
kube_pod_container_status_last_terminated_exitcode{pod="inverted-corn-100800-7-14-1665-pulsar-proxy-0"}
pulsar proxy resource
pulsar proxy log:
inverted-corn-100800-7-14-1665-pulsar-proxy-0.log
client pod name: inverted-corn-1711900800-1214320440
client logs: the client raises errors between 2024-03-31 19:12:31,522 and 2024-03-31 19:37:13,509 (see client.log)
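To relate the restart to the client errors, the restart time can be estimated from the second pod listing, which was taken at about 19:50:37 and shows pulsar-proxy-0 as restarted "16m ago". The sketch below is only a rough check: kubectl rounds ages, so the estimate is accurate to about a minute, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Time of the second `kubectl get pods` listing (from the log lines above).
listing_time = datetime.strptime("2024-03-31 19:50:37", FMT)

# RESTARTS column for pulsar-proxy-0 reads "1 (16m ago)"; the age is rounded.
restart_estimate = listing_time - timedelta(minutes=16)

# Window in which the client reported errors (milliseconds dropped).
errors_start = datetime.strptime("2024-03-31 19:12:31", FMT)
errors_end = datetime.strptime("2024-03-31 19:37:13", FMT)

in_window = errors_start <= restart_estimate <= errors_end
print(restart_estimate, in_window)
```

Under these assumptions the estimated restart falls inside the client error window, and the client errors start before it.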
Expected Behavior
No response
Steps To Reproduce
concurrent test and calculation of RT and QPS
:purpose: `vector: memory index`
verify concurrent DML & DQL scenario which has 2 float_vector fields & 16 scalar fields
:test steps:
1. create collection with fields:
'float_vector': 128dim,
'float_vector_1': 200dim,
'int8_1', 'int16_1', 'int32_1', 'int64_1', 'double_1', 'float_1', 'varchar_1', 'bool_1',
'int8_2', 'int16_2', 'int32_2', 'int64_2', 'double_2', 'float_2', 'varchar_2', 'bool_2'
2. build indexes:
HNSW: 'float_vector'
IVF_SQ8: 'float_vector_1'
scalar_default_index: 'int8_1', 'int16_1', 'int32_1', 'int64_1', 'double_1', 'float_1', 'varchar_1'
scalar_INVERTED_index: 'int8_2', 'int16_2', 'int32_2', 'int64_2', 'double_2', 'float_2', 'varchar_2', 'bool_2'
3. insert 5 million rows
4. flush collection
5. build indexes again using the same params
6. load collection
7. concurrent request:
- insert
- delete
- flush
- load
- search
- hybrid_search
- query
Milvus Log
No response
Anything else?
test result:
[2024-03-31 19:50:06,061 - INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-03-31 19:50:06,062 - INFO - fouram]: Type     Name                # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-03-31 19:50:06,062 - INFO - fouram]: --------|------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-03-31 19:50:06,062 - INFO - fouram]: grpc     delete                7001    18(0.26%) |    807       4   30002    160 |    0.65        0.00 (stats.py:789)
[2024-03-31 19:50:06,062 - INFO - fouram]: grpc     flush                 7008    17(0.24%) |  10127     149  690019   6600 |    0.65        0.00 (stats.py:789)
[2024-03-31 19:50:06,062 - INFO - fouram]: grpc     hybrid_search         7140    28(0.39%) |   4735     257   60103   4200 |    0.66        0.00 (stats.py:789)
[2024-03-31 19:50:06,062 - INFO - fouram]: grpc     insert                7132    34(0.48%) |    998      42   30071    250 |    0.66        0.00 (stats.py:789)
[2024-03-31 19:50:06,063 - INFO - fouram]: grpc     load                  7148     0(0.00%) |   1898       9  179834    430 |    0.66        0.00 (stats.py:789)
[2024-03-31 19:50:06,063 - INFO - fouram]: grpc     query                 7141    12(0.17%) |   5285     315  180078   4600 |    0.66        0.00 (stats.py:789)
[2024-03-31 19:50:06,063 - INFO - fouram]: grpc     search                7263    15(0.21%) |   6233     617  180171   5600 |    0.67        0.00 (stats.py:789)
[2024-03-31 19:50:06,063 - INFO - fouram]: --------|------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-03-31 19:50:06,063 - INFO - fouram]:          Aggregated           49833   124(0.25%) |   4297       4  690019   3100 |    4.61        0.01 (stats.py:789)
[2024-03-31 19:50:06,063 - INFO - fouram]: (stats.py:790)
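As a sanity check on the locust table, the per-operation counts can be summed and the failure rates recomputed. This is a minimal sketch: the numbers are copied from the table above, and the dictionary layout is just an illustration, not part of the fouram tooling.

```python
# (requests, failures) per operation, copied from the locust final stats.
stats = {
    "delete": (7001, 18),
    "flush": (7008, 17),
    "hybrid_search": (7140, 28),
    "insert": (7132, 34),
    "load": (7148, 0),
    "query": (7141, 12),
    "search": (7263, 15),
}

total_reqs = sum(reqs for reqs, _ in stats.values())
total_fails = sum(fails for _, fails in stats.values())

# Per-operation failure rate, formatted the same way locust prints it.
for name, (reqs, fails) in stats.items():
    print(f"{name:<14} {fails:>4}/{reqs} = {fails / reqs:.2%}")

aggregate_rate = f"{total_fails / total_reqs:.2%}"
print(f"{'Aggregated':<14} {total_fails:>4}/{total_reqs} = {aggregate_rate}")
```

The sums match the Aggregated row (49833 requests, 124 failures), so about one request in 400 failed over the 3h run, with insert and hybrid_search showing the highest rates.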
[2024-03-31 19:50:06,067 - INFO - fouram]: [PerfTemplate] Report data:
{'server': {'deploy_tool': 'helm',
'deploy_mode': 'cluster',
'config_name': 'cluster_8c16m',
'config': {'queryNode': {'resources': {'limits': {'cpu': '16.0',
'memory': '64Gi'},
'requests': {'cpu': '9.0',
'memory': '33Gi'}},
'replicas': 1},
'indexNode': {'resources': {'limits': {'cpu': '8.0',
'memory': '16Gi'},
'requests': {'cpu': '5.0',
'memory': '9Gi'}},
'replicas': 2},
'dataNode': {'resources': {'limits': {'cpu': '8.0',
'memory': '16Gi'},
'requests': {'cpu': '5.0',
'memory': '9Gi'}}},
'cluster': {'enabled': True},
'pulsar': {},
'kafka': {},
'minio': {'metrics': {'podMonitor': {'enabled': True}}},
'etcd': {'metrics': {'enabled': True,
'podMonitor': {'enabled': True}}},
'metrics': {'serviceMonitor': {'enabled': True}},
'log': {'level': 'debug'},
'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus',
'tag': '2.4-20240330-bc4a9a1ab-amd64'}}},
'host': 'inverted-corn-100800-7-14-1665-milvus.qa-milvus.svc.cluster.local',
'port': '19530',
'uri': ''},
'client': {'test_case_type': 'ConcurrentClientBase',
'test_case_name': 'test_inverted_locust_hnsw_ivf_sq8_dml_dql_cluster',
'test_case_params': {'dataset_params': {'metric_type': 'L2',
'dim': 128,
'scalars_index': {'int8_1': {},
'int16_1': {},
'int32_1': {},
'int64_1': {},
'double_1': {},
'float_1': {},
'varchar_1': {},
'int8_2': {'index_type': 'INVERTED'},
'int16_2': {'index_type': 'INVERTED'},
'int32_2': {'index_type': 'INVERTED'},
'int64_2': {'index_type': 'INVERTED'},
'double_2': {'index_type': 'INVERTED'},
'float_2': {'index_type': 'INVERTED'},
'varchar_2': {'index_type': 'INVERTED'},
'bool_2': {'index_type': 'INVERTED'}},
'vectors_index': {'float_vector_1': {'index_type': 'IVF_SQ8',
'index_param': {'nlist': 1024},
'metric_type': 'L2'}},
'scalars_params': {'float_vector_1': {'params': {'dim': 200},
'other_params': {'dataset': 'text2img',
'dim': 200}}},
'dataset_name': 'sift',
'dataset_size': 5000000,
'ni_per': 5000},
'collection_params': {'other_fields': ['float_vector_1',
'int8_1',
'int16_1',
'int32_1',
'int64_1',
'double_1',
'float_1',
'varchar_1',
'bool_1',
'int8_2',
'int16_2',
'int32_2',
'int64_2',
'double_2',
'float_2',
'varchar_2',
'bool_2'],
'shards_num': 2},
'resource_groups_params': {'reset': False},
'database_user_params': {'reset_rbac': False,
'reset_db': False},
'index_params': {'index_type': 'HNSW',
'index_param': {'M': 8,
'efConstruction': 200}},
'concurrent_params': {'concurrent_number': 20,
'during_time': '3h',
'interval': 20,
'spawn_rate': None},
'concurrent_tasks': [{'type': 'insert',
'weight': 1,
'params': {'nb': 10,
'timeout': 30,
'random_id': True,
'random_vector': True,
'varchar_filled': False,
'start_id': 5000000}},
{'type': 'delete',
'weight': 1,
'params': {'expr': '',
'delete_length': 9,
'timeout': 30}},
{'type': 'flush',
'weight': 1,
'params': {'timeout': 600}},
{'type': 'load',
'weight': 1,
'params': {'replica_number': 1,
'timeout': 180}},
{'type': 'search',
'weight': 1,
'params': {'nq': 1000,
'top_k': 1,
'search_param': {'ef': 64},
'expr': 'int64_1 '
'> '
'-1 '
'&& '
'id '
'> '
'-1',
'guarantee_timestamp': None,
'partition_names': None,
'output_fields': ['*'],
'ignore_growing': False,
'group_by_field': None,
'timeout': 180,
'random_data': True}},
{'type': 'hybrid_search',
'weight': 1,
'params': {'nq': 1,
'top_k': 10,
'reqs': [{'search_param': {'nprobe': 16},
'anns_field': 'float_vector_1',
'expr': 'varchar_1 '
'like '
'"0%" '
'&& '
'bool_2 '
'== '
'True',
'top_k': 2000},
{'search_param': {'ef': 128},
'anns_field': 'float_vector',
'expr': 'int64_1 '
'< '
'100000 '
'&& '
'float_2 '
'> '
'10.0'}],
'rerank': {'WeightedRanker': [0.5,
0.5]},
'output_fields': ['*'],
'ignore_growing': False,
'guarantee_timestamp': None,
'partition_names': None,
'timeout': 60,
'random_data': True}},
{'type': 'query',
'weight': 1,
'params': {'ids': None,
'expr': 'int64_1 '
'> '
'-1 '
'&& '
'int64_2 '
'> '
'-1 '
'&& ',
'output_fields': ['*'],
'offset': None,
'limit': None,
'ignore_growing': False,
'partition_names': None,
'timeout': 180,
'random_data': True,
'random_count': 20,
'random_range': [2500000.0,
5000000],
'field_name': 'id',
'field_type': 'int64'}}]},
'run_id': 2024033110926729,
'datetime': '2024-03-31 16:04:52.219749',
'client_version': '2.4.0'},
'result': {'test_result': {'index': {'RT': 628.0449,
'float_vector_1': {'RT': 432.0188},
'int8_1': {'RT': 228.0169},
'int16_1': {'RT': 1.0271},
'int32_1': {'RT': 26.3016},
'int64_1': {'RT': 0.5208},
'double_1': {'RT': 0.5194},
'float_1': {'RT': 2.0338},
'varchar_1': {'RT': 0.5692},
'int8_2': {'RT': 0.5897},
'int16_2': {'RT': 0.5189},
'int32_2': {'RT': 0.52},
'int64_2': {'RT': 0.5214},
'double_2': {'RT': 0.5734},
'float_2': {'RT': 0.6055},
'varchar_2': {'RT': 0.5883},
'bool_2': {'RT': 0.6554}},
'insert': {'total_time': 578.7469,
'VPS': 8639.3551,
'batch_time': 0.5787,
'batch': 5000},
'flush': {'RT': 3.5326},
'load': {'RT': 24.1605},
'Locust': {'Aggregated': {'Requests': 49833,
'Fails': 124,
'RPS': 4.61,
'fail_s': 0.0,
'RT_max': 690019.93,
'RT_avg': 4297.34,
'TP50': 3100.0,
'TP99': 17000.0},
'delete': {'Requests': 7001,
'Fails': 18,
'RPS': 0.65,
'fail_s': 0.0,
'RT_max': 30002.59,
'RT_avg': 807.37,
'TP50': 160.0,
'TP99': 8400.0},
'flush': {'Requests': 7008,
'Fails': 17,
'RPS': 0.65,
'fail_s': 0.0,
'RT_max': 690019.93,
'RT_avg': 10127.74,
'TP50': 6600.0,
'TP99': 26000.0},
'hybrid_search': {'Requests': 7140,
'Fails': 28,
'RPS': 0.66,
'fail_s': 0.0,
'RT_max': 60103.92,
'RT_avg': 4735.98,
'TP50': 4200.0,
'TP99': 11000.0},
'insert': {'Requests': 7132,
'Fails': 34,
'RPS': 0.66,
'fail_s': 0.0,
'RT_max': 30071.36,
'RT_avg': 998.29,
'TP50': 250.0,
'TP99': 8800.0},
'load': {'Requests': 7148,
'Fails': 0,
'RPS': 0.66,
'fail_s': 0.0,
'RT_max': 179834.89,
'RT_avg': 1898.88,
'TP50': 430.0,
'TP99': 11000.0},
'query': {'Requests': 7141,
'Fails': 12,
'RPS': 0.66,
'fail_s': 0.0,
'RT_max': 180078.39,
'RT_avg': 5285.08,
'TP50': 4600.0,
'TP99': 13000.0},
'search': {'Requests': 7263,
'Fails': 15,
'RPS': 0.67,
'fail_s': 0.0,
'RT_max': 180171.67,
'RT_avg': 6233.42,
'TP50': 5600.0,
'TP99': 13000.0}}}}}
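The `insert` figures in the report are consistent with each other if `VPS` is read as vectors per second, computed as `dataset_size / total_time`, and `batch_time` as the average time per batch of `ni_per` rows. The report does not state these formulas, so treat the following as an assumption checked against the reported numbers:

```python
# Values copied from the report above.
dataset_size = 5_000_000   # 'dataset_size'
ni_per = 5000              # rows per insert batch ('ni_per' / 'batch')
total_time = 578.7469      # seconds, 'insert.total_time'

# Assumed definitions (not stated in the report).
vps = dataset_size / total_time      # vectors inserted per second
batches = dataset_size // ni_per     # number of insert batches
batch_time = total_time / batches    # average seconds per batch

print(f"VPS ~ {vps:.2f}, batches = {batches}, batch_time ~ {batch_time:.4f}s")
```

Both derived values agree with the reported `VPS` (8639.3551) and `batch_time` (0.5787) to within rounding.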
/assign @zhagnlu /unassign
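Pulsar exiting with code 137 (128 + 9, i.e. SIGKILL) means the process was killed by an external action, possibly by the operating system or by a manual operation.
CPU and memory usage are under the limits, so it was not killed by the OOM killer.
It may have been an accidental deletion; keep checking whether it reproduces.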
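For reference, shells and container runtimes conventionally report a process killed by a signal as exit code 128 plus the signal number. A small sketch of decoding such a code with Python's standard `signal` module (the helper name `describe_exit_code` is made up for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a container exit code using the 128 + signal convention."""
    if code > 128:
        # Raises ValueError if code - 128 is not a valid signal number.
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

This only identifies the signal. It does not show who sent it, so the source of the SIGKILL still has to be found from the Kubernetes events and node logs.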
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.