[Bug]: Standalone pod restarted several times when concurrent insert and search multi collections
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Environment
- Milvus version: 2.2.0-20230504-842e5d21
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.2.8.dev1
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
### Current Behavior
- deploy standalone with image `2.2.0-20230504-842e5d21` and the following config:

  ```yaml
  config:
    log:
      level: debug
    quotaAndLimits:
      limitWriting:
        diskProtection:
          diskQuotaPerCollection: 500  # unit: MB
  ```
- test concurrent create collection / insert / search across multiple collections (a minimal pymilvus sketch of the per-collection steps follows this list):
  - create collection
  - insert 1m of 128-dim vectors, ni=10000 per batch
  - flush and get num entities
  - build HNSW index: `{"M": 8, "efConstruction": 200}`
  - load collection
  - search with nq=10, top_k=100, 5000 times
- the standalone pod restarted 4 times
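For reference, a minimal pymilvus sketch of the per-collection steps above, assuming a local standalone at `localhost:19530`; the collection and field names (`repro_coll`, `vec`) are illustrative, not from the original test:

```python
# Minimal sketch of the per-collection workload, assuming a standalone Milvus
# at localhost:19530. Names ("repro_coll", "vec") are illustrative; the
# original fouram case drives many collections concurrently.
import numpy as np
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=128),
])
coll = Collection("repro_coll", schema)

# insert 1m rows of 128-dim vectors, 10k per batch (ni=10000)
for _ in range(100):
    vectors = np.random.random((10000, 128)).astype(np.float32).tolist()
    coll.insert([vectors])

coll.flush()
print("num_entities:", coll.num_entities)

# build HNSW index with {"M": 8, "efConstruction": 200}, then load for search
coll.create_index("vec", {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 200},
})
coll.load()

# one search with nq=10, top_k=100 (repeated 5000 times in the original case)
queries = np.random.random((10, 128)).astype(np.float32).tolist()
coll.search(queries, "vec", {"metric_type": "L2", "params": {"ef": 34}}, limit=100)
```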
panic logs:
### Expected Behavior

_No response_
### Steps To Reproduce
fouram argo name: `quota-collections-3`
run fouramf case:
```python
@pytest.mark.locust
@pytest.mark.parametrize("deploy_mode", [STANDALONE])
def test_concurrent_locust_multi_collections(self, input_params: InputParamsBase, deploy_mode):
    """
    Used to check whether the memory usage of queryNodes is balanced.
    :test steps:
        1. concurrent test and calculation of RT and QPS
    """
    concurrent_tasks = [
        ConcurrentParams.params_scene_search_test(
            weight=5, shards_num=2, data_size='1m', nb=10000, replica_number=1,
            index_type=pn.IndexTypeName.HNSW, index_param={"M": 8, "efConstruction": 200},
            nq=10, top_k=100, search_param={"ef": 34}, search_counts=5000)
    ]
    default_case_params = ConcurrentParams().params_scene_concurrent(
        concurrent_tasks, concurrent_number=[50], during_time="5h", interval=20,
        dataset_size=0, ni_per=0, replica_number=1, **cdp.DefaultIndexParams.HNSW)
    self.concurrency_template(input_params=input_params, cpu=dp.min_cpu, mem=dp.min_mem,
                              deploy_mode=deploy_mode, old_version_format=False,
                              case_callable_obj=ConcurrentClientBase().scene_concurrent_locust,
                              default_case_params=default_case_params)
```
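For readers without access to the fouram framework, a rough stand-in for the concurrent search phase might look like the sketch below. Only the concurrency level (`concurrent_number=[50]`) and the search parameters mirror the case above; the connection details and names are illustrative:

```python
# Rough stand-in for the fouram concurrent phase: 50 threads searching the
# collection in a loop. Assumes the collection built in the sketch above.
import threading

import numpy as np
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
coll = Collection("repro_coll")  # illustrative name, see earlier sketch
coll.load()

def worker(searches: int = 100) -> None:
    # each search: nq=10, top_k=100, ef=34, as in the case params
    for _ in range(searches):
        queries = np.random.random((10, 128)).astype(np.float32).tolist()
        coll.search(queries, "vec",
                    {"metric_type": "L2", "params": {"ef": 34}}, limit=100)

threads = [threading.Thread(target=worker) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```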
### Milvus Log
server pods in `fouram` cluster and `qa-milvus` ns:
```
k get pod -o wide -n qa-milvus | grep fouram-op-54-8249
fouram-op-54-8249-etcd-0   1/1   Running   0   28h   10.104.4.130   4am-node11
```
client pod in `fouram` cluster and `qa` ns: `quota-collections-3-1904905947`
[standalone_pre.log](https://github.com/milvus-io/milvus/files/11406741/standalone_pre.log)
### Anything else?
_No response_
/assign @jiaoew1991 /unassign
/assign @yah01 /unassign
The panic is caused by a concurrent write/read on a map.
#23957 has fixed this
/assign @ThreadDao please help verify with #23957
rerun image: `2.2.0-20230509-341b62d5`
The standalone pod restarted again; one restart was OOMKilled and the other completed with exit code 0.
I stopped the test and increased the standalone pod memory from 16G to 20G, but it still crashed:
```
fouram-op-54-8249-etcd-0                               1/1   Running   0              5d      10.104.4.130   4am-node11   <none>   <none>
fouram-op-54-8249-milvus-standalone-865768cb7c-vhvsm   0/1   Running   10 (5m32s ago) 40m     10.104.4.170   4am-node11   <none>   <none>
fouram-op-54-8249-minio-744659cbdf-h5xlr               1/1   Running   0              5d      10.104.4.131   4am-node11   <none>   <none>
```
standalone pod previous log: standalone_pre_1.log
/assign @yah01 please help check whether this is caused by insufficient memory. If yes, why is the exit code 0?
/assign @yah01 /unassign
image: `2.2.0-20230512-d882624b`
The standalone pod crashed again:
```
fouram-op-54-8249-etcd-0                               1/1   Running   0             8d
fouram-op-54-8249-milvus-standalone-576d456c9d-7z9sb   0/1   Running   2 (45s ago)   6m56s
fouram-op-54-8249-minio-744659cbdf-h5xlr               1/1   Running   0             8d
```
The standalone crashed again with exit code 0.
pre log: Uploading standalone_pre_completed.log…
@yah01 any updates?
related #24489
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.