[Bug]: [perf][nightly]Constant load release operations on milvus lead to pod oom
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version:master-20230406-1fc33e2e
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
release_name_prefix: perf-standalone-master-1-1680786000
deploy_config: fouramf-server-standalone-8c16m
case_params: fouramf-client-sift1m-load-release
other_params: --milvus_tag_prefix=master -s
case_name: test_concurrent_locust_custom_parameters
server:
perf-standalone86000-4-18-4753-etcd-0 1/1 Running 0 64m 10.104.9.158 4am-node14 <none> <none>
perf-standalone86000-4-18-4753-milvus-standalone-59457d878kjmfn 1/1 Running 4 (7m29s ago) 64m 10.104.14.53 4am-node18 <none> <none>
perf-standalone86000-4-18-4753-minio-767fffd748-8zwtr 1/1 Running 0 64m 10.104.4.40 4am-node11 <none> <none>
client log:
[2023-04-06 14:00:04,922 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-04-06 13:58:32.100161', 'RPC error': '2023-04-06 14:00:04.922314'}> (decorators.py:108)
[2023-04-06 14:00:04,923 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-04-06 13:58:27.581746', 'RPC error': '2023-04-06 14:00:04.923095'}> (decorators.py:108)
[2023-04-06 14:00:04,923 - WARNING - fouram]: Retry timeout: 60s (decorators.py:79)
[2023-04-06 14:00:04,923 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-04-06 13:58:27.576589', 'RPC error': '2023-04-06 14:00:04.923327'}> (decorators.py:108)
[2023-04-06 14:00:04,923 - ERROR - fouram]: Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 793, in get_loading_progress
response = self._stub.GetLoadingProgress.future(request, timeout=timeout).result()
File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 744, in result
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1680789604.921734634","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1680789604.921733357","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 710, in load_collection
self.wait_for_loading_collection(collection_name, timeout)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
raise e
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
raise e
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 728, in wait_for_loading_collection
progress = self.get_loading_progress(collection_name, timeout=timeout)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
raise e
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 66, in handler
raise MilvusUnavailableException(message=f"server Unavailable: {timeout_msg}") from e
pymilvus.exceptions.MilvusUnavailableException: <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
res = func(*args, **kwargs)
File "/src/fouram/client/util/api_request.py", line 70, in api_request
return func(*arg, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 366, in load
conn.load_collection(self._name, replica_number=replica_number, timeout=timeout, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
raise e
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 80, in handler
raise MilvusException(e.code, f"{timeout_msg}, message={e.message}") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)>
standalone pod:
Expected Behavior
No response
Steps To Reproduce
1. create a collection
2. build index on vector column
3. insert 1m 128d of vectors
4. flush collection
5. build index on vector column with the same parameters
6. count the total number of rows
7. load collection
8. perform concurrent load_release operations, which raises an error (see the sketch below)
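A minimal pymilvus sketch of steps 1-7, assuming a standalone instance at localhost:19530 and a hypothetical collection name; the nightly run drives this through the fouram client, so this is only an approximation of the setup:

```python
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

# Assumed connection to a standalone Milvus (adjust host/port as needed).
connections.connect(host="localhost", port="19530")

dim = 128
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection("sift_1m_repro", CollectionSchema(fields))  # 1. create collection

index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}
collection.create_index("float_vector", index_params)               # 2. build index

# 3. insert 1m 128-dim vectors in batches of 50k (ni_per from the test params)
ni_per, total = 50_000, 1_000_000
for start in range(0, total, ni_per):
    ids = list(range(start, start + ni_per))
    vectors = np.random.random((ni_per, dim)).tolist()
    collection.insert([ids, vectors])

collection.flush()                                                   # 4. flush
collection.create_index("float_vector", index_params)               # 5. same index params
print(collection.num_entities)                                       # 6. count rows
collection.load(replica_number=1)                                    # 7. load collection
```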
Milvus Log
No response
Anything else?
'client': {'test_case_type': 'ConcurrentClientBase',
'test_case_name': 'test_concurrent_locust_custom_parameters',
'test_case_params': {'dataset_params': {'dim': 128,
'dataset_name': 'sift',
'dataset_size': '1m',
'ni_per': 50000,
'metric_type': 'L2'},
'collection_params': {'other_fields': []},
'load_params': {},
'search_params': {},
'index_params': {'index_type': 'HNSW',
'index_param': {'M': 8,
'efConstruction': 200}},
'concurrent_params': {'concurrent_number': 1,
'during_time': 3600,
'interval': 20,
'spawn_rate': None},
'concurrent_tasks': [{'type': 'load_release',
'weight': 1,
'params': {'replica_number': 1}}]},
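For reference, a rough sketch of what the 'load_release' concurrent task above amounts to against the collection prepared in the repro steps. This is not the fouram implementation; the collection name is assumed, and the interval is treated as a pause between iterations, which is an assumption (in fouram it may be a reporting interval):

```python
import time
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("sift_1m_repro")   # existing collection from the repro sketch

during_time, interval = 3600, 20           # values from concurrent_params above
deadline = time.time() + during_time
while time.time() < deadline:
    collection.load(replica_number=1)      # replica_number from the load_release task params
    collection.release()
    time.sleep(interval)
```

Repeating this load/release cycle is what eventually drives the standalone pod to restart with OOM.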
/assign @jiaoew1991
set high priority
/unassign
/assign @aoiasd
/unassign
update: image: master-20230506-ad75afdc, pymilvus: 2.4.0dev12
server:
perf-standalone41000-7-8-2694-etcd-0 1/1 Running 0 64m 10.104.4.93 4am-node11 <none> <none>
perf-standalone41000-7-8-2694-milvus-standalone-5db766d557nqndm 1/1 Running 4 (6m26s ago) 64m 10.104.17.99 4am-node23 <none> <none>
perf-standalone41000-7-8-2694-minio-6dbf44f5b8-wqvjg 1/1 Running 0 64m 10.104.5.32 4am-node12 <none> <none>
log:
[2023-05-07 06:54:41,637 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-05-07 06:53:08.808806', 'RPC error': '2023-05-07 06:54:41.637781'}> (decorators.py:108)
[2023-05-07 06:54:41,640 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-05-07 06:53:04.722231', 'RPC error': '2023-05-07 06:54:41.640214'}> (decorators.py:108)
[2023-05-07 06:54:41,640 - WARNING - fouram]: Retry timeout: 60s (decorators.py:79)
[2023-05-07 06:54:41,640 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-05-07 06:53:04.715815', 'RPC error': '2023-05-07 06:54:41.640405'}> (decorators.py:108)
[2023-05-07 06:54:41,641 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)> (api_request.py:49)
[2023-05-07 06:54:41,641 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)> (func_check.py:46)
[2023-05-07 06:54:41,776 - WARNING - fouram]: [search] retry:4, cost: 0.27s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-05-07 06:54:42,048 - WARNING - fouram]: [search] retry:5, cost: 0.81s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-05-07 06:54:42,860 - WARNING - fouram]: [search] retry:6, cost: 2.43s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-05-07 06:54:45,295 - WARNING - fouram]: [search] retry:7, cost: 7.29s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
Memory leak at ConcurrentVector and ZSTD create; currently, releasing a segment does not release all of its memory.
Found that if the boost version is greater than 1.80, memory held by boost::container::queue cannot be released, which causes the memory leak in ConcurrentVector.
Fixed.
/unassign @aoiasd
/assign @jingkl
release_name_prefix: perf-single-1685521800
deploy_config: fouramf-server-standalone-8c16m
case_params: fouramf-client-sift1m-load-search-release
image: 2.2.0-20230531-ee45c791
grafana:
I0531 08:40:10.603857 382 request.go:665] Waited for 1.149781982s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/authorization.k8s.io/v1?timeout=32s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
perf-single-16821800-7-33-8251-etcd-0 1/1 Running 0 8m15s 10.104.24.128 4am-node29 <none> <none>
perf-single-16821800-7-33-8251-milvus-standalone-5c8b5d446phtc5 1/1 Running 0 8m20s 10.104.6.53 4am-node13 <none> <none>
perf-single-16821800-7-33-8251-minio-87cc54f5-rdbtj 1/1 Running 0 8m20s 10.104.4.168 4am-node11 <none> <none> (base.py:192)
[2023-05-31 09:42:24,853 - INFO - fouram]: [Cmd Exe] kubectl get pods -n qa-milvus -o wide | grep -E 'STATUS|perf-single-16821800-7-33-8251' (util_cmd.py:14)
[2023-05-31 09:42:33,296 - INFO - fouram]: [CliClient] pod details of release(perf-single-16821800-7-33-8251):
I0531 09:42:26.105931 475 request.go:665] Waited for 1.161895204s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/garo.tietoevry.com/v1alpha1?timeout=32s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
perf-single-16821800-7-33-8251-etcd-0 1/1 Running 0 70m 10.104.24.128 4am-node29 <none> <none>
perf-single-16821800-7-33-8251-milvus-standalone-5c8b5d446phtc5 1/1 Running 0 70m 10.104.6.53 4am-node13 <none> <none>
perf-single-16821800-7-33-8251-minio-87cc54f5-rdbtj 1/1 Running 0 70m 10.104.4.168 4am-node11 <none> <none> (cli_client.py:131)
@jiaoew1991 @aoiasd
Segcore has no leak. Standalone uses RocksDB, and the RocksDB cache will use some memory. But the true memory usage is much larger than what heaptrack reports; this may be caused by malloc.
/assign @jingkl
image: master-20230712-fc12d399, pymilvus: 2.4.0dev81
This issue has been verified to be fixed.