
[Bug]: [perf][nightly] Constant load/release operations on Milvus lead to pod OOM

[Open] jingkl opened this issue 1 year ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230406-1fc33e2e
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

release_name_prefix: perf-standalone-master-1-1680786000
deploy_config: fouramf-server-standalone-8c16m
case_params: fouramf-client-sift1m-load-release
other_params: --milvus_tag_prefix=master -s
case_name: test_concurrent_locust_custom_parameters

server:

perf-standalone86000-4-18-4753-etcd-0                             1/1     Running            0                64m    10.104.9.158    4am-node14   <none>           <none>
perf-standalone86000-4-18-4753-milvus-standalone-59457d878kjmfn   1/1     Running            4 (7m29s ago)    64m    10.104.14.53    4am-node18   <none>           <none>
perf-standalone86000-4-18-4753-minio-767fffd748-8zwtr             1/1     Running            0                64m    10.104.4.40     4am-node11   <none>           <none> 

client log:

[2023-04-06 14:00:04,922 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-04-06 13:58:32.100161', 'RPC error': '2023-04-06 14:00:04.922314'}> (decorators.py:108)
[2023-04-06 14:00:04,923 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-04-06 13:58:27.581746', 'RPC error': '2023-04-06 14:00:04.923095'}> (decorators.py:108)
[2023-04-06 14:00:04,923 - WARNING - fouram]: Retry timeout: 60s (decorators.py:79)
[2023-04-06 14:00:04,923 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-04-06 13:58:27.576589', 'RPC error': '2023-04-06 14:00:04.923327'}> (decorators.py:108)
[2023-04-06 14:00:04,923 - ERROR - fouram]: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 793, in get_loading_progress
    response = self._stub.GetLoadingProgress.future(request, timeout=timeout).result()
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 744, in result
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1680789604.921734634","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1680789604.921733357","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 710, in load_collection
    self.wait_for_loading_collection(collection_name, timeout)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 728, in wait_for_loading_collection
    progress = self.get_loading_progress(collection_name, timeout=timeout)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 66, in handler
    raise MilvusUnavailableException(message=f"server Unavailable: {timeout_msg}") from e
pymilvus.exceptions.MilvusUnavailableException: <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
    res = func(*args, **kwargs)
  File "/src/fouram/client/util/api_request.py", line 70, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 366, in load
    conn.load_collection(self._name, replica_number=replica_number, timeout=timeout, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 80, in handler
    raise MilvusException(e.code, f"{timeout_msg}, message={e.message}") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)>
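
The chained tracebacks show the client exhausting its 60s retry budget while the standalone pod restarts after the OOM kill. As a rough sketch of how this failure surfaces to calling code (a minimal example assuming a pymilvus 2.x client; the endpoint and collection name are placeholders, not taken from the fouram harness):

```python
# Minimal sketch, assuming pymilvus 2.x; endpoint and collection name are
# placeholders, not taken from the fouram harness.
from pymilvus import Collection, connections
from pymilvus.exceptions import MilvusException, MilvusUnavailableException

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint
collection = Collection("sift_1m")                   # placeholder collection name

try:
    # While the pod is down, pymilvus retries internally for up to 60s and
    # then raises the chained exceptions shown in the traceback above.
    collection.load(replica_number=1, timeout=60)
except MilvusUnavailableException as exc:
    print(f"server unreachable after retries: {exc}")
except MilvusException as exc:
    print(f"load failed: {exc}")
```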

standalone pod: [screenshot 2023-04-07 14:39:51]

Expected Behavior

No response

Steps To Reproduce

1. create a collection
2. build index on vector column
3. insert 1m 128-dim vectors
4. flush collection
5. build index on vector column with the same parameters
6. count the total number of rows
7. load collection
8. perform concurrent load_release operations, which raise the error above (see the sketch below)
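
As a rough end-to-end illustration of these steps, a hedged pymilvus sketch (assumed 2.x API; the endpoint, collection name, and batching loop are illustrative assumptions, not the fouram harness itself):

```python
# Hedged reproduction sketch (assumed pymilvus 2.x). Endpoint and collection
# name are placeholders; the real runs use the fouram harness, not this script.
import random
import time

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint

DIM = 128
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=DIM),
])
collection = Collection("load_release_repro", schema)  # 1. create a collection

index = {"index_type": "HNSW", "metric_type": "L2",
         "params": {"M": 8, "efConstruction": 200}}
collection.create_index("vector", index)               # 2. build index

for _ in range(20):                                    # 3. insert 1m vectors,
    batch = [[random.random() for _ in range(DIM)]     #    50,000 per batch
             for _ in range(50_000)]
    collection.insert([batch])
collection.flush()                                     # 4. flush
collection.create_index("vector", index)               # 5. same index params
print(collection.num_entities)                         # 6. count rows
collection.load()                                      # 7. load

while True:                                            # 8. constant load/release
    collection.release()                               #    until the pod OOMs
    collection.load(replica_number=1)
    time.sleep(1)
```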

Milvus Log

No response

Anything else?

'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'dim': 128,
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '1m',
                                                    'ni_per': 50000,
                                                    'metric_type': 'L2'},
                                 'collection_params': {'other_fields': []},
                                 'load_params': {},
                                 'search_params': {},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 1,
                                                       'during_time': 3600,
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'load_release',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1}}]},
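
Read together, these parameters mean a single worker repeatedly executing a load_release task for 3600 seconds. A hedged sketch of what that task boils down to in pymilvus terms (the loop body is an assumption; ConcurrentClientBase's actual internals are not shown in this issue):

```python
# Hedged sketch of the 'load_release' concurrent task; the assumption is that
# each iteration reduces to a load/release pair against the same collection.
import time

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint
collection = Collection("sift_1m")                   # placeholder, data already inserted

DURING_TIME = 3600  # 'during_time' from concurrent_params
deadline = time.time() + DURING_TIME
while time.time() < deadline:
    collection.load(replica_number=1)  # 'params': {'replica_number': 1}
    collection.release()
# 'interval': 20 is read here as the harness's stats-reporting period
# (an assumption), so no extra sleep is added between iterations.
```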

jingkl avatar Apr 07 '23 06:04 jingkl

/assign @jiaoew1991
set high priority
/unassign

yanliang567 avatar Apr 08 '23 02:04 yanliang567

/assign @aoiasd
/unassign

jiaoew1991 avatar Apr 10 '23 09:04 jiaoew1991

update: image: master-20230506-ad75afdc, pymilvus: 2.4.0dev12

server:

perf-standalone41000-7-8-2694-etcd-0                              1/1     Running     0                64m     10.104.4.93     4am-node11   <none>           <none>
perf-standalone41000-7-8-2694-milvus-standalone-5db766d557nqndm   1/1     Running     4 (6m26s ago)    64m     10.104.17.99    4am-node23   <none>           <none>
perf-standalone41000-7-8-2694-minio-6dbf44f5b8-wqvjg              1/1     Running     0                64m     10.104.5.32     4am-node12   <none>           <none>

log:


[2023-05-07 06:54:41,637 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-05-07 06:53:08.808806', 'RPC error': '2023-05-07 06:54:41.637781'}> (decorators.py:108)
[2023-05-07 06:54:41,640 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusUnavailableException: (code=1, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-05-07 06:53:04.722231', 'RPC error': '2023-05-07 06:54:41.640214'}> (decorators.py:108)
[2023-05-07 06:54:41,640 - WARNING - fouram]: Retry timeout: 60s (decorators.py:79)
[2023-05-07 06:54:41,640 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)>, <Time:{'RPC start': '2023-05-07 06:53:04.715815', 'RPC error': '2023-05-07 06:54:41.640405'}> (decorators.py:108)
[2023-05-07 06:54:41,641 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)> (api_request.py:49)
[2023-05-07 06:54:41,641 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=Retry timeout: 60s, message=server Unavailable: Retry timeout: 60s)> (func_check.py:46)
[2023-05-07 06:54:41,776 - WARNING - fouram]: [search] retry:4, cost: 0.27s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-05-07 06:54:42,048 - WARNING - fouram]: [search] retry:5, cost: 0.81s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-05-07 06:54:42,860 - WARNING - fouram]: [search] retry:6, cost: 2.43s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-05-07 06:54:45,295 - WARNING - fouram]: [search] retry:7, cost: 7.29s, reason: <_InactiveRpcError: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.241.119:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)

jingkl avatar May 08 '23 03:05 jingkl

Memory leak in ConcurrentVector and ZSTD creation; releasing a segment currently cannot release all of its memory.

aoiasd avatar May 08 '23 03:05 aoiasd

Found that if the boost version is greater than 1.80, memory held by boost::container::queue cannot be released, which causes the memory leak in ConcurrentVector.

aoiasd avatar May 09 '23 02:05 aoiasd

Fixed.

aoiasd avatar May 22 '23 02:05 aoiasd

/unassign @aoiasd
/assign @jingkl

jiaoew1991 avatar May 25 '23 03:05 jiaoew1991

release_name_prefix: perf-single-1685521800
deploy_config: fouramf-server-standalone-8c16m
case_params: fouramf-client-sift1m-load-search-release

image: 2.2.0-20230531-ee45c791

grafana: [screenshot 2023-06-01 18:57:42]

I0531 08:40:10.603857     382 request.go:665] Waited for 1.149781982s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/authorization.k8s.io/v1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-single-16821800-7-33-8251-etcd-0                             1/1     Running     0               8m15s   10.104.24.128   4am-node29   <none>           <none>
perf-single-16821800-7-33-8251-milvus-standalone-5c8b5d446phtc5   1/1     Running     0               8m20s   10.104.6.53     4am-node13   <none>           <none>
perf-single-16821800-7-33-8251-minio-87cc54f5-rdbtj               1/1     Running     0               8m20s   10.104.4.168    4am-node11   <none>           <none> (base.py:192)
[2023-05-31 09:42:24,853 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|perf-single-16821800-7-33-8251'  (util_cmd.py:14)
[2023-05-31 09:42:33,296 -  INFO - fouram]: [CliClient] pod details of release(perf-single-16821800-7-33-8251): 
 I0531 09:42:26.105931     475 request.go:665] Waited for 1.161895204s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/garo.tietoevry.com/v1alpha1?timeout=32s
NAME                                                              READY   STATUS        RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-single-16821800-7-33-8251-etcd-0                             1/1     Running       0               70m     10.104.24.128   4am-node29   <none>           <none>
perf-single-16821800-7-33-8251-milvus-standalone-5c8b5d446phtc5   1/1     Running       0               70m     10.104.6.53     4am-node13   <none>           <none>
perf-single-16821800-7-33-8251-minio-87cc54f5-rdbtj               1/1     Running       0               70m     10.104.4.168    4am-node11   <none>           <none> (cli_client.py:131)

jingkl avatar Jun 01 '23 10:06 jingkl

@jiaoew1991 @aoiasd

jingkl avatar Jun 01 '23 10:06 jingkl

[image] Segcore has no leak. Standalone uses RocksDB, and the RocksDB cache will use some memory. But the true memory usage is much larger than what heaptrack reports; this may be caused by malloc.
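
The gap described here (process RSS well above the profiled heap) is easy to observe in any process, because allocators do not always return freed memory to the OS. Below is a generic illustration, assuming Linux and the third-party psutil package; it is not the heaptrack methodology used for this diagnosis:

```python
# Hedged illustration of the heaptrack-vs-RSS gap described above (assumes
# Linux and the third-party psutil package). Freed heap memory is not always
# returned to the OS by the allocator, so process RSS can stay well above the
# live heap that a heap profiler reports.
import os

import psutil


def rss_mib() -> float:
    """Resident set size of this process, in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 2**20


print(f"baseline RSS:    {rss_mib():.0f} MiB")
blobs = [list(range(16)) for _ in range(2_000_000)]  # many small allocations
print(f"after allocate:  {rss_mib():.0f} MiB")
del blobs                                            # heap objects are freed,
print(f"after free:      {rss_mib():.0f} MiB")       # but RSS may stay elevated
```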

aoiasd avatar Jun 16 '23 02:06 aoiasd

/assign @jingkl

aoiasd avatar Jul 04 '23 07:07 aoiasd

image: master-20230712-fc12d399, pymilvus: 2.4.0dev81

This issue has been verified as fixed.

jingkl avatar Jul 13 '23 02:07 jingkl