
[Bug]: [benchmark] growing segment concurrent search error: "failed to search: leader not available: node offline" with Milvus reboot

Open elstic opened this issue 1 year ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.3-20240110-1128b1dd
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

The configuration I added to config.yaml guarantees that segments stay in the growing state:

  dataCoord:
    segment:
      sealProportion: 0.9
      maxIdleTime: 7200

I run continuous searches with scalar filtering after inserting 400k (40w) 128-dimensional vectors into Milvus.
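
For reference, a minimal pymilvus sketch of that workload; the collection name, field names, index parameters, and batch size are assumptions, not taken from the report:

  # Insert 400k ("40w") 128-d vectors plus a float scalar column, keeping
  # everything in growing segments (no flush), then search with a filter.
  import numpy as np
  from pymilvus import (Collection, CollectionSchema, DataType,
                        FieldSchema, connections)

  connections.connect(host="127.0.0.1", port="19530")
  fields = [
      FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
      FieldSchema("score", DataType.FLOAT),               # float scalar column
      FieldSchema("vec", DataType.FLOAT_VECTOR, dim=128),
  ]
  coll = Collection("growing_segment_test", CollectionSchema(fields))

  # Index and load first so that subsequent inserts land in growing segments.
  coll.create_index("vec", {"index_type": "HNSW", "metric_type": "L2",
                            "params": {"M": 8, "efConstruction": 200}})
  coll.load()

  for _ in range(40):                                     # 40 x 10k = 400k rows
      coll.insert([np.random.rand(10_000).tolist(),
                   np.random.rand(10_000, 128).tolist()])

  res = coll.search(np.random.rand(1, 128).tolist(), "vec",
                    {"metric_type": "L2", "params": {"ef": 64}},
                    limit=10, expr="score > 0.5")         # scalar filtering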

server:

new-stable-mast09600-4-68-3801-etcd-0                             1/1     Running     0               2m18s   10.104.27.67    4am-node31   <none>           <none>
new-stable-mast09600-4-68-3801-milvus-standalone-7fc975956xdh4t   1/1     Running     0               2m18s   10.104.25.112   4am-node30   <none>           <none>
new-stable-mast09600-4-68-3801-minio-7559486467-9nh72             1/1     Running     0               2m18s   10.104.16.191   4am-node21   <none>           <none> (base.py:257)
[2024-01-10 22:06:04,192 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|new-stable-mast09600-4-68-3801-milvus|new-stable-mast09600-4-68-3801-minio|new-stable-mast09600-4-68-3801-etcd|new-stable-mast09600-4-68-3801-pulsar|new-stable-mast09600-4-68-3801-kafka'  (util_cmd.py:14)
[2024-01-10 22:06:13,733 -  INFO - fouram]: [CliClient] pod details of release(new-stable-mast09600-4-68-3801): 
 I0110 22:06:05.460316     490 request.go:665] Waited for 1.1665361s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/actions.summerwind.dev/v1alpha1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
new-stable-mast09600-4-68-3801-etcd-0                             1/1     Running     0               4h3m    10.104.27.67    4am-node31   <none>           <none>
new-stable-mast09600-4-68-3801-milvus-standalone-7fc975956xdh4t   1/1     Running     1 (148m ago)    4h3m    10.104.25.112   4am-node30   <none>           <none>
new-stable-mast09600-4-68-3801-minio-7559486467-9nh72             1/1     Running     0               4h3m    10.104.16.191   4am-node21   <none>           <none> (cli_client.py:132)

client error log (LogQL query): {pod=~"new-stable-master-1704909600-2128327225"} |= "ERROR" [screenshot of matching error lines; not reproduced here]

A Milvus reboot occurs; the segment is a growing segment before the reboot and a sealed segment after it.

Expected Behavior

Search should not fail, and Milvus should not reboot.

Steps To Reproduce

  1. Modify config.yaml:

  dataCoord:
    segment:
      sealProportion: 0.9
      maxIdleTime: 7200

  2. Deploy Milvus
  3. Insert 400k (40w) 128-dimensional vectors and a column of float scalars
  4. Perform concurrent searches with scalar filtering (see the sketch after this list)
  5. Step 4 runs for 4 hours
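
A minimal sketch of step 4, reusing coll from the sketch under "Current Behavior"; the 4-hour duration is from the report, while the worker count and search parameters are assumptions (the actual run used the fouram benchmark client):

  # Step 4: concurrent searches with a scalar filter, sustained for 4 hours.
  import time
  import numpy as np
  from concurrent.futures import ThreadPoolExecutor

  def one_search():
      return coll.search(np.random.rand(1, 128).tolist(), "vec",
                         {"metric_type": "L2", "params": {"ef": 64}},
                         limit=10, expr="score > 0.5")

  deadline = time.time() + 4 * 3600
  with ThreadPoolExecutor(max_workers=16) as pool:
      while time.time() < deadline:
          # One wave of 16 concurrent requests; result() re-raises errors
          # such as "leader not available: node offline".
          for f in [pool.submit(one_search) for _ in range(16)]:
              f.result()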

Milvus Log

No response

Anything else?

No response

elstic avatar Jan 11 '24 04:01 elstic

/assign @congqixia /unassign

yanliang567 avatar Jan 12 '24 01:01 yanliang567

Continuous searching during the growing-segment phase still produces occasional search errors.

client pod: new-stable-master-1705600800-2994728836 client error log:

[2024-01-18 18:35:31,116 -  INFO - fouram]:  (stats.py:820)
[2024-01-18 18:35:47,596 - ERROR - fouram]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-01-18 18:34:47.533596', 'gRPC error': '2024-01-18 18:35:47.595973'}> (decorators.py:139)
[2024-01-18 18:35:48,165 - ERROR - fouram]: (api_response) : <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-01-18T18:35:47.587786919+00:00"}"
> (api_request.py:54)
[2024-01-18 18:35:48,165 - ERROR - fouram]: [CheckFunc] search request check failed, response:<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-01-18T18:35:47.587786919+00:00"}"
> (func_check.py:48)
[2024-01-18 18:35:48,168 - ERROR - fouram]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-01-18 18:34:47.535324', 'gRPC error': '2024-01-18 18:35:48.168617'}> (decorators.py:139)
[2024-01-18 18:35:48,169 - ERROR - fouram]: (api_response) : <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-01-18T18:35:47.59056936+00:00", grpc_status:4}"
> (api_request.py:54)
[2024-01-18 18:35:48,169 - ERROR - fouram]: [CheckFunc] search request check failed, response:<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-01-18T18:35:47.59056936+00:00", grpc_status:4}"

server:

new-stable-mast00800-4-51-5682-etcd-0                             1/1     Running           0                2m39s   10.104.15.49    4am-node20   <none>           <none>
new-stable-mast00800-4-51-5682-milvus-standalone-7567bc6596wv85   1/1     Running           0                2m39s   10.104.25.228   4am-node30   <none>           <none>
new-stable-mast00800-4-51-5682-minio-57b9dc878b-tzvmf             1/1     Running           0                2m39s   10.104.27.180   4am-node31   <none>           <none>

client result (a total of 18 failed searches):

'search': {'Requests': 4979423,
          'Fails': 18,
          'RPS': 345.79,
          'fail_s': 0.0,
          'RT_max': 60640.49,
          'RT_avg': 45.48,

test env: 4am cluster

elstic avatar Jan 19 '24 06:01 elstic

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 15 '24 16:06 stale[bot]

This issue has not occurred recently.

elstic avatar Jun 17 '24 02:06 elstic