
[Bug]: Querynode panic with error `set tSafe failed, err = cannot found tSafer, vChannel = xxx`

Open · zhuwenxing opened this issue 1 year ago · 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230419-f14c2211
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior


[2023-04-19T09:31:59.078Z] 2023-04-19 09:31:56.666 | INFO     | MainThread |utils:load_and_search:195 - collection name: task_1_HNSW
[2023-04-19T09:31:59.078Z] 2023-04-19 09:31:56.666 | INFO     | MainThread |utils:load_and_search:196 - load collection
[2023-04-19T09:32:03.234Z] RPC error: [load_collection], <MilvusException: (code=1, message=failed to load collection, err=failed to update next target[context deadline exceeded])>, <Time:{'RPC start': '2023-04-19 09:31:56.666783', 'RPC error': '2023-04-19 09:32:02.565349'}>
[2023-04-19T09:32:03.234Z] Traceback (most recent call last):
[2023-04-19T09:32:03.234Z]   File "scripts/action_before_reinstall.py", line 47, in <module>
[2023-04-19T09:32:03.234Z]     task_1(data_size, host)
[2023-04-19T09:32:03.234Z]   File "scripts/action_before_reinstall.py", line 16, in task_1
[2023-04-19T09:32:03.234Z]     load_and_search(prefix)
[2023-04-19T09:32:03.234Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 199, in load_and_search
[2023-04-19T09:32:03.234Z]     c.load()
[2023-04-19T09:32:03.234Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 366, in load
[2023-04-19T09:32:03.234Z]     conn.load_collection(self._name, replica_number=replica_number, timeout=timeout, **kwargs)
[2023-04-19T09:32:03.234Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-04-19T09:32:03.234Z]     raise e
[2023-04-19T09:32:03.234Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-04-19T09:32:03.234Z]     return func(*args, **kwargs)
[2023-04-19T09:32:03.234Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-04-19T09:32:03.234Z]     ret = func(self, *args, **kwargs)
[2023-04-19T09:32:03.234Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2023-04-19T09:32:03.234Z]     raise e
[2023-04-19T09:32:03.234Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-04-19T09:32:03.234Z]     return func(self, *args, **kwargs)
[2023-04-19T09:32:03.234Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 707, in load_collection
[2023-04-19T09:32:03.234Z]     raise MilvusException(response.error_code, response.reason)
[2023-04-19T09:32:03.234Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=failed to load collection, err=failed to update next target[context deadline exceeded])>


[2023/04/19 09:41:39.546 +00:00] [INFO] [msgdispatcher/manager.go:157] ["dispatcherManager is running..."] [role=querynode] [nodeID=4] [pchannel=by-dev-rootcoord-dml_107]
[2023/04/19 09:41:39.547 +00:00] [INFO] [pipeline/delete_node.go:55] ["pipeline fetch delete msg"] [collectionID=440899049168419421] [partitionID=-1] [insertRowNum=6] [timestampMin=440899265124368386] [timestampMax=440899265124368386]
[2023/04/19 09:41:39.547 +00:00] [DEBUG] [delegator/delegator_data.go:141] ["start to process delete"] [collectionID=440899049168419421] [channel=by-dev-rootcoord-dml_106_440899049168419421v0] [replicaID=440899078333136965] [ts=440899265150582785]
panic: serviceTimeNode setTSafe timeout, collectionID = 440899049168419421, err = set tSafe failed, err = cannot found tSafer, vChannel = by-dev-rootcoord-dml_106_440899049168419421v0

goroutine 127172 [running]:
panic({0x32ba200, 0xc00f8ba740})
	/usr/local/go/src/runtime/panic.go:941 +0x397 fp=0xc0011a3ca0 sp=0xc0011a3be0 pc=0x1259157
github.com/milvus-io/milvus/internal/querynodev2/pipeline.(*deleteNode).Operate(0xc00618aa50, {0x3196f60?, 0xc0013fee40})
	/go/src/github.com/milvus-io/milvus/internal/querynodev2/pipeline/delete_node.go:82 +0x265 fp=0xc0011a3e80 sp=0xc0011a3ca0 pc=0x2cc3805
github.com/milvus-io/milvus/internal/util/pipeline.(*nodeCtx).work(0xc003e8ff40)
	/go/src/github.com/milvus-io/milvus/internal/util/pipeline/node.go:77 +0x20e fp=0xc0011a3fc8 sp=0xc0011a3e80 pc=0x2cc12ee
github.com/milvus-io/milvus/internal/util/pipeline.(*nodeCtx).Start.func1()
	/go/src/github.com/milvus-io/milvus/internal/util/pipeline/node.go:51 +0x26 fp=0xc0011a3fe0 sp=0xc0011a3fc8 pc=0x2cc1046
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0011a3fe8 sp=0xc0011a3fe0 pc=0x128e6a1
created by github.com/milvus-io/milvus/internal/util/pipeline.(*nodeCtx).Start
	/go/src/github.com/milvus-io/milvus/internal/util/pipeline/node.go:51 +0x7b

goroutine 1 [chan receive, 15 minutes]:
runtime.gopark(0xc000639728?, 0xc000639780?, 0x13?, 0xae?, 0xc000639780?)
	/usr/local/go/src/runtime/proc.go:361 +0xd6 fp=0xc00185f750 sp=0xc00185f730 pc=0x125c3b6
runtime.chanrecv(0xc000478f00, 0xc000639be8, 0x1)
	/usr/local/go/src/runtime/chan.go:577 +0x56c fp=0xc00185f7e0 sp=0xc00185f750 pc=0x1225aac
runtime.chanrecv1(0xc000478f00?, 0xc000639db8?)
	/usr/local/go/src/runtime/chan.go:440 +0x18 fp=0xc00185f808 sp=0xc00185f7e0 pc=0x12254d8
github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc000639e58, 0x0, {0xc000a30070?, 0xd?})
	/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:336 +0xa6c fp=0xc00185fe08 sp=0xc00185f808 pc=0x2e52e4c
github.com/milvus-io/milvus/cmd/milvus.(*run).execute(0xc000a2a210, {0xc000216060?, 0x3, 0x3}, 0xc000848e40)
	/go/src/github.com/milvus-io/milvus/cmd/milvus/run.go:112 +0x66e fp=0xc00185fee0 sp=0xc00185fe08 pc=0x2e5f4ce
github.com/milvus-io/milvus/cmd/milvus.RunMilvus({0xc000216060?, 0x3, 0x3})
	/go/src/github.com/milvus-io/milvus/cmd/milvus/milvus.go:60 +0x21e fp=0xc00185ff58 sp=0xc00185fee0 pc=0x2e5ed5e
main.main()
	/go/src/github.com/milvus-io/milvus/cmd/main.go:26 +0x2e fp=0xc00185ff80 sp=0xc00185ff58 pc=0x2e61bae
runtime.main()
	/usr/local/go/src/runtime/proc.go:250 +0x212 fp=0xc00185ffe0 sp=0xc00185ff80 pc=0x125bff2
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00185ffe8 sp=0xc00185ffe0 pc=0x128e6a1
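
For context on the panic message, the sketch below is a minimal, hypothetical Go illustration of a per-vChannel tSafe registry; the names `tSafeRegistry`, `Register`, `Remove`, and `Set` are invented for this sketch and are not Milvus's internal API. The point is only that setting a tSafe fails when the entry for the vChannel was never registered or has already been removed, which is the error the `deleteNode` panic above wraps.

```go
// Hypothetical sketch (stdlib only, not Milvus code) of a per-vChannel tSafe map.
package main

import (
	"fmt"
	"sync"
)

type tSafeRegistry struct {
	mu     sync.RWMutex
	tsafes map[string]uint64 // vChannel -> latest tSafe timestamp
}

func newTSafeRegistry() *tSafeRegistry {
	return &tSafeRegistry{tsafes: make(map[string]uint64)}
}

// Register creates the tSafe entry for a vChannel, normally done by the watch event.
func (r *tSafeRegistry) Register(vchannel string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.tsafes[vchannel] = 0
}

// Remove drops the entry, normally done by the unsub event.
func (r *tSafeRegistry) Remove(vchannel string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.tsafes, vchannel)
}

// Set updates the tSafe and returns an error when the vChannel is unknown.
func (r *tSafeRegistry) Set(vchannel string, ts uint64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.tsafes[vchannel]; !ok {
		return fmt.Errorf("set tSafe failed: cannot find tSafer, vChannel = %s", vchannel)
	}
	r.tsafes[vchannel] = ts
	return nil
}

func main() {
	r := newTSafeRegistry()
	r.Register("by-dev-rootcoord-dml_106_xxxv0")
	r.Remove("by-dev-rootcoord-dml_106_xxxv0") // a lingering unsub removes the entry
	if err := r.Set("by-dev-rootcoord-dml_106_xxxv0", 440899265150582785); err != nil {
		// the delete node in the stack trace above panics on this kind of error
		fmt.Println("setTSafe error:", err)
	}
}
```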

Expected Behavior

All test cases pass.

Steps To Reproduce

Deploy Milvus.
cd to `tests/python_client/deploy` and run the test cases:
* pytest testcases/test_action_first_deployment.py -n 2 --host 10.101.50.86
* python3 scripts/first_recall_test.py --host 10.101.50.86
* python3 scripts/action_before_reinstall.py --host 10.101.50.86 --data_size 3000

cd to `tests/python_client/chaos` and run the test cases:
* pytest testcases/test_concurrent_operation.py --count 5 -n 5 --host 10.101.50.86

Milvus Log

failed job: deploy_test_cron/detail/deploy_test_cron/662
log: artifacts-pulsar-cluster-reinstall-662-server-logs (1).tar.gz

artifacts-pulsar-cluster-reinstall-662-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing avatar Apr 20 '23 02:04 zhuwenxing

/assign @jiaoew1991 /unassign

yanliang567 avatar Apr 20 '23 02:04 yanliang567

/unassign @jiaoew1991 /assign @aoiasd

congqixia avatar Apr 20 '23 03:04 congqixia

UnsubChannel at the DataNode removes the delegator first, at which point DataCoord marks the job as completed and may watch this channel again, even though the UnsubChannel task at the DataNode is in fact still running to remove the flow graph and the tSafe manager. As a result, if the channel is watched again immediately, the watch event runs concurrently with the still-running unsub event, and resources newly created for the watch (the tSafe manager in this issue) can be removed by the unsub event before the watch takes effect (see the sketch after this comment).

aoiasd avatar Apr 20 '23 09:04 aoiasd
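
Below is a minimal, hypothetical Go sketch (stdlib only, not the actual fix that landed in Milvus) of one way to neutralize the race described above: each watch bumps a per-channel epoch, and an unsub only tears down resources if the epoch it captured is still current, so a lingering unsub cannot remove the tSafe manager created by a newer watch. The names `channelManager`, `Watch`, and `Unsub` are illustrative only.

```go
// Hypothetical sketch: make a lingering unsub a no-op via per-channel epochs.
package main

import (
	"fmt"
	"sync"
)

type channelManager struct {
	mu     sync.Mutex
	epochs map[string]int64 // vChannel -> epoch of the latest watch
}

func newChannelManager() *channelManager {
	return &channelManager{epochs: make(map[string]int64)}
}

// Watch (re)creates the channel's resources and returns the epoch of this watch.
func (m *channelManager) Watch(vchannel string) int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.epochs[vchannel]++
	// ... create flow graph, tSafe manager, delegator for vchannel here ...
	return m.epochs[vchannel]
}

// Unsub removes the resources only if no newer watch has happened since `epoch`,
// so a slow unsub left over from a previous watch cannot tear down a fresh one.
func (m *channelManager) Unsub(vchannel string, epoch int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.epochs[vchannel] != epoch {
		fmt.Printf("skip stale unsub for %s (epoch %d, current %d)\n",
			vchannel, epoch, m.epochs[vchannel])
		return
	}
	// ... remove flow graph, tSafe manager, delegator for vchannel here ...
	delete(m.epochs, vchannel)
	fmt.Printf("unsubscribed %s\n", vchannel)
}

func main() {
	m := newChannelManager()
	oldEpoch := m.Watch("by-dev-rootcoord-dml_106_xxxv0")
	m.Watch("by-dev-rootcoord-dml_106_xxxv0")           // re-watch arrives before the unsub finishes
	m.Unsub("by-dev-rootcoord-dml_106_xxxv0", oldEpoch) // stale unsub is ignored
}
```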

/assign @yah01 will fix this

aoiasd avatar Apr 20 '23 09:04 aoiasd

@aoiasd: GitHub didn't allow me to assign the following users: will, fix, this.

Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @yah01 will fix this

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sre-ci-robot avatar Apr 20 '23 09:04 sre-ci-robot

/assign @zhuwenxing plz check with #23580

yah01 avatar Apr 20 '23 14:04 yah01

Not reproduced in master-20230427-66a5efeb

zhuwenxing avatar Apr 28 '23 03:04 zhuwenxing