milvus icon indicating copy to clipboard operation
milvus copied to clipboard

[Bug]: Proxy pod can not recover with error `Failed to enqueue delete task: syncTimestamp Failed:context canceled` after etcd pod kill chaos test

Open zhuwenxing opened this issue 2 years ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220820-9589bd3
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.2.0.dev19
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

running test_e2e.py failed with error Fail connecting to server on 10.101.48.180:19530

image

[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/impl.go:2365] ["Failed to enqueue insert task: syncTimestamp Failed:context canceled"]
[2022/08/21 13:58:55.668 +00:00] [INFO] [proxy/impl.go:2368] ["Finish processing insert request in Proxy"] [traceID=5cba86146820479d]
[2022/08/21 13:58:55.668 +00:00] [ERROR] [proxy/impl.go:2467] ["Failed to enqueue delete task: syncTimestamp Failed:context canceled"] [traceID=19bb2e3081bcbf1] [stack="github.com/milvus-io/milvus/internal/proxy.(*Proxy).Delete\n\t/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:2467\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).Delete\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:666\ngithub.com/milvus-io/milvus/internal/proto/milvuspb._MilvusService_Delete_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/milvuspb/milvus.pb.go:8626\ngithub.com/milvus-io/milvus/internal/proxy.UnaryServerInterceptor.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proxy/privilege_interceptor.go:67\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/auth.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/auth/auth.go:47\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/milvuspb._MilvusService_Delete_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/milvuspb/milvus.pb.go:8628\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]
[2022/08/21 13:58:55.668 +00:00] [INFO] [proxy/impl.go:2471] ["Finish processing delete request in Proxy"] [traceID=19bb2e3081bcbf1]
[2022/08/21 13:58:55.668 +00:00] [ERROR] [proxy/impl.go:2467] ["Failed to enqueue delete task: syncTimestamp Failed:context canceled"] [traceID=7b52aa1ad2f02e54] [stack="github.com/milvus-io/milvus/internal/proxy.(*Proxy).Delete\n\t/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:2467\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).Delete\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:666\ngithub.com/milvus-io/milvus/internal/proto/milvuspb._MilvusService_Delete_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/milvuspb/milvus.pb.go:8626\ngithub.com/milvus-io/milvus/internal/proxy.UnaryServerInterceptor.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proxy/privilege_interceptor.go:67\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/auth.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/auth/auth.go:47\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tracing/opentracing/server_interceptors.go:38\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/internal/proto/milvuspb._MilvusService_Delete_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/milvuspb/milvus.pb.go:8628\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:922"]
[2022/08/21 13:58:55.668 +00:00] [INFO] [proxy/impl.go:2471] ["Finish processing delete request in Proxy"] [traceID=7b52aa1ad2f02e54]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659157828]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659157793]
[2022/08/21 13:58:55.668 +00:00] [INFO] [proxy/proxy.go:354] ["close id allocator"] [role=proxy]
[2022/08/21 13:58:55.668 +00:00] [INFO] [proxy/proxy.go:359] ["close segment id assigner"] [role=proxy]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659157880]
[2022/08/21 13:58:55.668 +00:00] [INFO] [proxy/proxy.go:364] ["close scheduler"] [role=proxy]
[2022/08/21 13:58:55.668 +00:00] [INFO] [proxy/proxy.go:372] ["close channels time ticker"] [role=proxy]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659157856]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659157895]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659157912]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659157748]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158612]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158005]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158623]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158601]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158206]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158641]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158652]
[2022/08/21 13:58:55.668 +00:00] [DEBUG] [proxy/task_scheduler.go:257] ["Proxy dmTaskQueue popPChanStats"] [taskID=435444694659158663]
{"level":"warn","ts":"2022-08-21T13:58:55.723Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00079a8c0/#initially=[etcd-pod-kill-600:2379]","attempt":0,"error":"rpc error: code = NotFound desc = etcdserver: requested lease not found"}

Expected Behavior

all test cases passed

Steps To Reproduce

see flow graph in `Anything else` below.

Milvus Log

artifacts-etcd-pod-kill-600-server-logs (1).tar.gz

artifacts-etcd-pod-kill-600-pytest-logs.tar.gz

Anything else?

image

zhuwenxing avatar Aug 22 '22 03:08 zhuwenxing

/assign @soothing-rain /unassign

yanliang567 avatar Aug 23 '22 01:08 yanliang567

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 22 '22 04:09 stale[bot]

/reopen

It appears that proxy failed to connect RootCoord and was killed later.

[2022/08/21 13:58:55.630 +00:00] [WARN] [grpcclient/client.go:251] ["rootcoord ClientBase ReCall grpc first call get error "] [error="err: failed to connect 10.102.7.141:53100, reason: context deadline exceeded\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:250 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/client/client.go:350 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).AllocTimestamp\n/go/src/github.com/milvus-io/milvus/internal/proxy/timestamp.go:61 github.com/milvus-io/milvus/internal/proxy.(*timestampAllocator).alloc\n/go/src/github.com/milvus-io/milvus/internal/proxy/timestamp.go:84 github.com/milvus-io/milvus/internal/proxy.(*timestampAllocator).AllocOne\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:173 github.com/milvus-io/milvus/internal/proxy.(*baseTaskQueue).Enqueue\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:2802 github.com/milvus-io/milvus/internal/proxy.(*Proxy).Query\n/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:678 github.com/milvus-io/milvus/internal/distributed/proxy.(*Server).Query\n/go/src/github.com/milvus-io/milvus/internal/proto/milvuspb/milvus.pb.go:8680 github.com/milvus-io/milvus/internal/proto/milvuspb._MilvusService_Query_Handler.func1\n/go/src/github.com/milvus-io/milvus/internal/proxy/privilege_interceptor.go:67 github.com/milvus-io/milvus/internal/proxy.UnaryServerInterceptor.func1\n"]
[2022/08/21 13:58:55.630 +00:00] [WARN] [proxy/impl.go:2803] ["Query failed to enqueue"] [error="syncTimestamp Failed:context deadline exceeded"] [traceID=6134af0a7ff8b08b] [role=proxy] [db=] [collection=Checker__eo6hlrN8] [partitions="[]"]
[2022/08/21 13:58:55.663 +00:00] [ERROR] [roles/roles.go:527] ["Get signal to exit\n"] [signal=interrupt] [stack="github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:527\ngithub.com/milvus-io/milvus/cmd/milvus.(*run).execute\n\t/go/src/github.com/milvus-io/milvus/cmd/milvus/run.go:111\ngithub.com/milvus-io/milvus/cmd/milvus.RunMilvus\n\t/go/src/github.com/milvus-io/milvus/cmd/milvus/milvus.go:60\nmain.main\n\t/go/src/github.com/milvus-io/milvus/cmd/main.go:26\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
[2022/08/21 13:58:55.663 +00:00] [DEBUG] [proxy/service.go:503] ["Proxy stop"] ["internal address"=10.102.9.65:19529] ["external address"=10.102.9.65:19529]

I am also confused that the logs above are found in pod_log_previous directory, but the proxy logs in pod_log directory seem alright:

[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/service.go:293] ["init Proxy server done"]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/service.go:295] ["start Proxy server"]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:303] ["start task scheduler"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:308] ["start task scheduler done"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:310] ["start id allocator"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:315] ["start id allocator done"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:317] ["start segment id assigner"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:322] ["start segment id assigner done"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:324] ["start channels time ticker"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:329] ["start channels time ticker done"] [role=proxy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [proxy/proxy.go:342] ["update state code"] [role=proxy] [State=Healthy]
[2022/08/21 13:58:57.128 +00:00] [DEBUG] [sessionutil/session_util.go:200] ["service begin to register to etcd"] [serverName=proxy] [ServerID=22]
[2022/08/21 13:58:57.131 +00:00] [INFO] [sessionutil/session_util.go:241] ["Service registered successfully"] [ServerName=proxy] [serverID=22]
[2022/08/21 13:58:57.131 +00:00] [DEBUG] [proxy/service.go:300] ["start Proxy server done"]
[2022/08/21 13:58:57.131 +00:00] [INFO] [components/proxy.go:54] ["Proxy successfully started"]

soothing-rain avatar Sep 24 '22 08:09 soothing-rain

/assign @zhuwenxing /unassign

soothing-rain avatar Sep 24 '22 08:09 soothing-rain

Not reproduced, so close it.

zhuwenxing avatar Oct 11 '22 12:10 zhuwenxing