GraphScope
Better handling of dead pods
Is your feature request related to a problem? Please describe.
- Create a cluster with N>1 nodes
- Start a session
- Start any process, such as adding edges.
- Before the process finishes, simulate a node dying unexpectedly (for example, kubectl delete pod --force gs-engine-graphscope-gcz2f).
- The command returns an error (expected behavior).
- Kubernetes recovers the pod (expected behavior).
- The coordinator cannot recover from this: the graph object and the session become unresponsive, and closing and recreating the session does not help.
Describe the solution you'd like
Once Kubernetes restarts the pod, the coordinator should recover to a usable state.
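To make the requested behavior concrete, here is a minimal sketch of a retry loop a coordinator could wrap around an engine step. It is only an illustration under stated assumptions, not GraphScope's actual coordinator code: `EngineUnavailable`, `run_with_recovery`, `relaunch_engine`, and `pods_ready` are all hypothetical names, and the sketch assumes the failed step is safe to run again.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class EngineUnavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE error from the engine."""


def run_with_recovery(
    step: Callable[[], T],
    relaunch_engine: Callable[[], None],
    pods_ready: Callable[[], bool],
    max_attempts: int = 3,
    poll_interval: float = 1.0,
    ready_timeout: float = 60.0,
) -> T:
    """Run `step`; on EngineUnavailable, wait for pods, relaunch, and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except EngineUnavailable:
            if attempt == max_attempts:
                # Out of attempts: surface the original error to the caller.
                raise
            # Wait until Kubernetes reports the replacement pod as ready.
            deadline = time.monotonic() + ready_timeout
            while not pods_ready():
                if time.monotonic() >= deadline:
                    raise TimeoutError("engine pods did not become ready")
                time.sleep(poll_interval)
            # Restart the engine processes against the refreshed pod list
            # (hypothetical callback; the real relaunch would be more involved).
            relaunch_engine()
    raise AssertionError("unreachable")
```

In this sketch, `pods_ready` would be backed by a Kubernetes readiness check and `relaunch_engine` by whatever restarts the mpirun job with the new pod addresses. Any graph data that lived only on the dead pod would still have to be reloaded by the retried step.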
Describe alternatives you've considered
Currently we take down the whole deployment and reinstall it.
Additional context
Logs for the example above:
2022-07-13 21:46:12,651 [INFO][cluster:695]: Launching etcd ...
2022-07-13 21:46:13,853 [INFO][cluster:906]: Etcd created, endpoint is 10.100.153.69:58255
2022-07-13 21:46:13,853 [INFO][cluster:927]: Creating interactive engine service...
2022-07-13 21:46:13,853 [INFO][cluster:855]: Launching zetcd proxy service ...
2022-07-13 21:46:13,853 [INFO][cluster:867]: zetcd cmd /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/bin/zetcd --zkaddr 0.0.0.0:2181 --endpoints http://gs-etcd-service-graphscope:58255,http://gs-etcd-graphscope-0:58255,http://gs-etcd-graphscope-1:58255,http://gs-etcd-graphscope-2:58255
Running zetcd proxy
Version: Version not provided (use make instead of go build)
SHA: SHA not provided (use make instead of go build)
2022-07-13 21:46:14,859 [INFO][cluster:896]: ZEtcd is ready, endpoint is 192.168.8.112:2181
2022-07-13 21:46:14,859 [INFO][cluster:934]: Creating engine replicaset...
2022-07-13 21:46:14,859 [INFO][cluster:534]: Launching GraphScope engines pod ...
2022-07-13 21:46:17,277 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Successfully assigned default/gs-engine-graphscope-95c2n to ip-192-168-48-7.eu-west-3.compute.internal
2022-07-13 21:46:17,277 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Pulling image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0"
2022-07-13 21:46:18,296 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Successfully assigned default/gs-engine-graphscope-gcz2f to ip-192-168-29-15.eu-west-3.compute.internal
2022-07-13 21:46:18,296 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Pulling image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0"
2022-07-13 21:46:18,580 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 2.301893254s
2022-07-13 21:46:18,622 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Created container engine
2022-07-13 21:46:18,699 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Started container engine
2022-07-13 21:46:19,288 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Successfully assigned default/gs-engine-graphscope-qrhsd to ip-192-168-20-242.eu-west-3.compute.internal
2022-07-13 21:46:19,288 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Pulling image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0"
2022-07-13 21:46:19,289 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 1.992690362s
2022-07-13 21:46:19,289 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Created container engine
2022-07-13 21:46:19,290 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Started container engine
2022-07-13 21:46:20,298 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Successfully assigned default/gs-engine-graphscope-sb97w to ip-192-168-70-200.eu-west-3.compute.internal
2022-07-13 21:46:20,298 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Pulling image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0"
2022-07-13 21:46:20,299 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 2.505451138s
2022-07-13 21:46:20,299 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Created container engine
2022-07-13 21:46:20,300 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Started container engine
2022-07-13 21:46:20,589 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 2.030411108s
2022-07-13 21:46:20,616 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Created container vineyard
2022-07-13 21:46:20,721 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Started container vineyard
2022-07-13 21:46:23,525 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 2.449263704s
2022-07-13 21:46:23,526 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Created container engine
2022-07-13 21:46:23,526 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Started container engine
2022-07-13 21:46:23,527 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 1.978817982s
2022-07-13 21:46:23,527 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Created container vineyard
2022-07-13 21:46:23,528 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Started container vineyard
2022-07-13 21:46:24,532 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 2.158486027s
2022-07-13 21:46:24,532 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Created container vineyard
2022-07-13 21:46:24,533 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Started container vineyard
2022-07-13 21:46:25,539 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Successfully pulled image "registry.cn-hongkong.aliyuncs.com/graphscope/graphscope:0.14.0" in 2.089621532s
2022-07-13 21:46:25,540 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Created container vineyard
2022-07-13 21:46:25,540 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Started container vineyard
2022-07-13 21:46:29,842 [INFO][cluster:987]: [gs-engine-graphscope-95c2n]: Readiness probe failed:
2022-07-13 21:46:30,726 [INFO][cluster:987]: [gs-engine-graphscope-gcz2f]: Readiness probe failed:
2022-07-13 21:46:31,759 [INFO][cluster:987]: [gs-engine-graphscope-qrhsd]: Readiness probe failed:
2022-07-13 21:46:32,735 [INFO][cluster:987]: [gs-engine-graphscope-sb97w]: Readiness probe failed:
2022-07-13 21:46:42,371 [INFO][cluster:1025]: GraphScope engines pod is ready.
2022-07-13 21:46:42,375 [INFO][cluster:1172]: Engines pod name list: ['gs-engine-graphscope-95c2n', 'gs-engine-graphscope-gcz2f', 'gs-engine-graphscope-qrhsd', 'gs-engine-graphscope-sb97w']
2022-07-13 21:46:42,375 [INFO][cluster:1173]: Engines pod ip list: ['192.168.49.184', '192.168.15.183', '192.168.21.224', '192.168.67.171']
2022-07-13 21:46:42,375 [INFO][cluster:1174]: Engines pod host ip list: ['192.168.48.7', '192.168.29.15', '192.168.20.242', '192.168.70.200']
2022-07-13 21:46:42,375 [INFO][cluster:1175]: Vineyard service endpoint: 192.168.48.7:32346
2022-07-13 21:46:42,375 [INFO][cluster:1049]: Starting GAE rpc service on 192.168.49.184:56773 ...
2022-07-13 21:46:43,468 [INFO][cluster:1095]: Analytical engine launching command: /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/openmpi/bin/mpirun --allow-run-as-root -n 4 -host gs-engine-graphscope-95c2n:1.0,gs-engine-graphscope-gcz2f:1.0,gs-engine-graphscope-qrhsd:1.0,gs-engine-graphscope-sb97w:1.0 /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/bin/grape_engine --host 0.0.0.0 --port 56773 --vineyard_shared_mem 245Gi -v 1 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
2022-07-13 21:46:43,474 [INFO][coordinator:197]: Java initial class path set to: /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/lib/grape-runtime-0.1-shaded.jar
2022-07-13 21:46:43,477 [INFO][coordinator:1742]: Coordinator server listen at 0.0.0.0:59001
192.168.49.184 gs-engine-graphscope-95c2n
192.168.15.183 gs-engine-graphscope-gcz2f
192.168.21.224 gs-engine-graphscope-qrhsd
192.168.67.171 gs-engine-graphscope-sb97w
192.168.49.184 gs-engine-graphscope-95c2n
192.168.15.183 gs-engine-graphscope-gcz2f
192.168.21.224 gs-engine-graphscope-qrhsd
192.168.67.171 gs-engine-graphscope-sb97w
192.168.49.184 gs-engine-graphscope-95c2n
192.168.15.183 gs-engine-graphscope-gcz2f
192.168.21.224 gs-engine-graphscope-qrhsd
192.168.67.171 gs-engine-graphscope-sb97w
192.168.49.184 gs-engine-graphscope-95c2n
192.168.15.183 gs-engine-graphscope-gcz2f
192.168.21.224 gs-engine-graphscope-qrhsd
192.168.67.171 gs-engine-graphscope-sb97w
I0713 21:46:44.000000 87 /work/analytical_engine/core/grape_instance.cc:86] Workers of grape-engine initialized.
I0713 21:46:44.000000 90 /work/analytical_engine/core/server/analytical_server.cc:43] Analytical server is listening on 0.0.0.0:56773
I0713 21:48:37.000000 91 /work/analytical_engine/core/grape_instance.cc:1178] Registering Graph, graph type: ARROW_PROPERTY, Type sigature: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521, lib path: /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/precompiled/builtin/e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521/libe33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521.so
I0713 21:48:37.000000 90 /work/analytical_engine/core/grape_instance.cc:1178] Registering Graph, graph type: ARROW_PROPERTY, Type sigature: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521, lib path: /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/precompiled/builtin/e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521/libe33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521.so
I0713 21:48:37.000000 90 /work/analytical_engine/core/grape_instance.cc:1178] Registering Graph, graph type: ARROW_PROPERTY, Type sigature: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521, lib path: /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/precompiled/builtin/e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521/libe33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521.so
I0713 21:48:37.000000 90 /work/analytical_engine/core/grape_instance.cc:1178] Registering Graph, graph type: ARROW_PROPERTY, Type sigature: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521, lib path: /home/graphscope/.local/lib/python3.8/site-packages/graphscope.runtime/precompiled/builtin/e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521/libe33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521.so
I0713 21:48:37.000000 90 /work/analytical_engine/core/grape_instance.cc:143] Loading graph, graph name: graph_a37JncCH, graph type: ArrowFragment, type sig: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521
I0713 21:48:37.000000 91 /work/analytical_engine/core/grape_instance.cc:143] Loading graph, graph name: graph_a37JncCH, graph type: ArrowFragment, type sig: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521
I0713 21:48:37.000000 90 /work/analytical_engine/core/grape_instance.cc:143] Loading graph, graph name: graph_a37JncCH, graph type: ArrowFragment, type sig: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521
I0713 21:48:37.000000 90 /work/analytical_engine/core/grape_instance.cc:143] Loading graph, graph name: graph_a37JncCH, graph type: ArrowFragment, type sig: e33529e80839a2064a804ce453c761a9483aa7ab775bcfddc1a1f9da63dcb521
Loading empty graph: 100%|██████████| 10/10 [00:00<00:00, 45.89it/s]
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[761,0],0] on node graphscope-coordinator-7654445dfb-czjs7
Remote daemon: [[761,0],2] on node gs-engine-graphscope-gcz2f
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
command terminated with exit code 137
Loading edge labeled edge: 0%| | 0/10 [00:00<?, ?it/s]
2022-07-13 21:49:47,476 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: Socket closed
2022-07-13 21:49:47,476 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: Socket closed
2022-07-13 21:51:59,985 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: failed to connect to all addresses
2022-07-13 21:51:59,985 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: failed to connect to all addresses
2022-07-13 21:52:00,059 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: failed to connect to all addresses
2022-07-13 21:52:00,059 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: failed to connect to all addresses
2022-07-13 21:52:04,587 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: failed to connect to all addresses
2022-07-13 21:52:04,587 [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: failed to connect to all addresses
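
For triaging logs like the ones above, the RunStep failures can be tallied by status code and detail message. The following is a minimal sketch that assumes the coordinator log line format shown in this issue; `summarize_runstep_failures` is a hypothetical helper, not part of GraphScope.

```python
import re
from collections import Counter

# Matches coordinator lines such as:
#   ... [ERROR][coordinator:432]: Engine RunStep failed, code: UNAVAILABLE, details: Socket closed
# `search` is used so a line still matches when progress-bar output precedes it.
RUNSTEP_FAILURE = re.compile(
    r"\[ERROR\]\[coordinator:\d+\]: Engine RunStep failed, "
    r"code: (?P<code>\w+), details: (?P<details>.*)$"
)


def summarize_runstep_failures(log_text: str) -> Counter:
    """Count RunStep failures, keyed by (status code, details)."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = RUNSTEP_FAILURE.search(line)
        if match:
            counts[(match["code"], match["details"].strip())] += 1
    return counts
```

Applied to the log above, a summary like this makes the transition visible: the first failures report a closed socket, and every later one reports that no engine address can be reached.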