GraphScope
[BUG] Failed to start GraphScope Analytical Engine on EKS with num_workers=2
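Describe the bug The Analytical Engine fails to start. The coordinator's connection to the engine is refused, and Open MPI cannot open a TCP connection between the two engine pods.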
To Reproduce Steps to reproduce the behavior:
- Install the GraphScope Helm chart with the following values:
engines:
  num_workers: 2
  enabled_engines: "analytical,interactive"
- Run:
import graphscope
sess = graphscope.session(addr='coordinator-service-graphscope:59001')
Expected behavior The GraphScope session starts successfully.
Environment (please complete the following information):
- GraphScope version: 0.25.0
- OS: macOS
- Kubernetes Version: 1.28.2-eks
Additional context Log
2023-11-13 11:49:15,556 [INFO][session:597]: Connecting graphscope session with address: localhost:59001
2023-11-13 11:49:15,568 [INFO][rpc:69]: GraphScope coordinator service connected.
2023-11-13 03:49:15,570 [INFO][utils:321]: Running command: kubectl cp /tmp/hosts_of_nodes gs-engine-graphscope-0:/tmp/hosts_of_nodes -c engine --retries=5, cwd: None
2023-11-13 03:49:15,897 [DEBUG][kubernetes_launcher:1027]:
2023-11-13 03:49:15,897 [INFO][utils:321]: Running command: kubectl cp /tmp/hosts_of_nodes gs-engine-graphscope-1:/tmp/hosts_of_nodes -c engine --retries=5, cwd: None
2023-11-13 03:49:16,172 [DEBUG][kubernetes_launcher:1027]:
2023-11-13 03:49:16,630 [DEBUG][utils:1999]: Resolve mpi cmd prefix: /usr/bin/mpirun --allow-run-as-root --bind-to none -n 2 -host gs-engine-graphscope-0:1.0,gs-engine-graphscope-1:1.0
2023-11-13 03:49:16,630 [DEBUG][utils:2000]: Resolve mpi env: {"OMPI_MCA_btl_vader_single_copy_mechanism": "none", "OMPI_MCA_orte_allowed_exit_without_sync": "1", "OMPI_MCA_odls_base_sigkill_timeout": "0", "OMPI_MCA_plm_rsh_agent": "/usr/local/bin/kube_ssh"}
2023-11-13 03:49:16,630 [INFO][kubernetes_launcher:1043]: Analytical engine launching command: /usr/bin/mpirun --allow-run-as-root --bind-to none -n 2 -host gs-engine-graphscope-0:1.0,gs-engine-graphscope-1:1.0 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56769 -v 10 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
2023-11-13 03:49:16,632 [INFO][kubernetes_launcher:1077]: GAE rpc service is listening on 100.64.108.192:56769 ...
2023-11-13 03:49:16,638 [WARNING][op_executor:356]: Connecting to analytical engine... tried 1 time, will retry in 2 seconds
2023-11-13 03:49:16,638 [WARNING][op_executor:361]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses; last error: UNKNOWN: ipv4:100.64.108.192:56769: Failed to connect to remote host: Connection refused
100.64.108.192 gs-engine-graphscope-0
100.64.99.64 gs-engine-graphscope-1
100.64.108.192 gs-engine-graphscope-0
100.64.99.64 gs-engine-graphscope-1
2023-11-13 03:49:18,641 [WARNING][op_executor:356]: Connecting to analytical engine... tried 2 time, will retry in 4 seconds
2023-11-13 03:49:18,642 [WARNING][op_executor:361]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses; last error: UNKNOWN: ipv4:100.64.108.192:56769: Failed to connect to remote host: Connection refused
2023-11-13 03:49:22,645 [WARNING][op_executor:356]: Connecting to analytical engine... tried 3 time, will retry in 8 seconds
2023-11-13 03:49:22,645 [WARNING][op_executor:361]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses; last error: UNKNOWN: ipv4:100.64.108.192:56769: Failed to connect to remote host: Connection refused
2023-11-13 03:49:30,654 [WARNING][op_executor:356]: Connecting to analytical engine... tried 4 time, will retry in 16 seconds
2023-11-13 03:49:30,654 [WARNING][op_executor:361]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses; last error: UNKNOWN: ipv4:100.64.108.192:56769: Failed to connect to remote host: Connection refused
2023-11-13 03:49:46,667 [WARNING][op_executor:356]: Connecting to analytical engine... tried 5 time, will retry in 32 seconds
2023-11-13 03:49:46,667 [WARNING][op_executor:361]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses; last error: UNKNOWN: ipv4:100.64.108.192:56769: Failed to connect to remote host: Connection refused
2023-11-13 03:50:18,699 [WARNING][op_executor:356]: Connecting to analytical engine... tried 6 time, will retry in 64 seconds
2023-11-13 03:50:18,699 [WARNING][op_executor:361]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses; last error: UNKNOWN: ipv4:100.64.108.192:56769: Failed to connect to remote host: Connection refused
2023-11-13 03:51:22,764 [WARNING][op_executor:356]: Connecting to analytical engine... tried 7 time, will retry in 128 seconds
2023-11-13 03:51:22,764 [WARNING][op_executor:361]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses; last error: UNKNOWN: ipv4:100.64.108.192:56769: Failed to connect to remote host: Connection refused
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: gs-engine-graphscope-1
PID: 1159
Message: connect() to 100.64.108.192:1024 failed
Error: Operation now in progress (115)
--------------------------------------------------------------------------
[coordinator-graphscope-6d768c9494-wn4jj:00045] 1 more process has sent help message help-mpi-btl-tcp.txt / client connect fail
[coordinator-graphscope-6d768c9494-wn4jj:00045] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
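The log shows two distinct connection failures: the coordinator is refused on 100.64.108.192:56769, and Open MPI on gs-engine-graphscope-1 cannot reach 100.64.108.192:1024. Below is a minimal sketch for checking whether those ports are reachable from a given pod. It assumes a Python interpreter is available where it runs. `can_connect` and `check_targets` are hypothetical helper names, not part of GraphScope.

```python
import socket
from typing import Dict, Iterable, Tuple

Target = Tuple[str, int]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_targets(targets: Iterable[Target], timeout: float = 3.0) -> Dict[Target, bool]:
    """Hypothetical helper: map each (host, port) pair to its reachability."""
    return {(host, port): can_connect(host, port, timeout) for host, port in targets}
```

Calling `check_targets([("100.64.108.192", 56769), ("100.64.108.192", 1024)])` from inside gs-engine-graphscope-1 would indicate whether pod-to-pod TCP traffic on the ports named in the log gets through. A blocked path (for example a security group or network policy) is only one possible cause and is not confirmed by the log. Port 56769 may also be refused simply because the engine never finished starting.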