
ssh failed sometimes

Open zhenxiu opened this issue 4 years ago • 9 comments

When I run an MPI job, ssh fails with: ssh: Could not resolve hostname ***-job: Name or service not known.

This happens often, especially with more workers.

zhenxiu avatar Jan 18 '22 08:01 zhenxiu

@zhenxiu thanks for reporting. Could you provide some basic information, such as the MPI job YAML, the Volcano version, and the steps to reproduce?

william-wang avatar Jan 19 '22 01:01 william-wang

I use v1.2.0 and the official MPI YAML. The only difference is that I use my own image with MPI installed. @william-wang

zhenxiu avatar Jan 19 '22 02:01 zhenxiu

Maybe you can check whether all the workers are running when the master begins to run. Sometimes a worker may still be pulling Docker images.

yongqiangz avatar Jan 23 '22 04:01 yongqiangz

> Maybe you can check whether all the workers are running when the master begins to run. Sometimes a worker may still be pulling Docker images.

All workers are running.

zhenxiu avatar Jan 24 '22 02:01 zhenxiu

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Apr 24 '22 03:04 stale[bot]

@zhenxiu did you find a solution for that? It also happens in version 1.5.1 with the mpi-example file. SSH works fine when I use the IP.

snirkop89 avatar Apr 25 '22 14:04 snirkop89
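Since ssh works by IP but not by name, one thing worth ruling out is that the master starts before the worker hostnames are resolvable. The following is a minimal sketch, not a confirmed fix for this issue: it polls name resolution for a list of hosts and reports the ones that still fail. The function names (`read_hosts`, `resolves`, `wait_for_hosts`) and the timeout values are hypothetical, and the host file is assumed to hold one hostname per line.

```python
import socket
import time
from typing import Callable, Iterable, List


def read_hosts(path: str) -> List[str]:
    """Return the non-empty lines of a host file (assumed: one hostname per line)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def resolves(host: str) -> bool:
    """Return True if the local resolver can turn `host` into an address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def wait_for_hosts(
    hosts: Iterable[str],
    timeout: float = 120.0,   # hypothetical default, tune for your cluster
    interval: float = 2.0,    # hypothetical default
    check: Callable[[str], bool] = resolves,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until every host passes `check` or `timeout` seconds elapse.

    Returns the hosts that still fail; an empty list means all resolved.
    """
    deadline = time.monotonic() + timeout
    pending = list(hosts)
    while True:
        # Re-test only the hosts that have not resolved yet.
        pending = [h for h in pending if not check(h)]
        if not pending or time.monotonic() >= deadline:
            return pending
        sleep(interval)
```

To try it, call `wait_for_hosts(read_hosts(path))` in the master container before `mpiexec`, where `path` is whichever host file your job reads. If the returned list is non-empty after the timeout, the failure is in name resolution rather than in sshd or MPI itself.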

https://github.com/volcano-sh/volcano/blob/629034d521cd24798fd3d43690a21250d4e5d453/example/integrations/mpi/mpi-example.yaml#L24

Please check whether you have added this line.

hwdef avatar Apr 26 '22 01:04 hwdef

Yes. I'm using the example you sent. I verified that the host names in the file are correct by echoing them before the command. I read in one of the issues that it might be related to the user running the image (it should be root) or to a DNS problem.

However, I'm using an image that we already use to run multi-node MPIJobs successfully. Since the mpi-operator project doesn't support passing Volcano options (for example, minAvailable), I'm trying to move to a Volcano job.

Not sure it's related, but the vcjob creates a service and sets its port to 1, whereas MPIJob creates a headless service without any ports specified. I tried editing the svc while the job was already live, but it didn't affect the result.

snirkop89 avatar Apr 26 '22 07:04 snirkop89

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Jul 30 '22 18:07 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Oct 01 '22 00:10 stale[bot]