SSH fails intermittently in MPI jobs
When I run an MPI job, I get: ssh: Could not resolve hostname ***-job: Name or service not known.
This happens often, especially with more workers.
@zhenxiu Thanks for reporting this. Could you provide some basic information, such as the MPI job YAML, the Volcano version, and the steps to reproduce?
I'm using v1.2.0 and the official MPI YAML. The only difference is that I use my own image with MPI installed. @william-wang
Maybe you can check whether all the workers are running when the master starts. Sometimes a worker may still be pulling Docker images.
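One way to act on this suggestion is to have the master wait until every worker hostname resolves before it launches mpiexec. The following is only a minimal sketch, not something taken from the Volcano example: the function name wait_for_hosts, the timeout values, and the worker hostnames are all hypothetical, and it only checks DNS resolution, not that sshd is accepting connections.

```python
import socket
import time


def _resolves(host, resolve):
    """Return True if the resolver can look up the hostname."""
    try:
        resolve(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def wait_for_hosts(hosts, timeout=120.0, interval=2.0,
                   resolve=socket.gethostbyname,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll DNS until every host resolves or the timeout expires.

    Returns the list of hosts that still do not resolve
    (an empty list means all workers are resolvable).
    """
    deadline = clock() + timeout
    pending = list(hosts)
    while True:
        pending = [h for h in pending if not _resolves(h, resolve)]
        if not pending or clock() >= deadline:
            return pending
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical worker names; replace with the names from your hostfile.
    missing = wait_for_hosts(["myjob-mpiworker-0.myjob"], timeout=10.0)
    print("unresolved:", missing)
```

If the returned list is non-empty after the timeout, the master is most likely starting before the worker DNS records exist, or the names in the hostfile do not match what the service publishes.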
All workers are running.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
@zhenxiu Did you find a solution for this? It also happens in version 1.5.1 with the mpi-example file. SSH by IP address works fine.
https://github.com/volcano-sh/volcano/blob/629034d521cd24798fd3d43690a21250d4e5d453/example/integrations/mpi/mpi-example.yaml#L24
Please check whether you have added this line.
Yes. I'm using the example you sent. I verified that the host names in the file are correct by echoing them before the command. I read in one of the issues that it might be related to the user running the image (it should be root) or to a DNS problem.
However, I'm using an image that we already use to run MPIJob successfully across multiple nodes. Since the mpi-operator project doesn't support passing Volcano options (for example, minAvailable), I'm trying to move to a Volcano job.
Not sure it's related, but the vcjob creates a service and sets its port to 1, whereas MPIJob creates a headless service without any ports specified. I tried editing the svc while the job was live, but it didn't affect the result.
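Echoing the host names only shows that the file contents look right; it does not show that the names resolve from inside the master pod. A small diagnostic like the one below could narrow that down. It is a sketch under assumptions: the hostfile is taken to have one host per line (optionally followed by options such as slots=N), the path in the usage example is a placeholder, and parse_hostfile and unresolved are hypothetical helper names.

```python
import socket


def parse_hostfile(text):
    """Extract host names from hostfile text, one host per line.

    Blank lines and '#' comments are skipped; anything after the
    first whitespace-separated token (e.g. 'slots=2') is ignored.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])
    return hosts


def unresolved(hosts, resolve=socket.gethostbyname):
    """Return {host: error message} for every host that fails to resolve."""
    failed = {}
    for host in hosts:
        try:
            resolve(host)
        except OSError as exc:  # covers socket.gaierror
            failed[host] = str(exc)
    return failed


if __name__ == "__main__":
    import sys
    # Usage: python check_hosts.py /path/to/hostfile  (path is a placeholder)
    with open(sys.argv[1]) as fh:
        for host, err in unresolved(parse_hostfile(fh.read())).items():
            print(f"{host}: {err}")
```

Running this in the master container just before mpiexec would show whether the failure is a DNS lookup problem for specific workers or something else, such as the SSH user or key setup.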
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗