mpirun issues the warning "no protocol specified"
Background information
I have updated a machine to Ubuntu 20.04 LTS, which involved an update to Open MPI 4.0.3. When starting anything with mpirun I get the message "No protocol specified" on stderr. This did not happen in earlier versions, and I cannot find any information about this warning in the documentation or elsewhere on the web.
I am running Open MPI on a single computer.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.3 (with warning); v2.1.1 (no warning)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
On Ubuntu 19.10: `apt install openmpi`, then a release upgrade to Ubuntu 20.04.
Please describe the system on which you are running
- Operating system/version: Ubuntu 20.04
- Computer hardware: Single AMD Rome 32 Cores / 64 Threads
- Network type: Ethernet
Details of the problem
Using this simple test script
test.sh:
#!/bin/bash
X=$(( $RANDOM % 5 ))
echo $OMPI_COMM_WORLD_RANK / $OMPI_COMM_WORLD_SIZE waits for $X seconds
sleep $X
echo $OMPI_COMM_WORLD_RANK / $OMPI_COMM_WORLD_SIZE ready
mpirun -c 6 test.sh >test.log
I get the message
No protocol specified
and the expected output in test.log
With Open MPI 2.1.1 I get, as expected, no output on stderr.
We are seeing this same message on stderr, which is quite annoying. We had our backend respond to any stderr messages, but now we need to apply some specific filtering to ignore this one.
FWIW: I grep'd OMPI and that error message doesn't appear anywhere in our code. Might be some 3rd-party library we link against. I'll ask around.
This is an SSH-related warning (X11 related).
If you are using a release series, you can try (disclaimer, AFK/untested)
mpirun --mca pml_rsh_args -x ...
and see if it helps
@dromer @philippkraft Did this resolve the problem? If so, we can look to see if there is something we can do to make it automatic.
Please reopen if this didn't fix the problem, and/or if someone comes up with a way we can automate addition of the necessary option.
Sorry, totally forgot to follow up on this.
This option gives us: mpirun: Error: unknown option "--mca pml_rsh_args"
This is with 4.0.3, btw.
I believe you were given a typo: it should be `--mca plm_rsh_args -x`
Sorry, forgot about this. When running
mpirun --mca plm_rsh_args -x -np 3 ...
I get the exact same output as in the description of the issue.
This happens both with PuTTY as the SSH client on Windows with the -X flag, and with ssh from the Windows Subsystem for Linux (based on Ubuntu 18.04) without the -X flag.
I have this issue when connected via xrdp. `mpirun --mca plm_rsh_args -x` seems to have no effect. Please re-open.
I have the same problem. But the server had not been logged into the desktop GUI. After logging in, I repeated the command and everything is fine now.
@rafaeltiveron Could you explain what you mean by 'logged to desktop GUI'? Did you set a logging setting somewhere?
Yes, but I didn't explain completely. Using a simple mpiexec command over SSH on the server, the "No protocol specified" message disappeared only when the GUI interface is active on the server, i.e., when a user is logged into the desktop session. Maybe this problem is related to X11 or X.org. One more observation: Linux Mint deals better with Open MPI.
Re-opening, as there still seem to be some unresolved issues.
I believe that @ggouaillardet correctly identified the issue in https://github.com/open-mpi/ompi/issues/7701#issuecomment-788412673: this appears to be an X11 configuration issue. I.e., something is trying to set up X11 forwarding, and failing.
You could try running:
mpirun --mca plm_base_verbose 100 ... 2>&1 | tee out.txt
And looking in detail at the resulting `out.txt` file to see the exact `ssh` command that `mpirun` invoked. If you add `--mca plm_rsh_args -x` to that `mpirun` command line, you should see the `-x` get added to the `ssh` command invoked by `mpirun` (`-x` tells `ssh` to disable X11 forwarding).
Hello, I am running into the same issue.
I am using the default Ubuntu openmpi-bin installation, nothing more.
I am using Ubuntu Impish (21.10) on the front end [yann-MBP], and was using Ubuntu Groovy (20.10) on the nodes before, where this did happen. I upgraded the nodes to Hirsute (21.04) and I still face the same issue. The nodes are [bmax-Intel] and [yann-pc].
code:
mpirun --mca plm_base_verbose 100 --use-hwthread-cpus --hostfile /home/mpiuser/cloud/hostfile /home/mpiuser/cloud/hello 2>&1 | tee out.txt
result:
No protocol specified
[yann-MBP:24149] mca: base: components_register: registering framework plm components
[yann-MBP:24149] mca: base: components_register: found loaded component slurm
[yann-MBP:24149] mca: base: components_register: component slurm register function successful
[yann-MBP:24149] mca: base: components_register: found loaded component isolated
[yann-MBP:24149] mca: base: components_register: component isolated has no register or open function
[yann-MBP:24149] mca: base: components_register: found loaded component rsh
[yann-MBP:24149] mca: base: components_register: component rsh register function successful
[yann-MBP:24149] mca: base: components_open: opening plm components
[yann-MBP:24149] mca: base: components_open: found loaded component slurm
[yann-MBP:24149] mca: base: components_open: component slurm open function successful
[yann-MBP:24149] mca: base: components_open: found loaded component isolated
[yann-MBP:24149] mca: base: components_open: component isolated open function successful
[yann-MBP:24149] mca: base: components_open: found loaded component rsh
[yann-MBP:24149] mca: base: components_open: component rsh open function successful
[yann-MBP:24149] mca:base:select: Auto-selecting plm components
[yann-MBP:24149] mca:base:select:( plm) Querying component [slurm]
[yann-MBP:24149] mca:base:select:( plm) Querying component [isolated]
[yann-MBP:24149] mca:base:select:( plm) Query of component [isolated] set priority to 0
[yann-MBP:24149] mca:base:select:( plm) Querying component [rsh]
[yann-MBP:24149] mca:base:select:( plm) Query of component [rsh] set priority to 10
[yann-MBP:24149] mca:base:select:( plm) Selected component [rsh]
[yann-MBP:24149] mca: base: close: component slurm closed
[yann-MBP:24149] mca: base: close: unloading component slurm
[yann-MBP:24149] mca: base: close: component isolated closed
[yann-MBP:24149] mca: base: close: unloading component isolated
No protocol specified
[yann-MBP:24149] [[5385,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "352911360" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "3" -mca orte_node_regex "yann-MBP,worker[1:1,3]@0(3)" -mca orte_hnp_uri "352911360.0;tcp://192.168.43.102:56797" --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "352911360.0;tcp://192.168.43.102:56797" -mca hwloc_base_use_hwthreads_as_cpus "1" -mca pmix "^s1,s2,cray,isolated"
failed to open /dev/dri/renderD128: Permission denied
failed to open /dev/dri/renderD128: Permission denied
No protocol specified
[yann-pc:01745] mca: base: components_register: registering framework plm components
[yann-pc:01745] mca: base: components_register: found loaded component rsh
[yann-pc:01745] mca: base: components_register: component rsh register function successful
[yann-pc:01745] mca: base: components_open: opening plm components
[yann-pc:01745] mca: base: components_open: found loaded component rsh
[yann-pc:01745] mca: base: components_open: component rsh open function successful
[yann-pc:01745] mca:base:select: Auto-selecting plm components
[yann-pc:01745] mca:base:select:( plm) Querying component [rsh]
[yann-pc:01745] mca:base:select:( plm) Query of component [rsh] set priority to 10
[yann-pc:01745] mca:base:select:( plm) Selected component [rsh]
No protocol specified
No protocol specified
[bmax-Intel:116058] mca: base: components_register: registering framework plm components
[bmax-Intel:116058] mca: base: components_register: found loaded component rsh
[bmax-Intel:116058] mca: base: components_register: component rsh register function successful
[bmax-Intel:116058] mca: base: components_open: opening plm components
[bmax-Intel:116058] mca: base: components_open: found loaded component rsh
[bmax-Intel:116058] mca: base: components_open: component rsh open function successful
[bmax-Intel:116058] mca:base:select: Auto-selecting plm components
[bmax-Intel:116058] mca:base:select:( plm) Querying component [rsh]
[bmax-Intel:116058] mca:base:select:( plm) Query of component [rsh] set priority to 10
[bmax-Intel:116058] mca:base:select:( plm) Selected component [rsh]
No protocol specified
[yann-MBP:24149] [[5385,0],0] complete_setup on job [5385,1]
[yann-MBP:24149] [[5385,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/odls/base/odls_base_default_fns.c at line 292
munmap_chunk(): invalid pointer
[bmax-Intel:116058] mca: base: close: component rsh closed
[bmax-Intel:116058] mca: base: close: unloading component rsh
[yann-pc:01745] mca: base: close: component rsh closed
[yann-pc:01745] mca: base: close: unloading component rsh
Using `--mca plm_rsh_args -x` indeed has no effect at all.
Forcing the front end and all nodes not to use X11 forwarding has no effect either:
/etc/ssh/ssh_config => ForwardX11 no
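For what it's worth, a hedged way to double-check what the ssh client will actually do for a given host (`worker1` is just one of the node names used in this thread; `ssh -G` needs a reasonably recent OpenSSH):

```sh
# Print the effective client configuration for the host, after all config files
# and defaults are applied, and check the resulting X11 forwarding setting.
ssh -G worker1 | grep -i forwardx11
```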
These lines in your output indicate that the `-x` was not added:
[yann-MBP:24149] [[5385,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "352911360" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "3" -mca orte_node_regex "yann-MBP,worker[1:1,3]@0(3)" -mca orte_hnp_uri "352911360.0;tcp://192.168.43.102:56797" --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "352911360.0;tcp://192.168.43.102:56797" -mca hwloc_base_use_hwthreads_as_cpus "1" -mca pmix "^s1,s2,cray,isolated"
The `<template>` is later replaced with the hostname. In v4.1.x, any user-specified ssh arguments would have been added after `ssh` and before `<template>`, per https://github.com/open-mpi/ompi/blob/76d00f65e6f9d8df620b0aa4da49c04e9d3c9f1f/orte/mca/plm/rsh/plm_rsh_module.c#L386-L394
For example:
$ mpirun --version
mpirun (Open MPI) 4.1.2rc2
Report bugs to http://www.open-mpi.org/community/help/
$ mpirun --mca plm_rsh_args -x --mca plm_base_verbose 100 --host mpi004,mpi005 uptime>&! out.txt
$ grep ssh out.txt
[savbu-usnic-a:20233] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[savbu-usnic-a:20233] [[7695,0],0] plm:rsh_setup on agent ssh : rsh path NULL
/usr/bin/ssh -x <template> PATH=/home/jsquyres/bogus/bin:$PATH ; export PATH ; LD_LIBRARY ...etc.
Note the `-x` after `ssh`.

So if you can run with `mpirun --mca plm_rsh_args -x --mca plm_base_verbose 100 ...` and see the `-x` in the `ssh` command line in the verbose output, but you're still seeing the `No protocol specified`, then perhaps `ssh` is being told to use X11 forwarding from something else...? Or perhaps there's something else that's emitting `No protocol specified`...?

If you just `/usr/bin/ssh HOSTNAME uptime` (in the same shell where you see the errant `No protocol specified` output from `mpirun`), what output do you see? (filling in the appropriate `HOSTNAME`, of course)
I have already tried the MCA option with `-x`; it changed nothing.
As for `ssh worker1 uptime`: it works well, and the "No protocol" behaviour is not there.
@jsquyres Could you provide a full command (without the ellipses) so that I can test it? I tried the original MWE script with the following results:
me@myhost:~/source/trouble$ mpirun --mca plm_rsh_args -x --mca plm_base_verbose 100 test.sh
No protocol specified
[myhost:2293201] mca: base: components_register: registering framework plm components
[myhost:2293201] mca: base: components_register: found loaded component isolated
[myhost:2293201] mca: base: components_register: component isolated has no register or open function
[myhost:2293201] mca: base: components_register: found loaded component slurm
[myhost:2293201] mca: base: components_register: component slurm register function successful
[myhost:2293201] mca: base: components_register: found loaded component rsh
[myhost:2293201] mca: base: components_register: component rsh register function successful
[myhost:2293201] mca: base: components_open: opening plm components
[myhost:2293201] mca: base: components_open: found loaded component isolated
[myhost:2293201] mca: base: components_open: component isolated open function successful
[myhost:2293201] mca: base: components_open: found loaded component slurm
[myhost:2293201] mca: base: components_open: component slurm open function successful
[myhost:2293201] mca: base: components_open: found loaded component rsh
[myhost:2293201] mca: base: components_open: component rsh open function successful
[myhost:2293201] mca:base:select: Auto-selecting plm components
[myhost:2293201] mca:base:select:( plm) Querying component [isolated]
[myhost:2293201] mca:base:select:( plm) Query of component [isolated] set priority to 0
[myhost:2293201] mca:base:select:( plm) Querying component [slurm]
[myhost:2293201] mca:base:select:( plm) Querying component [rsh]
[myhost:2293201] mca:base:select:( plm) Query of component [rsh] set priority to 10
[myhost:2293201] mca:base:select:( plm) Selected component [rsh]
[myhost:2293201] mca: base: close: component isolated closed
[myhost:2293201] mca: base: close: unloading component isolated
[myhost:2293201] mca: base: close: component slurm closed
[myhost:2293201] mca: base: close: unloading component slurm
[myhost:2293201] [[43590,0],0] complete_setup on job [43590,1]
0 / 8 waits for 2 seconds
1 / 8 waits for 4 seconds
2 / 8 waits for 2 seconds
3 / 8 waits for 4 seconds
4 / 8 waits for 4 seconds
5 / 8 waits for 2 seconds
6 / 8 waits for 4 seconds
7 / 8 waits for 0 seconds
7 / 8 ready
0 / 8 ready
2 / 8 ready
5 / 8 ready
1 / 8 ready
3 / 8 ready
4 / 8 ready
6 / 8 ready
[myhost:2293201] mca: base: close: component rsh closed
[myhost:2293201] mca: base: close: unloading component rsh
And the output of `/usr/bin/ssh myhost uptime` in the same shell:
me@myhost:~/source/trouble$ /usr/bin/ssh myhost uptime
me@myhost's password:
12:54:57 up 106 days, 7:00, 0 users, load average: 0.07, 0.04, 0.03
Results are consistent whether logged into a GUI environment through xrdp, or using PuTTY without X11 forwarding or a client-side X11 server (sorry, I don't have X11 capabilities client-side).
If someone could also explain to me what 'user area is logged' means (as referenced by @rafaeltiveron ), that would be helpful as well.
@YannChemin Did the `-x` show up in the debug output as part of the `ssh` command? If so, it means that either something else is telling `ssh` to use X11 forwarding (i.e., outside of Open MPI -- perhaps an env variable or config file?), or the `No protocol specified` message is not from ssh/X11 forwarding. We know that the `No protocol specified` message is definitely not coming from Open MPI.

@ke7kto What version of Open MPI are you using? I was just giving ellipses in the above command to represent that you can try running anything. You ran with `test.sh`, and it shows the `No protocol specified` message -- so you did it! 😄 That being said, I notice that you're seeing this message even when you're running on a single machine. So `ssh` isn't even used here at all.
Just out of curiosity: can you have your shell startup files emit something when they start and when they complete? E.g., if your shell startup file is `.bashrc`, put an `echo starting bashrc` right up at the top and an `echo ending bashrc` at the end (the specifics of what you need to do may be highly dependent upon your own setup). What I'm wondering here is if there's something being invoked by your shell startup files (or possibly even your shell shutdown files) that is emitting that `No protocol...` message.
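A minimal sketch of that instrumentation, assuming bash and that `~/.bashrc` is the relevant startup file (adjust for your own setup):

```sh
# At the very top of ~/.bashrc
echo "starting bashrc" >&2

# ... existing contents unchanged ...

# At the very end of ~/.bashrc
echo "ending bashrc" >&2
```

If `No protocol specified` then appears between the two markers, something invoked from the startup file is emitting it; if it appears outside them, the startup file is not the culprit.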
I'm using Open MPI 4.0.3 as installed from the Ubuntu 20.04 LTS distribution. I'm running these commands in an interactive prompt, so adding `echo` statements to my `bashrc` did nothing. Is there a default script `mpirun` looks for to set up its environment that I could try?

This may still be some sort of an X11-related problem, though it doesn't seem to show up in the system logs (at least there were no results for `sudo grep -r "No protocol" .` in /var/log and in ~/.local/share).
Open MPI doesn't look for any particular script. `mpirun` locally fork/exec's the `orted` (Open MPI's user-level helper daemon), which then fork/exec's `test.sh`.

But `test.sh`, which I assume is a shell script, may/will invoke your `.bashrc` or somesuch. Hence, I'm wondering if there are other commands getting invoked that are emitting that `No protocol...` message. There may be other files that are invoked outside of just `$HOME/.bashrc` -- there may be other "dot" files in your `$HOME`, and there may also be files in `/etc/` that can get invoked automatically.
I have the same problem with a recently compiled version from master... and no clue where it is coming from. Running on the same machine as well. Grepping the sources of ompi and ucx, I found nothing.
Here I have disconnected the nodes so that it fails deliberately, in order to look at things more precisely.
COMMAND
mpirun --mca plm_rsh_args -x --mca plm_base_verbose 100 --use-hwthread-cpus --hostfile /home/mpiuser/cloud/hostfile /home/mpiuser/cloud/hello
RESULT
No protocol specified
[yann-MBP:43244] mca: base: components_register: registering framework plm components
[yann-MBP:43244] mca: base: components_register: found loaded component slurm
[yann-MBP:43244] mca: base: components_register: component slurm register function successful
[yann-MBP:43244] mca: base: components_register: found loaded component isolated
[yann-MBP:43244] mca: base: components_register: component isolated has no register or open function
[yann-MBP:43244] mca: base: components_register: found loaded component rsh
[yann-MBP:43244] mca: base: components_register: component rsh register function successful
[yann-MBP:43244] mca: base: components_open: opening plm components
[yann-MBP:43244] mca: base: components_open: found loaded component slurm
[yann-MBP:43244] mca: base: components_open: component slurm open function successful
[yann-MBP:43244] mca: base: components_open: found loaded component isolated
[yann-MBP:43244] mca: base: components_open: component isolated open function successful
[yann-MBP:43244] mca: base: components_open: found loaded component rsh
[yann-MBP:43244] mca: base: components_open: component rsh open function successful
[yann-MBP:43244] mca:base:select: Auto-selecting plm components
[yann-MBP:43244] mca:base:select:( plm) Querying component [slurm]
[yann-MBP:43244] mca:base:select:( plm) Querying component [isolated]
[yann-MBP:43244] mca:base:select:( plm) Query of component [isolated] set priority to 0
[yann-MBP:43244] mca:base:select:( plm) Querying component [rsh]
[yann-MBP:43244] mca:base:select:( plm) Query of component [rsh] set priority to 10
[yann-MBP:43244] mca:base:select:( plm) Selected component [rsh]
[yann-MBP:43244] mca: base: close: component slurm closed
[yann-MBP:43244] mca: base: close: unloading component slurm
[yann-MBP:43244] mca: base: close: component isolated closed
[yann-MBP:43244] mca: base: close: unloading component isolated
No protocol specified
[yann-MBP:43244] [[58288,0],0] plm:rsh: final template argv:
/usr/bin/ssh -x <template> orted -mca ess "env" -mca ess_base_jobid "3819962368" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "3" -mca orte_node_regex "yann-MBP,worker[1:1,3]@0(3)" -mca orte_hnp_uri "3819962368.0;tcp://192.168.43.134:58779" --mca plm_rsh_args "-x" --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "3819962368.0;tcp://192.168.43.134:58779" -mca hwloc_base_use_hwthreads_as_cpus "1" -mca pmix "^s1,s2,cray,isolated"
ssh: connect to host worker3 port 22: No route to host
ssh: connect to host worker1 port 22: No route to host
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: yann-MBP
target node: worker1
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
[yann-MBP:43244] mca: base: close: component rsh closed
[yann-MBP:43244] mca: base: close: unloading component rsh
I also tried to include echo calls at start and end of ~/.bashrc and ~/.profile but they do not appear in the verbose listing.
... and to complete the picture: since yesterday, all of my computers have this in /etc/ssh/ssh_config:
ForwardX11 no
I did make one mis-statement above: in Open MPI v4.1.x, if `mpirun` is launching on the local machine, it fork/exec's the target process(es) directly -- it doesn't launch an intermediary `orted` (the user-level Open MPI helper daemon; that's only used on remote nodes). For example, this is from running on my own system:
jsquyres 27504 18899 0 07:31 pts/54 00:00:00 mpirun -np 1 /bin/sleep 600
jsquyres 27508 27504 0 07:31 pts/54 00:00:00 /bin/sleep 600
You can see that `/bin/sleep` is clearly a direct descendant of the `mpirun` process (i.e., `/bin/sleep`'s PPID is the PID of `mpirun`).
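A hedged way to reproduce that check on your own machine (the `-np 1` and the sleep duration are arbitrary choices for illustration):

```sh
# Launch a long-running process locally through mpirun, in the background.
mpirun -np 1 /bin/sleep 600 &

# Show PID/PPID of both processes; the PPID of /bin/sleep should be the PID of
# mpirun, i.e., there is no ssh or orted in between for a local-only launch.
ps -ef | egrep '[m]pirun|[s]leep 600'
```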
Just to be clear: this `No protocol specified` message is not a problem, per se. It's weird, and we don't know where it's coming from, but it does not appear to be hindering Open MPI's operation. It's an annoyance, but it doesn't appear to actually stop anything from functioning.
Also, it doesn't look like this message is coming from Open MPI itself. The presumption so far is that something else outside of Open MPI is being invoked that is emitting this message. SSH X11 forwarding was a first educated guess, but perhaps that's not correct. Indeed, seeing this message when running on a single node with no hostfile (per @ke7kto) tends to imply that it's not SSH because Open MPI doesn't use SSH to launch locally. The open question is therefore: what is emitting this message?
When using `mpirun` to run shell scripts, `$HOME/.bashrc` is one possibility where additional commands might be getting invoked. But there are other shell startup / shutdown files that would probably be worth investigating as well -- it very much depends on your particular system setup. E.g., you could look for (a quick scan is sketched after this list):

- `$HOME/.bash*`
- `$HOME/.profile`
- `/etc/profile`
- `/etc/profile.d/*`
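A quick, hedged way to scan those files for likely culprits -- "No protocol specified" is the classic Xlib/X11 authorization error, so X clients such as `xhost`, `xauth`, `xrdb`, or `xmodmap` invoked from a startup file are plausible sources; the exact file list below is an assumption, adjust it for your shell:

```sh
# Look for X11-related commands in common startup files; any hit is a candidate
# for the process that prints "No protocol specified" when it cannot authorize
# against the X server.
grep -nE 'xhost|xauth|xrdb|xmodmap|DISPLAY' \
    ~/.bashrc ~/.bash_profile ~/.profile /etc/profile /etc/profile.d/*.sh 2>/dev/null
```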
@ke7kto showed above that doing `mpirun ... test.sh`, where `test.sh` is presumably a shell script that ended up launching an MPI executable, and where this is running solely on the localhost (and therefore SSH is not involved), still shows the message.
How about this:

- `mpirun mpi_hello_world` (any MPI app will do) to launch it locally with no remote hosts: duplicate @ke7kto's result but with no shell script involved. Do we still see the `No protocol...` message?
- `mpirun hostname` -- i.e., launch a non-MPI program on just the localhost / no remote hosts. Do we still see the `No protocol...` message?
@jsquyres `test.sh` was a script file with the contents of @philippkraft's originally posted `test.sh`.

The output of `mpirun` with no arguments at all issues `No protocol specified` before printing the "mpirun could not find anything to do" message. `mpirun hostname` and `mpirun mpi_app` both issue the warning as well. Running programs that use MPI in the backend also results in the `No protocol specified` message.
Additionally, I've run the command set

ldd /bin/mpirun | egrep -o '^[^ ]+' | xargs dpkg -S {} | egrep -o '^[^:]+' | uniq | xargs apt-get source
grep -r "No protocol specified" .

with no matches. I assume this probably means that the error message is some sort of composite error message with a `{}` or `%s` or something, or am I misunderstanding how `ldd` works? Are there "sneaky" libraries that might not be loaded initially? I'm not really sure how to proceed.