alpine-mpich

Not all containers of a service are listed in the /etc/opts/hosts file

Open ghost opened this issue 7 years ago • 4 comments

I have created a service with 16 containers and am running an MPI task from the master node. I noticed that not all of the service containers are taking load. When I opened the /etc/opts/hosts file, which is supposed to list all service containers, I found that most of the time 2-3 containers are missing from it.

I have figured out that this is an issue with the "netstat -t" command inside get_hosts, which cannot resolve all container names and hence returns fewer addresses most of the time.

ghost, Aug 27 '17 16:08

Are you using the Single Host or Multi Host orchestration? And which version of Docker are you running?

I have noticed that in the Multi Host solution some services are sometimes slow to become available, and I have to rerun the commands to get them all up.

Any alternative to netstat -t is welcome. At some point I'll look into the new Docker (I haven't checked since January but heard some big noise in the summer) to see what has been updated that could provide a better solution to this problem.

NLKNguyen, Aug 29 '17 18:08

I am using multiple hosts, and the Docker version is 1.16.0. "netstat" is slow and does not pick up all the container addresses. I made a local script which prepares the list of hosts and copies the file into the master container with scp before I log in and start the MPI task from inside. The commands "docker service ps --no-trunc master-service-name" and "docker service ps --no-trunc worker-service-name" give all the literals required to prepare the host file.

I did it in Java/Python, but to keep your project as it is, it would be better to use another shell script to populate the file.
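
For anyone who wants to try the same workaround, here is a minimal Python sketch of the idea, not the actual script I used. It assumes the `docker service ps` options `--filter desired-state=running` and `--format` are available in your Docker version, and that Swarm names task containers `<task name>.<task id>`. The function names (`task_container_names`, `list_service_tasks`, `build_hostfile`) are made up for illustration, and it is also an assumption that those container names resolve from inside the master container; if they do not, they would have to be mapped to IP addresses first.

```python
import subprocess


def task_container_names(ps_output):
    """Turn 'NAME ID' lines into '<name>.<id>' container names.

    Lines that do not have exactly two fields are skipped.
    """
    names = []
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, task_id = parts
        names.append(f"{name}.{task_id}")
    return names


def list_service_tasks(service):
    """Ask a Swarm manager for the running tasks of one service.

    Assumption: the docker CLI is on PATH and this runs on a manager node.
    """
    result = subprocess.run(
        ["docker", "service", "ps", "--no-trunc",
         "--filter", "desired-state=running",
         "--format", "{{.Name}} {{.ID}}", service],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def build_hostfile(services, list_tasks=list_service_tasks):
    """Return host file text with one container name per line."""
    hosts = []
    for service in services:
        hosts.extend(task_container_names(list_tasks(service)))
    return "\n".join(hosts) + "\n"
```

Calling `build_hostfile(["master-service-name", "worker-service-name"])` on a manager node returns the text to write to a local file, which can then be copied into the master container as described above. The `list_tasks` parameter is only there so the parsing can be exercised without a running Swarm.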

ghost, Aug 30 '17 16:08

I noticed similar issues while running MPI jobs. Some of the worker nodes occasionally go missing from /etc/opts/hosts. This doesn't cause problems for a short MPI job, but some longer jobs hang forever.

Any ideas on how to recover the hanging jobs?

lzhou-arch, Apr 07 '18 15:04

This might be a similar issue to #4 and netstat.

I've produced a solution using dig, based on https://stackoverflow.com/questions/49446165/how-to-get-all-ip-addresses-on-a-docker-network

I'll make a pull request.
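
Until then, the general idea of a DNS-based lookup can be sketched roughly as follows. This is not the code from the pull request: it swaps dig for Python's standard `socket` module, and it relies on Docker Swarm's `tasks.<service name>` DNS entry, which returns one address per task when queried from a container attached to the same overlay network. The function name `resolve_task_ips` and the service name in the usage note are placeholders.

```python
import socket


def resolve_task_ips(service, resolver=socket.gethostbyname_ex):
    """Return the de-duplicated IPv4 addresses behind tasks.<service>.

    Assumption: this runs inside a container on the same Swarm overlay
    network as the service, so Docker's embedded DNS answers the query.
    """
    _hostname, _aliases, addresses = resolver(f"tasks.{service}")
    # Sort numerically by octet so 10.0.0.2 comes before 10.0.0.10.
    return sorted(set(addresses),
                  key=lambda ip: tuple(int(part) for part in ip.split(".")))
```

Inside the master container, `resolve_task_ips("worker-service-name")` would give the addresses to write to the hosts file, one per line. Unlike parsing netstat output, this does not depend on connections already being established, though tasks that have not started yet will still be missing from the answer.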

simonholgate, Mar 08 '19 18:03