Running AMUSE without a (stable) network
(see also #128)
AMUSE requires a stable network connection so that MPI can communicate with the various workers.
When no network is available, the network is nonstandard (e.g. connecting via a VPN), or the network is unstable, AMUSE doesn't run well.
This probably needs to be addressed in a better way than with a command-line workaround (e.g. mpirun --mca btl_tcp_if_include lo0 -n 1 python test.py).
Maybe we can choose the network to be used from within AMUSE in some way? And show the available network options before?
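On showing the available network options: a minimal sketch, assuming Python's standard `socket.if_nameindex()` is an acceptable way to enumerate interfaces. The function names `list_network_interfaces` and `guess_loopback` are hypothetical and not part of AMUSE, and treating `lo` (Linux) and `lo0` (macOS/BSD) as the loopback names is an assumption that may not hold on every machine.

```python
import socket


def list_network_interfaces():
    """Return the names of the network interfaces reported by the OS."""
    # socket.if_nameindex() yields (index, name) pairs.
    return [name for _index, name in socket.if_nameindex()]


def guess_loopback(names):
    """Return the first name that looks like a loopback interface, or None.

    Assumption: the loopback interface is called "lo" or "lo0".
    """
    for name in names:
        if name in ("lo", "lo0"):
            return name
    return None


if __name__ == "__main__":
    names = list_network_interfaces()
    print("Available interfaces:", ", ".join(names))
    print("Probable loopback:", guess_loopback(names))
```

Something like this could be printed before AMUSE starts its workers, so the user can see which interface name to pass to MPI.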
Just to clarify: the problem here is that AMUSE is running entirely on one machine, that machine's network connection to the outside world is unstable, and even though AMUSE doesn't use that connection, it still causes problems?
I can see how that could happen with OpenMPI trying to use every network connection it can find in parallel, including some that don't allow connecting back to the local host, and then that MCA parameter would help by telling it to ignore everything but the loopback interface. It's actually possible to have AMUSE add that option automatically, which would also solve the same problem I'm having with my somewhat exotic networking setup on Ubuntu. Of course, enabling that option by default would also make it impossible to run on multiple nodes of a cluster :smile:.
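For illustration, here is one way the option could be added automatically. Open MPI also reads MCA parameters from environment variables named `OMPI_MCA_<parameter>`, so setting `OMPI_MCA_btl_tcp_if_include` before MPI is initialised should have the same effect as the `--mca` flag. This is only a sketch: the function name `restrict_openmpi_to_interface` is hypothetical, the default interface name `lo` is an assumption (macOS uses `lo0`), and whether the setting reaches AMUSE's spawned workers in every setup is untested here.

```python
import os


def restrict_openmpi_to_interface(interface="lo", env=None):
    """Tell Open MPI's TCP transport to use only the given interface.

    Must be called before MPI is initialised (e.g. before importing mpi4py).
    Other MPI implementations ignore this variable.
    """
    env = os.environ if env is None else env
    # Environment equivalent of: mpirun --mca btl_tcp_if_include <interface>
    env["OMPI_MCA_btl_tcp_if_include"] = interface
    return env


if __name__ == "__main__":
    restrict_openmpi_to_interface("lo")
    print(os.environ["OMPI_MCA_btl_tcp_if_include"])
```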
An option might indeed be to have AMUSE pop up some kind of network configuration tool, but then you don't want it to do that when running on a cluster either. We could inspect the environment and, if we don't find any evidence of running inside a SLURM job, assume we're on a single machine and automatically configure MPI accordingly. That would only break on a non-SLURM cluster; those are rare these days but may become more common again if Flux starts gaining ground...
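The environment check could look roughly like this. It assumes that the presence of `SLURM_JOB_ID`, which SLURM sets inside a job, is sufficient evidence of running on a cluster. The function names are hypothetical, and leaving a user-provided `OMPI_MCA_btl_tcp_if_include` untouched is a design choice of this sketch, not existing AMUSE behaviour.

```python
import os


def inside_slurm_job(env=None):
    """Return True if the environment looks like a SLURM job."""
    env = os.environ if env is None else env
    return "SLURM_JOB_ID" in env


def should_force_loopback(env=None):
    """Decide whether to restrict MPI to the loopback interface.

    Assumption: no SLURM job means a single machine. A non-SLURM
    cluster (e.g. Flux) would be misdetected by this check.
    """
    env = os.environ if env is None else env
    if "OMPI_MCA_btl_tcp_if_include" in env:
        # The user already chose an interface; do not override it.
        return False
    return not inside_slurm_job(env)


if __name__ == "__main__":
    print("Force loopback:", should_force_loopback())
```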
I also had some network interruptions during the download of the model (twice at ~30 GB), and it is really frustrating to restart from the beginning just because of that. In my opinion, the download should support resuming after an interruption.
Again to clarify, this is about the installation process, and in particular about MESA? That would be a different issue, but I see the point.
We'd have to modify the download procedure to check for partially downloaded files, and resume, assuming the server supports that. The upcoming new build system uses wget or curl to download, and those have some options to allow for more retries and longer timeouts I think, so you have a better chance of completing the download even on an unstable connection. Checking for a partial download will require a bit of a hack with make, but it's not impossible. I'll see what I can do there.
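To make the resume idea concrete, here is a sketch in Python rather than make/wget, purely for illustration. It asks the server for only the missing bytes with an HTTP `Range` header and appends to a `.part` file. The function names and the `.part` suffix are hypothetical, and it assumes the server honours range requests (it falls back to a full download if the server answers with 200 instead of 206). A partial file that is already complete would get a 416 error from the server, which this sketch does not handle.

```python
import os
import urllib.request


def build_resume_request(url, partial_path):
    """Return (request, offset), asking only for bytes not yet on disk."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def download_with_resume(url, path, chunk_size=1 << 20, timeout=60):
    """Download url to path, continuing from path + '.part' if it exists."""
    partial_path = path + ".part"
    request, offset = build_resume_request(url, partial_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 means the server honoured the Range header; anything else
        # means it is sending the whole file, so start over.
        resumed = offset > 0 and response.status == 206
        with open(partial_path, "ab" if resumed else "wb") as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    # Only rename once the transfer finished without an exception.
    os.replace(partial_path, path)
```

If the connection drops, the exception leaves the `.part` file in place, so calling `download_with_resume` again continues from where it stopped.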