HpBandSter

Running in AWS

Open ifed-ucsd opened this issue 6 years ago • 6 comments

I tried running this codebase in AWS but encountered an issue with Pyro not working across several AWS instances (i.e. one instance running the dispatcher and many instances acting as workers). It seems there is a communication issue. Have you ever encountered anything like this? If so, do you have any suggestions?

I really appreciate all of your help!

ifed-ucsd avatar Feb 09 '19 02:02 ifed-ucsd

Personally, I have not tried running things across AWS instances, but here are some general remarks when running it on any distributed setup:

  1. make sure that the dispatcher and every worker can reach the nameserver
  2. make sure every process running uses the correct 'host', i.e. it tries to use the network interface connected to the (in your case) internet and not, e.g., 127.0.0.1.
  3. increase the logging level to debug and see where it hangs. You could post those logs here and I can help you figure it out.
  4. make sure the nameserver is running before any other process is started.
  5. make sure they all use the same 'run_id'
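
A rough sketch of how items 1 to 3 could be checked on each instance before starting a worker. This is not part of HpBandSter; it uses only the Python standard library, and the helper names `local_ip_towards` and `can_reach` are made up for illustration. The nameserver address and port are whatever your own setup uses.

```python
import logging
import socket

# Item 3: raise the log level so debug messages become visible.
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("preflight")  # hypothetical logger name


def local_ip_towards(target_host, target_port):
    """Return the local IP the OS would pick to reach target_host.

    Connecting a UDP socket sends no packet; it only selects a route,
    so this shows which interface would be used (item 2).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((target_host, target_port))
        return s.getsockname()[0]


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (item 1)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        log.debug("cannot reach %s:%s (%s)", host, port, exc)
        return False
```

If `can_reach(ns_host, ns_port)` is false on a worker instance, the worker cannot register with the nameserver at all. If `local_ip_towards(ns_host, ns_port)` returns 127.0.0.1 for a remote nameserver, the 'host' value is probably wrong.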

Hope that helps. Let me know if you have any further questions.

sfalkner avatar Feb 11 '19 20:02 sfalkner

Did you ever get it to work?

sfalkner avatar Feb 20 '19 18:02 sfalkner

Hi Stefan. Sorry I forgot to get back to you on this. I spent quite some time on it but was never able to get it fully working. I was able to spin up the master and workers, and the workers were able to communicate with the master and receive jobs, but then some random fraction of the workers would die. There were no error messages in the logs, other than the master complaining about dead workers. I suspect it has something to do with how AWS manages network traffic.

ifed-ucsd avatar Feb 20 '19 19:02 ifed-ucsd

That is unfortunate, sorry to hear that. Idk if it helps, but you could try to see if somebody has done communication based on Pyro4 (https://pythonhosted.org/Pyro4/) on AWS. This package is used internally to handle all the network communication between the master and the worker. Maybe one just needs to adjust some parameter to make it work more robustly (like the number of reconnections before failure or something along those lines).
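
As an illustration of what adjusting such parameters could look like: Pyro4 reads its configuration from environment variables with a `PYRO_` prefix when it is first imported, and `COMMTIMEOUT` and `MAX_RETRIES` are two of its config items. The sketch below sets them before anything imports Pyro4. The helper name `set_pyro_env` and the values 30.0 and 3 are assumptions chosen for the example, and there is no guarantee that they fix the problem described in this thread.

```python
import os


def set_pyro_env(comm_timeout_s=30.0, max_retries=3):
    """Export Pyro4 settings via environment variables.

    Must run before Pyro4 (or hpbandster, which imports it) is imported,
    because Pyro4 reads these variables at import time.
    The default values here are illustrative guesses, not recommendations.
    """
    values = {
        "PYRO_COMMTIMEOUT": str(comm_timeout_s),  # seconds; Pyro4's default is no timeout
        "PYRO_MAX_RETRIES": str(max_retries),     # automatic retries of failed network calls
    }
    os.environ.update(values)
    return values


applied = set_pyro_env()
# import hpbandster  # only import after the variables are set
```

The same variables can also be exported in the shell that starts the master and the workers, which avoids touching the code.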

sfalkner avatar Feb 20 '19 19:02 sfalkner

Thanks Stefan. I'm aware of the Pyro4 backend and did lots of experimentation with its settings to try to get it to run on AWS. I think the fundamental problem is that AWS doesn't permit the amount of network traffic Pyro requires (i.e. I think the constant master-worker communication violates some sort of AWS rule).

ifed-ucsd avatar Feb 20 '19 19:02 ifed-ucsd

Unfortunately, I have no experience with AWS, so I can't really help beyond just some random guessing, but maybe it could be an option to only have the workers running on AWS and have the master run somewhere else. That would effectively cut the communication in half. If you limit the info_dict returned by compute to contain very little (you could still store everything to disk and collect it later), you might be able to limit communication even further. Is there maybe some support team for AWS one could ask about any limitations on the network traffic?
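
A minimal sketch of the "store everything to disk, return very little" idea, assuming the usual convention that compute returns a dict with a 'loss' and an 'info' entry. The helper name `compact_result`, the `tag` argument, and the file layout are made up for this example; a worker's compute method would call it at the end instead of returning the full details.

```python
import json
import os


def compact_result(loss, details, out_dir, tag):
    """Write the bulky details to a JSON file and return a small result dict.

    Only the loss and the file name travel back to the master; the
    details stay on the worker's disk and can be collected later.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{tag}.json")
    with open(path, "w") as fh:
        json.dump(details, fh)
    return {"loss": float(loss), "info": {"details_file": os.path.basename(path)}}
```

For example, `compact_result(0.12, {"val_curve": [0.5, 0.3, 0.12]}, "results", "run_0")` writes `results/run_0.json` and returns `{'loss': 0.12, 'info': {'details_file': 'run_0.json'}}`.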

sfalkner avatar Feb 20 '19 19:02 sfalkner