ecs-watchbot

Try machine learning distribution on ecs-watchbot

Open jakepruitt opened this issue 7 years ago • 2 comments

Context

Talking with the @mapbox/ml-club today, it sounds like running training on multiple hosts is still unexplored, and it could ease some of the difficulties involved in running single hosts for days on end.

Thoughts

I'm not sure if this belongs in ecs-watchbot or in ecs-api, but it looks like https://github.com/uber/horovod is a potential way to try out distributing a machine learning system across multiple hosts.

The connection takes place over TCP, so maybe ENIs and named DNS records/service discovery would help here.

cc/ @mapbox/ml-club @mapbox/platform

jakepruitt, Feb 01 '18 19:02

This would be really cool to explore -- but it might be worth waiting until ECS rolls out their upcoming service discovery system. From the sound of it, that system will make it far easier to manage the IP addresses, DNS entries, and health checking that are usually needed for this kind of cross-node communication.

Our last communication with the team put the launch of this feature in late Feb / early March.

rclark, Feb 02 '18 17:02

Since service discovery is now available, I think we can start experimenting here (refs https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-discovery.html). Looks like there's even CloudFormation support: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-servicediscovery-service.html.

From the API standpoint, I think it'd make sense to have service discovery be an option during template creation. Then, within the watchbot listen code, we could poll the Route53 record for the service and internally keep the list of IPs or IP:Port combos of all of the containers in the service. We could then inject this list into the worker as a comma-separated environment variable.

jakepruitt, Jun 21 '18 18:06
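
For illustration only, here is a minimal Python sketch of the idea in the last comment: resolve the service's DNS name to the addresses behind it, then hand the list to the worker as one comma-separated environment variable. It is not watchbot code. The variable name `WATCHBOT_PEERS`, the helper names, and the example hostname are all hypothetical, and it assumes the service registers plain A records with every container listening on one known port.

```python
import socket
from typing import Dict, List


def resolve_peers(hostname: str, port: int) -> List[str]:
    """Resolve the IPv4 addresses behind `hostname` into sorted, de-duplicated IP:port strings."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    ips = {info[4][0] for info in infos}
    return sorted(f"{ip}:{port}" for ip in ips)


def worker_env(
    peers: List[str],
    base_env: Dict[str, str],
    var_name: str = "WATCHBOT_PEERS",  # hypothetical variable name
) -> Dict[str, str]:
    """Return a copy of `base_env` with the peer list added as a comma-separated value."""
    env = dict(base_env)
    env[var_name] = ",".join(peers)
    return env


if __name__ == "__main__":
    # "trainer.ml.local" is a made-up service discovery name; substitute a real one.
    try:
        peers = resolve_peers("trainer.ml.local", 5000)
    except socket.gaierror:
        peers = []  # name not resolvable yet; a poller would simply retry later
    print(worker_env(peers, {})["WATCHBOT_PEERS"])
```

A listener could call `resolve_peers` on an interval and pass the result of `worker_env` as the environment of each worker process it spawns. If containers use dynamic ports, A records alone will not carry the port; SRV records would be needed, which the standard library resolver does not expose.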