telemetry-analysis-service Handle gracefully the lack of spot instances

See the cluster at [0]. It spent >70 minutes provisioning. We need to add the ability to turn off spot instances, so that we can guarantee clusters to people.

Longer-term, we need to expand our use of spot instances. Different instance types, different regions, and more would give us a wider selection of machines and reduce this problem.

[0] https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-1569LDWQXO5A7

Feb 13 '17 18:02 fbertsch

Turning off spot instances is possible with the constance config AWS_USE_SPOT_INSTANCES; admins can edit it at https://analysis.telemetry.mozilla.org/admin/constance/config/. The same applies to our bid, which can be changed via the AWS_SPOT_BID_CORE config.
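To illustrate how the two settings might interact, here is a minimal sketch, not the actual ATMO code. It assumes the settings are read as attributes of a config object (django-constance exposes one via `from constance import config`); the helper name `market_settings` and the bid value `0.84` are hypothetical, and a `SimpleNamespace` stands in for the real config so the snippet runs on its own.

```python
from types import SimpleNamespace


def market_settings(config):
    """Return the EMR instance-group market fields implied by the config.

    Hypothetical helper: `config` is any object exposing the two settings
    named in this thread as attributes.
    """
    if config.AWS_USE_SPOT_INSTANCES:
        # EMR expects the bid price as a string.
        return {"Market": "SPOT", "BidPrice": str(config.AWS_SPOT_BID_CORE)}
    # With spot instances turned off, fall back to on-demand and send no bid.
    return {"Market": "ON_DEMAND"}


# Stand-in for the constance config; the bid value is made up for illustration.
spot_config = SimpleNamespace(AWS_USE_SPOT_INSTANCES=True, AWS_SPOT_BID_CORE=0.84)
on_demand_config = SimpleNamespace(AWS_USE_SPOT_INSTANCES=False, AWS_SPOT_BID_CORE=0.84)

print(market_settings(spot_config))
print(market_settings(on_demand_config))
```

The point of the sketch is only that flipping the flag in the admin removes the bid entirely, so provisioning no longer depends on spot availability.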

Gonna close this and create a new ticket to track expanding our use of spot instances.

Feb 14 '17 07:02 jezdez

@vitillo Please don't reopen issues and rename them if the issue description doesn't match the title anymore. #229 would have been fine to tackle this.

Feb 14 '17 14:02 jezdez

I am re-opening this issue as I don't think that adding different instance types or more regions is necessarily the best or only way to solve this issue, which will likely require some more thought.

Some comments:

Our Spark configuration and ETL jobs have been heavily tuned over time to run on c3.4xlarge instances. Introducing new instance types would require a significant amount of manual QA.
Instances in other regions that read data from us-west (S3) incur S3 costs ($0.02 per GB).
Instances in other regions that write data to us-west (S3) incur EC2 costs ($0.02 per GB).
ATMO could request clusters composed partially of on-demand instances (master & core nodes) and partially of spot instances (task nodes). That way the cluster would reach a ready state fairly quickly, allowing the analyst to start working while spot instances are added dynamically as they become available.
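As a rough sketch of the last point, the mixed cluster could be described with EMR instance groups like this. This is an assumption about how it might look, not ATMO's implementation: the function name `build_instance_groups`, the group names, and the counts and bid in the usage line are hypothetical. The dictionary keys follow the `InstanceGroups` structure accepted by boto3's EMR `run_job_flow`, but the snippet only builds the data and never calls AWS.

```python
def build_instance_groups(core_count, task_count, spot_bid,
                          instance_type="c3.4xlarge"):
    """Build EMR instance groups: on-demand master/core, spot task nodes.

    Hypothetical helper. Master and core nodes are on-demand so the cluster
    can become ready without waiting on the spot market; task nodes are spot
    and can join later when capacity is available.
    """
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append({
            "Name": "task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "BidPrice": str(spot_bid),  # EMR expects the bid as a string
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


# Illustrative numbers only: 2 on-demand core nodes, 8 spot task nodes.
groups = build_instance_groups(core_count=2, task_count=8, spot_bid=0.84)
for group in groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

The resulting list could be passed as `Instances={"InstanceGroups": groups}` when creating the cluster. Keeping c3.4xlarge as the default sidesteps the QA concern raised above, since no new instance type is introduced.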

Feb 14 '17 14:02 vitillo

@vitillo Ah, that's the info that was missing when you reopened it; the streams have crossed, so to speak.

Feb 14 '17 14:02 jezdez

@jezdez Yeah; adding support for different instance types (#229) makes a lot of sense for various reasons but I don't think it will necessarily fix this specific problem. This is why I re-opened it.

Feb 14 '17 14:02 vitillo
