telemetry-analysis-service
Handle gracefully the lack of spot instances
See the cluster at [0]. It spent >70 minutes provisioning. We need to add the ability to turn off spot instances, so that we can guarantee clusters to people.
Longer-term, we need to expand our use of spot instances. Different instance types, different regions, and more would give us a wider selection of machines and reduce this problem.
[0] https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-1569LDWQXO5A7
Turning off spot instances is possible with the constance config AWS_USE_SPOT_INSTANCES; admins can edit it at https://analysis.telemetry.mozilla.org/admin/constance/config/. The same applies to our bid, which can be changed via the AWS_SPOT_BID_CORE config.
Gonna close this and create a new ticket to track expanding our use of spot instances.
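For illustration, here is a rough sketch of how a provisioning helper might honour the two settings mentioned above. The function name and the way the values are passed in are hypothetical; in ATMO they would presumably be read from constance (AWS_USE_SPOT_INSTANCES and AWS_SPOT_BID_CORE). The `Market` and `BidPrice` keys are the ones EMR instance group configurations use, with the bid given as a string.

```python
def core_group_market(use_spot_instances, spot_bid_core):
    """Return the market-related part of an EMR instance group config.

    Hypothetical helper: `use_spot_instances` and `spot_bid_core` stand in
    for the AWS_USE_SPOT_INSTANCES and AWS_SPOT_BID_CORE settings.
    """
    if use_spot_instances:
        # EMR expects the bid price as a string, e.g. "0.84".
        return {"Market": "SPOT", "BidPrice": str(spot_bid_core)}
    # With spot turned off, fall back to on-demand and send no bid at all.
    return {"Market": "ON_DEMAND"}
```

With the flag off, no bid is sent, so the cluster does not wait on spot capacity.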
@vitillo Please don't reopen issues and rename them if the issue description doesn't match the title anymore. #229 would have been fine to tackle this.
I am re-opening this issue as I don't think that adding different instance types or more regions is necessarily the best or only way to solve this issue, which will likely require some more thought.
Some comments:
- Our Spark configuration and ETL jobs have been heavily tuned over time to run on c3.4xlarge instances. Introducing new instance types would require a significant amount of manual QA.
- Instances in other regions that read data from us-west (S3) incur S3 costs ($0.02 per GB).
- Instances in other regions that write data to us-west (S3) incur EC2 costs ($0.02 per GB).
- ATMO could request clusters composed partially of on-demand instances (master & core nodes) and partially of spot instances (task nodes). In doing so the cluster will reach a ready state rather quickly, allowing the analyst to start their work while spot instances are added dynamically when they become available.
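To make the last point concrete, here is a minimal sketch of what the mixed request could look like as EMR instance groups. The function name, group names, and default arguments are made up for illustration and are not what ATMO currently does; the dictionary keys (`InstanceRole`, `Market`, `BidPrice`, `InstanceType`, `InstanceCount`) are the ones the EMR API takes for instance groups, e.g. via boto3's `run_job_flow(Instances={"InstanceGroups": [...]})`.

```python
def build_instance_groups(core_count, task_count, spot_bid,
                          instance_type="c3.4xlarge"):
    """Build EMR instance groups: on-demand master/core, spot task nodes.

    Hypothetical helper for illustration only.
    """
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        # Only the task nodes depend on spot availability, so a shortage
        # of spot capacity does not hold up the master and core nodes.
        groups.append({
            "Name": "task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "BidPrice": str(spot_bid),  # EMR expects the bid as a string
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups
```

Keeping the same instance type for all groups sidesteps the QA concern above, since the Spark configuration stays tuned for c3.4xlarge.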
@vitillo Ah, that's the info that was missing when you reopened it, the streams have crossed so to say.
@jezdez Yeah; adding support for different instance types (#229) makes a lot of sense for various reasons, but I don't think it will necessarily fix this specific problem. This is why I re-opened it.