clean auto-termination

coyotemarin opened this issue 11 years ago • 5 comments

mrjob can currently set up job flows so that the instances automatically shut themselves down when the job flow has been idle for a certain amount of time.

However, when we do this, EMR lists the job flow as FAILED, even when the job succeeded, which confuses people. I'd like to turn on auto-termination and pooling by default in v0.5.0, but it needs to be cleaner.

I think the current script should, before shutting down whatever node it's on, find the AWS credentials in the Hadoop configs, connect to the appropriate EMR endpoint, and terminate the job flow. mrjob can pass in the endpoint if need be, but the script needs to be able to figure out the job flow ID on its own (the bootstrap script's arguments are fixed at the moment the job flow is created, before its ID is known).

I bet we could get this down to 20-30 lines of Python if we worked at it; if we want to use boto, we should copy the appropriate code, rather than installing the library.
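For illustration, here's an untested sketch of what that script might look like with boto, assuming the credentials sit under the classic fs.s3n.* keys in /etc/hadoop/conf/core-site.xml, and with the job flow ID still passed in from outside, since that's the open question:

# Hypothetical sketch of the self-terminating idle script (untested).
# Assumes credentials under fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey
# in /etc/hadoop/conf/core-site.xml, boto importable on the node, and the
# job flow ID and region passed in as arguments.
import sys
import xml.etree.ElementTree as ET

import boto.emr


def hadoop_conf(path='/etc/hadoop/conf/core-site.xml'):
    """Parse Hadoop's core-site.xml into a dict of name -> value."""
    conf = {}
    for prop in ET.parse(path).getroot().iter('property'):
        conf[prop.findtext('name')] = prop.findtext('value')
    return conf


def main(job_flow_id, region):
    conf = hadoop_conf()
    emr = boto.emr.connect_to_region(
        region,
        aws_access_key_id=conf['fs.s3n.awsAccessKeyId'],
        aws_secret_access_key=conf['fs.s3n.awsSecretAccessKey'])
    emr.terminate_jobflow(job_flow_id)


if __name__ == '__main__':
    main(*sys.argv[1:3])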

coyotemarin avatar Dec 12 '13 19:12 coyotemarin

Haven't found a way to get the job flow ID yet, but I figured out how to get the EC2 instance ID (basically wget/curl http://169.254.169.254/latest/meta-data/instance-id). We can then match it up with MasterInstanceId from the EMR API's DescribeJobFlows.
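Putting those two pieces together, a sketch of the lookup might look like this (untested; the masterinstanceid attribute name is a guess at how boto 2 flattens the DescribeJobFlows response):

# Sketch (Python 2 era, untested): figure out which job flow we're on by
# matching our EC2 instance ID against each job flow's master instance ID.
# 'masterinstanceid' is a guess at how boto 2 exposes
# Instances.MasterInstanceId on its JobFlow objects.
import urllib2

import boto.emr

METADATA_URL = 'http://169.254.169.254/latest/meta-data/instance-id'


def current_job_flow_id(region):
    instance_id = urllib2.urlopen(METADATA_URL).read().strip()
    emr = boto.emr.connect_to_region(region)
    for jf in emr.describe_jobflows(states=['RUNNING', 'WAITING']):
        if getattr(jf, 'masterinstanceid', None) == instance_id:
            return jf.jobflowid
    return None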

coyotemarin avatar Dec 12 '13 20:12 coyotemarin

I know this is a pretty old issue, but I am still seeing the behavior where clusters that are used for pooling show up as failed. Is this still an issue with mrjob, or could it be something about our configuration?

jdavidheiser avatar Aug 04 '17 19:08 jdavidheiser

The main issue here is that EMR clusters generally aren't run with the right IAM permissions to terminate EMR clusters (including themselves).

coyotemarin avatar Sep 08 '18 02:09 coyotemarin

The cluster ID is in /var/aws/emr/userData.json. You can run something like this:

$ aws emr terminate-clusters --cluster-ids $(jq -r .clusterId /var/aws/emr/userData.json)

to grab the cluster ID and ask to terminate the cluster, but if you're using the default instance profile, you'll get an error like:

An error occurred (AccessDeniedException) when calling the TerminateJobFlows operation: User: arn:aws:sts::333333333333:assumed-role/mrjob-93ede2238d7e1d8b/i-0583f0abf36f213ae is not authorized to perform: elasticmapreduce:TerminateJobFlows on resource: arn:aws:elasticmapreduce:us-west-2:333333333333:cluster/j-20T3TQX661LJ6
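For reference, roughly the same thing from Python with boto3 (an untested sketch; it runs into the same permission error unless the instance profile allows elasticmapreduce:TerminateJobFlows):

# Untested sketch: terminate the current cluster from the master node with
# boto3. Fails with AccessDeniedException under the default instance
# profile, just like the CLI command above. You may need to pass
# region_name to boto3.client() explicitly.
import json

import boto3

with open('/var/aws/emr/userData.json') as f:
    cluster_id = json.load(f)['clusterId']

boto3.client('emr').terminate_job_flows(JobFlowIds=[cluster_id])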

coyotemarin avatar Sep 08 '18 02:09 coyotemarin

mrjob creates its own IAM instance profile and service role, so we could conceivably add elasticmapreduce:TerminateJobFlows to the instance profile's policy. The idle timeout script could then attempt aws emr terminate-clusters and fall back to sudo shutdown -h now if that fails.
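An untested sketch of that fallback logic:

# Untested sketch of the proposed behavior: try a clean API termination
# first, and fall back to halting the node (what the idle script does
# today) if the instance profile lacks TerminateJobFlows.
import json
import subprocess


def terminate_or_shutdown():
    with open('/var/aws/emr/userData.json') as f:
        cluster_id = json.load(f)['clusterId']
    try:
        subprocess.check_call(
            ['aws', 'emr', 'terminate-clusters', '--cluster-ids', cluster_id])
    except (subprocess.CalledProcessError, OSError):
        subprocess.check_call(['sudo', 'shutdown', '-h', 'now'])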

coyotemarin avatar Sep 08 '18 02:09 coyotemarin