
mrjob audit-emr-usage should show bootstrap savings due to cluster sharing

Open coyotemarin opened this issue 6 years ago • 3 comments

Now that Amazon bills by the second rather than the full hour, cluster pooling is not usually a good way to save money. However, it does save you from having to run your bootstrap script (which you have to pay for) again.

When a cluster runs multiple jobs, mrjob audit-emr-usage should track how much time was saved by not having to re-run the bootstrap script, and subtract that from idle time to determine waste (this may be negative, in which case pooling is saving the user money).

This also applies to persistent clusters that people run multiple jobs on manually.
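As a rough sketch of the accounting proposed above (not what `audit-emr-usage` currently does): the function name `pooling_waste` and its parameters are hypothetical, and it assumes every job after the first would otherwise have paid for one full bootstrap on its own cluster.

```python
from datetime import timedelta


def pooling_waste(idle_time, bootstrap_time, num_jobs):
    """Return idle time minus the bootstrap time that cluster sharing avoided.

    Hypothetical helper: assumes each job after the first would have
    needed its own cluster, and so its own bootstrap run.
    A negative result means sharing the cluster saved money overall.
    """
    # The first job pays for bootstrapping; every later job reuses it.
    reruns_avoided = max(num_jobs - 1, 0)
    bootstrap_saved = bootstrap_time * reruns_avoided
    return idle_time - bootstrap_saved


# Three jobs on one cluster, 4 minutes of bootstrapping, 5 minutes idle:
# two bootstrap runs (8 minutes) were avoided, so waste is negative.
waste = pooling_waste(timedelta(minutes=5), timedelta(minutes=4), 3)
print(waste.total_seconds())
```

With these inputs the waste comes out negative, which is the case where pooling is saving the user money.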

coyotemarin, Aug 07 '18 21:08

Shoot, currently the script doesn't distinguish time spent provisioning the cluster (STARTING state) from time spent bootstrapping it. This isn't available from DescribeCluster; maybe there's some other way to get that information?

coyotemarin, Aug 08 '18 17:08

ListInstances shows the same ReadyDateTime as the cluster.

coyotemarin, Aug 08 '18 17:08

Okay, it looks like you can call ListInstances, then the EC2 API's DescribeInstances, and look at the LaunchTime of each instance in the cluster. It's probably close enough to treat billing as starting at either the last LaunchTime before the cluster's ReadyDateTime or 10 minutes after the cluster's CreationDateTime, whichever comes first.
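A minimal sketch of that heuristic, kept free of AWS calls so it can be tried on its own. The function name `estimate_billing_start` and the `cap` parameter are hypothetical. It assumes the caller has already collected each instance's `LaunchTime` (for example via EC2's DescribeInstances) and the cluster's `CreationDateTime` and `ReadyDateTime`, and it treats a launch exactly at the ready time as "before" it.

```python
from datetime import datetime, timedelta


def estimate_billing_start(creation, ready, launch_times,
                           cap=timedelta(minutes=10)):
    """Estimate when billing began for a cluster.

    Hypothetical helper: returns the last instance LaunchTime at or
    before the cluster's ReadyDateTime, or CreationDateTime + cap,
    whichever comes first.
    """
    # Ignore instances launched after the cluster became ready
    # (e.g. ones added later by resizing).
    before_ready = [t for t in launch_times if t <= ready]

    candidates = [creation + cap]
    if before_ready:
        candidates.append(max(before_ready))
    return min(candidates)


creation = datetime(2018, 8, 8, 17, 0)
ready = datetime(2018, 8, 8, 17, 12)
launches = [
    datetime(2018, 8, 8, 17, 3),
    datetime(2018, 8, 8, 17, 4),
    datetime(2018, 8, 8, 17, 20),  # launched after the cluster was ready
]
print(estimate_billing_start(creation, ready, launches))
```

If no launch times are known, the estimate falls back to the 10-minute cap after creation.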

coyotemarin, Aug 08 '18 18:08