mrjob
mrjob audit-emr-usage should show bootstrap savings due to cluster sharing
Now that Amazon bills by the second rather than the full hour, cluster pooling is not usually a good way to save money. However, it does save you from having to run your bootstrap script (which you have to pay for) again.
When a cluster runs multiple jobs, mrjob audit-emr-usage should track how much time was saved by not having to re-run the bootstrap script, and subtract that from idle time to determine waste (this may be negative, in which case pooling is saving the user money).
This also applies to persistent clusters that people run multiple jobs on manually.
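A rough sketch of the accounting described above, not what audit-emr-usage does today. The function name and parameters are hypothetical, and it assumes the bootstrap script takes about the same time on every run and that the first job on the cluster pays for bootstrapping while each later job avoids one re-run. It works in wall-clock time only and ignores instance counts.

```python
from datetime import timedelta


def pooling_waste(idle_time, bootstrap_time, num_jobs):
    """Return idle time minus the bootstrap time saved by sharing a cluster.

    A negative result means pooling saved more bootstrap time than it
    cost in idle time. (Hypothetical helper, not part of mrjob.)
    """
    # Assumption: the first job pays for bootstrapping once; every
    # additional job on the same cluster avoids one bootstrap run.
    bootstrap_saved = bootstrap_time * max(num_jobs - 1, 0)
    return idle_time - bootstrap_saved
```

For example, a cluster that ran three jobs, sat idle for 5 minutes in total, and took 4 minutes to bootstrap would come out at 5 - 8 = -3 minutes of waste, i.e. a net saving.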
Shoot, currently the script doesn't distinguish time spent provisioning the cluster (STARTING state) from time spent bootstrapping it. This isn't available from DescribeCluster; maybe there's some other way to get that information?
ListInstances shows the same ReadyDateTime as the cluster.
Okay, looks like you use ListInstances and then the EC2 API's DescribeInstances and look at the LaunchTime for each instance in the cluster. It's probably close enough to consider billing to start either at the last LaunchTime before the cluster's ReadyDateTime or 10 minutes after the cluster's CreationDateTime, whichever comes first.
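Here is a minimal sketch of that heuristic as a pure function, so it can be tried without touching AWS. The function name and the max_wait parameter are hypothetical. It assumes the caller has already collected the timestamps, e.g. via boto3's EMR list_instances() and describe_cluster() plus EC2 describe_instances(), and that all datetimes are either all naive or all timezone-aware. Falling back to the 10-minute cutoff when no instance launched before ReadyDateTime is my own assumption.

```python
from datetime import datetime, timedelta


def estimate_billing_start(creation, ready, launch_times,
                           max_wait=timedelta(minutes=10)):
    """Estimate when billing started for a cluster.

    creation, ready: the cluster's CreationDateTime and ReadyDateTime.
    launch_times: the LaunchTime of each EC2 instance in the cluster.
    (Hypothetical helper, not part of mrjob.)
    """
    cutoff = creation + max_wait
    # Only consider instances launched at or before the cluster was ready.
    launched = [t for t in launch_times if t <= ready]
    if not launched:
        # Assumption: with no usable LaunchTime, use the cutoff alone.
        return cutoff
    # Last launch before ready, or the cutoff, whichever comes first.
    return min(max(launched), cutoff)
```

So for a cluster created at 12:00 and ready at 12:08, with instances launched at 12:02 and 12:03, this gives 12:03. If the last launch were at 12:12 (with the cluster ready at 12:15), the 10-minute cutoff wins and it gives 12:10.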