Proposal: add support for rewrite-for-retry scripts

Open pburka opened this issue 7 years ago • 2 comments

The sim memory-model does a great job of predicting memory requirements most of the time, but when it gets it wrong, simulation pieces run slowly until they eventually fail. Unfortunately, the default behavior of cook is to run these pieces again. This is very unlikely to succeed, and is particularly wasteful, as these are usually stragglers in the job group to begin with.

Ideally, if we know that a job failed due to OOM, we should retry the job with more memory. If we were able to do this, we would increase the number of sims which succeed without user intervention, and improve utilization on the simfarm by reducing the number of failed jobs. We could also tune the memory model to use smaller heap sizes for all jobs (i.e. reduce the margin of error), confident in the knowledge that, when the model is wrong, the sim will quickly be retried with more memory. This might reduce average job memory requirements by 5-10%, effectively increasing simfarm capacity by that amount. We might also decrease the tolerance for excessive GC in sim jobs, allowing them to fail faster so they can be quickly restarted with more memory.

I propose that this can be done generically and extensibly by adding a 'rewrite-for-retry script' to the job descriptions submitted to the server. If the job fails, cook will attempt to invoke the rewrite script on the host where the job failed. The script would have an opportunity to examine the failed job to determine what, if any, changes should be made on retry. For sims, the script would examine logs to detect OOM and propose to cook a larger memory requirement and a new command line with larger -Xmx and -Xms options. If the script itself fails (e.g. because the job host is borked), cook will just retry the job unchanged.
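
To make the idea concrete, here is a minimal sketch of what a rewrite-for-retry script for sims might do. Everything about the interface is an assumption for illustration: the job is modeled as a plain dict with hypothetical `command` and `mem` fields, the function name `rewrite_for_retry` and the 1.5x growth factor are made up, and heap sizes are assumed to be given in megabytes (`-Xmx4096m`). Cook's actual job schema may look different.

```python
import re

# Message the JVM prints when the heap is exhausted.
OOM_MARKER = "java.lang.OutOfMemoryError"
# Matches heap flags expressed in megabytes, e.g. -Xmx4096m or -Xms2048m.
HEAP_FLAG = re.compile(r"-(Xmx|Xms)(\d+)m")


def rewrite_for_retry(job, log_text, factor=1.5):
    """Return a rewritten copy of `job` if the log shows an OOM.

    Returns None when no OOM is found, meaning "retry unchanged".
    `job` is a hypothetical dict with "command" and "mem" (MB) keys.
    """
    if OOM_MARKER not in log_text:
        return None

    def bump(match):
        # Scale the numeric heap size and keep the flag name.
        return "-{}{}m".format(match.group(1), int(int(match.group(2)) * factor))

    new_job = dict(job)  # shallow copy, leave the original untouched
    new_job["command"] = HEAP_FLAG.sub(bump, job["command"])
    new_job["mem"] = int(job["mem"] * factor)
    return new_job
```

Returning `None` for "no change" mirrors the fallback described above: if the script has nothing to propose, the job is retried as originally submitted.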

While the initial proposal is to increase memory size, rewrite-for-retry scripts would have freedom to make any number of changes on retry. A script might even examine the logs, determine that the failure was due to an unrecoverable error, and elect not to waste resources rerunning the job at all.
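
As a rough illustration of that wider freedom, a script could classify the failure into one of three outcomes. This is only a sketch: the `RetryDecision` enum, the `decide` function, and the list of "unrecoverable" log markers are hypothetical, and a real script would use whatever signals the sim logs actually contain.

```python
from enum import Enum


class RetryDecision(Enum):
    RETRY_UNCHANGED = "retry-unchanged"
    RETRY_REWRITTEN = "retry-rewritten"
    DO_NOT_RETRY = "do-not-retry"


OOM_MARKER = "java.lang.OutOfMemoryError"
# Hypothetical examples of errors that rerunning cannot fix.
UNRECOVERABLE_MARKERS = (
    "java.lang.NoClassDefFoundError",
    "Invalid simulation config",
)


def decide(log_text):
    """Map the failed job's log to a retry decision."""
    if any(marker in log_text for marker in UNRECOVERABLE_MARKERS):
        return RetryDecision.DO_NOT_RETRY
    if OOM_MARKER in log_text:
        return RetryDecision.RETRY_REWRITTEN
    # Unknown failure: fall back to the existing behavior.
    return RetryDecision.RETRY_UNCHANGED
```

Checking the unrecoverable markers first means a job that hit both a fatal error and an OOM is not rerun, which matches the goal of not wasting resources.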

Support for rewrite-for-retry scripts could help us improve simulation success rates while increasing effective simfarm capacity.

pburka avatar Apr 06 '17 22:04 pburka

I think Cook already has support for the two primitives required to build something like this:

  1. the ability to run code on task completion, in the same container as the primary job (this is the cook finish API)
  2. the ability to update a job's parameters, e.g. mem or CPU

Wil Y - make sense?

tnn1t1s avatar Apr 07 '17 00:04 tnn1t1s

FYI, we have a feature in our 2018 roadmap for addressing the out-of-memory issue, but it's not going to be as generic as the feature described here.

DaoWen avatar Mar 31 '18 16:03 DaoWen