levant Add option to wait until batch jobs *complete* prior to returning

Open dansteen opened this issue 6 years ago • 16 comments

Currently levant only waits until the job has reached a "running" state. But it would be great (for use in scripts) to be able to, optionally, wait until it has completed. Then levant could return with the appropriate error code or success depending on whether the job failed or not.

Thanks!

Jan 02 '18 21:01 dansteen

Thanks for the report @dansteen. That sounds like a sensible idea to me; I am thinking it would need to check that the job does not have a periodic stanza otherwise it would be endless.

Jan 02 '18 21:01 jrasell
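A minimal sketch of the check discussed above, not Levant's implementation. It assumes `job` and `summary` are the decoded JSON bodies of Nomad's `/v1/job/:job_id` and `/v1/job/:job_id/summary` endpoints; the function names are hypothetical, and treating any failed or lost allocation as a failure is a simplifying assumption that may be stricter than wanted for jobs that retry.

```python
from typing import Optional


def is_periodic(job: dict) -> bool:
    """True when the job definition carries a periodic stanza."""
    return job.get("Periodic") is not None


def batch_outcome(job: dict, summary: dict) -> Optional[bool]:
    """Return True on success, False on failure, None while still in progress."""
    if is_periodic(job):
        # A periodic job keeps launching children, so waiting would never end.
        raise ValueError("periodic jobs never finish; refusing to wait")
    if job.get("Status") != "dead":
        return None  # pending or running: the caller should keep polling
    groups = summary.get("Summary", {}).values()
    # Assumption: any failed or lost allocation means the run failed.
    failed = sum(g.get("Failed", 0) + g.get("Lost", 0) for g in groups)
    return failed == 0
```

A wrapper could poll until the result is no longer `None` and then exit with 0 or 1 accordingly.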

yep, that makes sense. Thanks!

Jan 02 '18 21:01 dansteen

This is something that I need for my work. If you have no objections, I'd like to start working on it.

Thanks!

Jan 09 '18 17:01 dansteen

Hey @dansteen, I would like to take a day to get my thoughts straight, as I do have concerns about this sitting in Levant, which is aimed to be a template and deployment helper rather than a batch job tracker. I am really busy currently, so if I could reply tomorrow with some organised thoughts I would appreciate it.

Jan 09 '18 17:01 jrasell

np at all. I didn't mean to be pushy about it. By way of rationale (once you are ready to think about it), we have a deploy process that uses multiple jobs to deploy an application. These jobs need to be done in order, and if any previous job fails, the subsequent ones should not run. The jobs are typically a combination of Batch and Service jobs, and are things like migrations (batch), asset placement (batch), service restart (service), and cache flush (batch). Nomad does not have a built-in way to do job orchestration, so we have a script that runs through each job and runs it if the previous job has succeeded.

My goal is to use levant as the part of that script that handles actual interaction with nomad. However, this will only work if levant will wait for the batch jobs to succeed or fail. Otherwise we will start running the next part of the job prior to the previous one finishing.

Jan 09 '18 17:01 dansteen
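The ordering requirement described above can be sketched as a small runner that stops at the first failing step. This is an illustration only, not the script from the thread: `run_in_order` is a hypothetical name, and each step is assumed to be a command that exits non-zero on failure (for example a `levant deploy` invocation, once it waits for batch jobs to finish).

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_in_order(steps: Sequence[Sequence[str]]) -> Optional[int]:
    """Run each command in turn; return the index of the first failing step, or None."""
    for index, command in enumerate(steps):
        result = subprocess.run(command)
        if result.returncode != 0:
            # Later steps are skipped so they never run after a failure.
            print(f"step {index} failed with exit code {result.returncode}", file=sys.stderr)
            return index
    return None
```

The runner only works if every step blocks until its job has actually finished, which is the behaviour this issue asks for.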

not at all pushy, I just wanted to give you an update.

I have had some time to think about it "properly" and I would like to discuss the option of adding a new command to Levant named something like "batch-mon" (please help with the name). This allows Levant to keep a clear distinction between deployments and job tracking whilst still adding the functionality. I am also a firm believer in keeping pipelines distinct, and in this scenario you could have the following:

pipeline 1: deploy job 1; if success run pipeline 2
pipeline 2: monitor batch job; if success run pipeline 3
pipeline 3: deploy job 2; if success run pipeline 4
rinse and repeat

I would appreciate any feedback or thoughts you have on this approach.

Jan 11 '18 10:01 jrasell

Having segregated functionality is certainly a good way to go forward. I like your pipeline idea for how the code should be organized, and I definitely like the idea of being able to handle multiple jobs in a single run.

I could see a problem with having a separate command for this though. What would you do in a case (as described above) where multiple jobs are run, some of which are batch jobs and some of which are service jobs? If you have a separate command that sort of mixing wouldn't make sense. Having a flag like "--monitor-batch" might be a better way to go for this.

In general, for the sake of automation, I think having the command line be as job-type agnostic as possible is a good thing. Otherwise the automation wrapper will have to know what types of jobs it's running in order to formulate the correct levant command line.

I actually have an implementation of this that I put together already that demonstrates this functionality (I needed something in the meantime). I wrote most of it prior to seeing your comments above, so it doesn't incorporate the pipeline concept, but we can certainly modify it in any way that makes sense depending on how we move forward. See PR #85

Jan 12 '18 16:01 dansteen

One additional note on the pull request is that, for batch jobs, in a case where the job is re-run but not modified, and the original job is still running, nomad will not restart the job; it just keeps running the original one. This means that it does not create new allocations based on the Evaluation of the new run. The upshot of this is that if an existing job is in a restart cycle, and you re-run it as-is, no allocation log messages will be reported if levant times out (using the timeout function added in the PR above). This is something we can fix, but is a bit painful.

Jan 12 '18 16:01 dansteen

Could you provide a pseudo example for the following, so I can understand better?

> I could see a problem with having a separate command for this though. What would you do in a case (as described above) where multiple jobs are run, some of which are batch jobs and some of which are service jobs? If you have a separate command that sort of mixing wouldn't make sense. Having a flag like "--monitor-batch" might be a better way to go for this.

I have a response but I want to make sure I am not missing anything. If you're on gitter feel free to ping me on there, as it might be a little easier.

Jan 12 '18 16:01 jrasell

In regard to turning this into a separate command: how did you want the command to be informed? Do you want the user to pass in a job name or an EvaluationID? Using a job name is more intuitive, but subject to mushiness since the latest evaluation for that job may or may not be the one that you just inserted. Using the EvaluationID is more precise, but you then need the output from the "deploy" command in order to effectively run the "monitor" command. From an automation perspective, this is problematic until we get https://github.com/jrasell/levant/issues/24 in.

Perhaps the automation part could be mitigated by allowing the json output of the "deploy" command (from https://github.com/jrasell/levant/issues/24) to be piped into the input of the "monitor" command. But then we will need to print all log messages to stderr or something like that....

Jan 12 '18 18:01 dansteen
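To make the stdout/stderr split concrete, here is a hedged sketch of the idea. Nothing in it is Levant's real output format: the `eval_id` field and both function names are hypothetical. Logs go to one stream and a single JSON document goes to the other, so the second stage can read the evaluation ID from a pipe without seeing log noise.

```python
import json
import sys
from typing import Iterable, TextIO


def report(eval_id: str, logs: Iterable[str],
           out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> None:
    """Write log lines to `err` and one machine-readable JSON result to `out`."""
    for line in logs:
        print(line, file=err)
    json.dump({"eval_id": eval_id}, out)  # hypothetical result shape
    out.write("\n")


def read_eval_id(stream: TextIO) -> str:
    """Read the JSON result written by `report` and return the evaluation ID."""
    return json.loads(stream.read())["eval_id"]
```

With this split, a shell pipe between the two stages carries only the JSON document, while the log lines still reach the terminal as they are produced.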

@dansteen if all log output is in json, you can parse the output and decide based on log attributes as to whether it's Levant output or deploy-related output.

Jan 15 '18 23:01 josegonzalez

@josegonzalez That's true. Do you intend to buffer all output until the deploy has completed and then output a json document containing all the logs that levant has generated during its run? I kind of like getting messages as things progress. Especially since a deploy can take several minutes...

Jan 21 '18 19:01 dansteen

In our case, our python wrapper just grabs each newline (which is a single log line) as it happens, and outputs the parsed version on the fly (also storing all output in the db).

Jan 21 '18 20:01 josegonzalez

@josegonzalez got it. So you basically want an --output json style option that will output messages just like now, except that each line is wrapped in json along with some metadata elements.

Jan 24 '18 00:01 dansteen

Yep that would be your best bet, then any tool that triggers levant can just poll for each newline as output, parse it as json, and decide what to do with it.

Jan 24 '18 15:01 josegonzalez
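The line-by-line approach described above might look like the following sketch. It assumes one JSON object per output line; the `message` key used for the fallback is a hypothetical field name, not something Levant is known to emit.

```python
import json
from typing import Iterable, Iterator


def parse_log_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line, as soon as that line is available."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Keep non-JSON output instead of dropping it.
            record = {"message": raw}
        yield record
```

Because it is a generator, it can be fed the `stdout` of a `subprocess.Popen(..., stdout=subprocess.PIPE, text=True)` process and will hand back records while the deploy is still running, rather than buffering until the end.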

A lot of improvements have been made to nomad since this issue.

Did anyone find a workaround for this, to wait for batch jobs to succeed? The linked #141 seems stale and levant is still lacking HCL2 support.

Sep 28 '21 13:09 EtienneBruines
