allow `scaling` ([0-1]) for `system` jobs to allow for `job start` semantics
Proposal
As of today, there is no job start construct.
Some workflows need certain jobs to be "stopped", some other activity to be performed, and then the jobs to be "started" again.
For service or periodic jobs I achieve this by scaling the job down to 0 when needed, and then scaling it back up to 1 when I want to "start" it again.
I need the same facility for system jobs (I can live with the exception that system jobs are strictly limited to the range [0-1])
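For service jobs, the stop/start dance described above maps onto the `nomad job scale` CLI. A minimal sketch, assuming a job named `my-job` with a task group named `app` (both placeholder names):

```shell
# "Stop": scale the task group down to zero allocations
nomad job scale my-job app 0

# ... perform the other activity ...

# "Start": scale the task group back up to one
nomad job scale my-job app 1
```

The ask in this issue is that the same two commands work for a `system` job, restricted to counts 0 and 1.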
Use-cases
The issue is that our Nomad jobs are Terraformed, hence "re-running" a job would involve running an automated `terraform apply`.
(We want to avoid having to do that)
There is some activity which needs a "fleet-wide" system job to be stopped for some time and then started after the other work is complete.
Rewriting the system job as a service job is a stretch possibility, but would be an unnecessary burden.
Attempted Solutions
The only current way I see is to somehow run an automated terraform apply in the specific directory.
If there is any other way to do this, that would also be helpful!
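The "automated terraform apply in the specific directory" workaround could look roughly like the following. This is a hedged sketch: the directory `jobs/my-system-job` is a placeholder, and `-chdir` requires Terraform 0.14 or newer.

```shell
# Re-apply the Terraform config that owns the job, non-interactively.
# Risk: this re-applies EVERYTHING in that directory, not just the job.
terraform -chdir=jobs/my-system-job apply -auto-approve
```

The non-targeted apply is exactly why this workaround is unattractive (see the side-effect concerns below).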
@shantanugadgil I'm unclear why nomad job stop doesn't work as it exists today for stopping all allocs of a system job? Can you get into a little more detail?
@tgross the job stop is not the problem. 'starting' the stopped job is the problem. To start the job again, I have to "run" the job file, correct?
I could use the trick mentioned here: https://github.com/hashicorp/nomad/issues/11077#issuecomment-909336297
... but was wondering if the range of [0,1] seems more logical (less calls from cmdline for me :grin: )
Yeah, I'd rather implement job start (which I'm already not wild about) than overload the meaning of scaling for this case.
Hey @shantanugadgil, we were discussing this request a bit as a team.
We decided that thinking about something like job pause and job start is a fine idea, but it probably would be a separate thing from scaling. For instance, you might want to pause your job with a count of N, and then restart it with the same count (and not have to keep track of your previous count).
Can you provide a bit more detail on why re-terraforming is an issue? Is it just that the terraform config is large and takes time? Or is there some other technical reason why that's an issue on the Nomad side?
And is avoiding re-running TF the primary reason for wanting job pause/start?
Just want to make sure we aren't overlooking anything!
Hi @mikenomitch
There are quite a few things to unpack here :slightly_smiling_face: I'll try to answer the questions as much as possible in detail, in separate sections.
We decided that thinking about something like job pause and job start is a fine idea, but it probably would be a separate thing from scaling. For instance, you might want to pause your job with a count of N, and then restart it with the same count (and not have to keep track of your previous count).
This is by far the ideal way I would like to use the cmdline (`job pause`, do my other activity, `job resume`). Though I see the technical challenges of what a "pause" should mean, and the subsequent questions it brings to my mind:
- does pause mean SIGTSTP?
- what about child processes?
- does the software itself support freezing/thawing? etc.
I wouldn't mind if "pause" internally made the count 0 (remembering the initial count), and a resume would start with the previous count.
Can you provide a bit more detail on why re-terraforming is an issue? Is it just that the terraform config is large and takes time? Or is there some other technical reason why that's an issue on the Nomad side?
And is avoiding re-running TF the primary reason for wanting job pause/start?
The main reason for wanting to avoid terraforming the said system job is to avoid any other infrastructural side effects, since currently the infrastructure (ASGs, etc.) is defined in the same directory as the job definition (I know this should not occur in a well maintained environment, but y'know ... :grimacing: )
We could separate the infrastructure from the jobs to prevent accidental side effects, but that would be a change to the current code layout.
We could also do -target based Terraform runs, but I am not a fan of using -target in automation :)
Just want to make sure we aren't overlooking anything!
All said and done, I have a basic question: how do I implement stop/start of a system job at all (until pause/resume may get implemented)? i.e. "how do I stop and start a system job cleanly?"
A1: Even if I were to use Terraform automation, I can't set the count to 0.
Then the option could be to terraform destroy the job (only the job), but then that loses all history of the job.
I know I could "stop -purge" using the HTTP API and then do a `terraform apply`, but at this point things are looking quite messy to me! :( :(
The nicest advantage of "allowing system jobs to scale between 0 and 1" gives me is the ability to do this using simple things like curl using shell code ... and we already use this mechanism for our existing batch jobs:
- pause == `make count0`
- unpause == `make count1`
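The curl-based pause/unpause mechanism above targets Nomad's job scaling HTTP API (`POST /v1/job/:job_id/scale`). A minimal sketch of building that request body in Python; the group name `agent`, the job name `fleet-agent`, and the use of the default agent address `http://localhost:4646` are all assumptions for illustration:

```python
import json

NOMAD_ADDR = "http://localhost:4646"  # assumed default Nomad agent address

def scale_payload(group: str, count: int, message: str) -> str:
    """Build the JSON body for POST /v1/job/:job_id/scale."""
    return json.dumps({
        "Count": count,                 # desired allocation count
        "Target": {"Group": group},     # task group to scale
        "Message": message,             # recorded in the job's scaling events
    })

# "pause" == scale the group to 0; "unpause" == scale it back to 1
pause = scale_payload("agent", 0, "paused for maintenance")
unpause = scale_payload("agent", 1, "maintenance complete")

# Equivalent curl call (not executed here; fleet-agent is a placeholder):
#   curl -X POST -d "$PAYLOAD" "$NOMAD_ADDR/v1/job/fleet-agent/scale"
```

The appeal of the [0-1] proposal is that this exact mechanism, already used for batch jobs, would work unchanged for system jobs.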
Since this was last discussed we actually implemented a definition of a "paused" alloc, but it's different than the implied behavior discussed here (a "paused" job). Not only have we not made progress on defining what a paused job would look like, but using "paused" for allocs probably means we need to rethink what we'd even call a registered-but-not-running job.
We've gotten other feedback that the scale to [0-1] for system jobs was useful (NET-9976), and I'm not sure we shouldn't just do that. Looking at the root issue here (#16963) I think preventing scale-to-0 was a side effect of the fix and not an intentional design.
I know in the past I've resisted the idea of using Nomad as a "job catalog" that represents both running and not-running workloads, but it does seem like a useful pattern and we have no alternative solution.
No promises on roadmap or timeline. Just wanted to give a +1 to scale-system-jobs-to-0 and restart the conversation.
The use case for [0-1] is very much there. It makes sense not only for system jobs but also for batch-type jobs.
The motivation is the same; "not having to resubmit the job"
Hi @shantanugadgil, I wanted to let you know we are actively working on this one.