Nomad job restart
@schmichael Thank you for writing the nomad-job-restart-rfc. This was super helpful!
This PR is not ready for review. I am just opening it to write down my understanding of the flow and ask some questions. We can use this PR as a brainstorming ground to make progress on the RFC and eventually get it into shape to be merged.
Based on nomad/checklist-command, this PR only has changes for the Code section. No Doc changes are included right now.
The PR does not yet implement the monitor (and detach) modes, where `job restart` starts in monitor mode by default and the user can pass `-detach` to run the command in detach mode.
In Phase 1, we'll just implement:

- `nomad job restart -batch-size n -batch-wait d <job_id>`
- `nomad job restart -attach <restart_id>`
- `nomad job restart -cancel <restart_id>`

Pause/resume can come in Phase 2.
My understanding of the changes is as follows:
- New CLI command: `nomad job restart`.
- From the CLI, you jump to the HTTP API client code in `api/jobs.go`.
- From the HTTP API client, you call `Restart`, which calls into the actual HTTP API defined in `command/agent/job_endpoint.go`.
- The `jobRestart` function is called in `command/agent/job_endpoint.go`. It decodes the HTTP request into the API struct `api.JobRestartRequest`, which can then be mapped into the RPC struct `structs.JobRestartRequest` and sent as an RPC.
- The following structs are defined (a rough sketch of the request struct is included after this list):
  - API request: `api.JobRestartRequest` (I didn't define the API response yet. I am still contemplating what would be needed in the API response struct. Do we just send the UUID back to the user?)
  - RPC request struct: `structs.JobRestartRequest`
  - RPC response struct: `structs.JobRestartResponse`
- The `Restart` RPC is called. It checks permissions on the auth token, gets a snapshot from the state store, gets the allocations for the job, and restarts all the allocations (restarting the allocations is not coded yet). Every time we restart an allocation, we update the modify index and commit this into raft.
- New raft message type defined: `JobRestartRequestType`.
- Raft apply here.
- The raft log message is applied by the FSM in `nomad/fsm.go`.
- This calls `applyRestartJob`, which decodes the raft message into the RPC struct and calls into the state store: `n.state.UpdateJobRestart` (see the sketch after the log output below).
- State store logic for committing the transaction to boltDB: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/state_store.go#L969-L989
- New boltDB table schema defined: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/schema.go#L101-L120
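For reference, based on the debug output below, the RPC request struct currently looks roughly like this (reconstructed from that output, so the exact field types are approximate):

```go
// nomad/structs/structs.go -- approximate shape, reconstructed from the
// debug output below; exact field types in the patch may differ.
type JobRestartRequest struct {
	ID              string   // UUID identifying this restart operation
	JobID           string   // job being restarted
	BatchSize       int      // allocations to restart per batch
	BatchWait       int      // wait between batches (may end up a duration type)
	Status          string   // e.g. "running"
	RestartedAllocs []string // allocation IDs restarted so far
	StartedAt       time.Time
	UpdatedAt       time.Time
	CreateIndex     uint64
	ModifyIndex     uint64

	WriteRequest // embedded structs.WriteRequest (region, namespace, auth token)
}
```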
Like I mentioned above, this patch might not be ready for review yet. It has a bunch of print statements left in; I added those just to test the flow and see if the data is being passed around correctly. E.g. when I run this on a test job with 5 allocations, I can print some info like:
```
Nov 18 20:08:10 linux nomad[1955]: 2021-11-18T20:08:10.489Z [INFO] client: started client: node_id=3757b276-95bb-1ed4-4fd7-c3169f6d919d
Nov 18 20:08:18 linux nomad[1955]: 2021-11-18T20:08:18.984Z [INFO] client: node registration complete
Nov 18 20:10:06 linux nomad[1955]: HELLO HELLO: nomad/job_endpoint.go JobRestartRequest: &{ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC m=+122.230232254 UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC m=+122.230232321 CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 4089ce63-50e3-c370-54fe-6ab8a90d647e
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 910ed5bc-20dc-c027-1a40-e3d9c5d1eb2a
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: c8d77bff-95bb-b4cb-b1d7-1dc2e8f294ef
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: df98f5ce-b716-8e6e-cd48-55c37cd2d1f1
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: fd9661e3-b167-6916-0350-84da97cae0a2
Nov 18 20:10:06 linux nomad[1955]: HELLO nomad/fsm.go: applyRestartJob
Nov 18 20:10:06 linux nomad[1955]: Hello JobRestartRequest object: {ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: HELLO state/state_store.go: UpdateJobRestart
Nov 18 20:10:06 linux nomad[1955]: HELLO: state/state_store.go: updateJobRestartImpl
```
Questions:

- What would be the best way to restart allocations in the RPC code? (Any code references in the RPC code I could refer to?) Should we signal the restart right before `raftApply`, i.e. for each alloc that we restart, we update the modify index and apply a new log message into raft? My understanding of the indexes is that there are two of them, `CreateIndex` and `ModifyIndex`: the first time we create the job restart object on the server we set the `CreateIndex`, and from that point onwards, every time we restart an alloc, `ModifyIndex` gets updated.
- I am still not 100% clear on the boltDB table schema, since my table only has a single key, e.g. `id`. Does that mean we can just store the entire `JobRestartRequest` object against that key? Could you take a look at the state store code where I am committing the transaction and let me know if I am missing something? (The table-schema pattern I am mirroring is sketched below.)
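For context, the existing tables in `nomad/state/schema.go` follow the go-memdb pattern below, which is what I tried to mirror (the table and field names here are illustrative, not necessarily what the patch uses):

```go
// Illustrative sketch of a memdb table keyed on the restart ID. The "id"
// index only tells memdb how to look an entry up; the value stored in the
// table is the whole object.
func jobRestartTableSchema() *memdb.TableSchema {
	return &memdb.TableSchema{
		Name: "job_restart",
		Indexes: map[string]*memdb.IndexSchema{
			"id": {
				Name:         "id",
				AllowMissing: false,
				Unique:       true,
				Indexer:      &memdb.UUIDFieldIndex{Field: "ID"},
			},
		},
	}
}
```

My assumption is that the state store can then just do `txn.Insert("job_restart", &req)` and the whole `JobRestartRequest` is stored against that index.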
This is probably a lot to read, and if this doesn't make a lot of sense, let me know. We can also carry some of this conversation back to our internal slack. I just wanted to put the patch out so it's easier to look at some code and go from there.
References: Issue #698
Thanks @shishir-a412ed! As you mentioned, this is still a work in progress, so I will convert it to Draft until it's ready for review. I think you will be able to switch it back yourself, but if you can't, just let us know 🙂
Hey @shishir-a412ed, just wanted to check in and see how this was going? Anything blocking?
FYI, I spoke with @shishir-a412ed, who has a great start on this PR but probably won't be able to get around to finishing it.
If anybody internal to Nomad team or external wants to take a swing at finishing this off, feel free. Please let us know if you are doing so, so multiple people don't attempt it at the same time!
Closing this in favour of #16278.
Thank you so much for the great start @shishir-a412ed!
This command caused me some surprise.
When I ran `nomad job restart $JOB_ID`, I expected each alloc in the job to restart one by one without causing issues in our running service; I expected restart to work similarly to update.
But the service was brought down quickly. It looks like the command restarts an alloc without waiting for the restarted one to come back online, and users see errors during the restart. Is there any way for me to make the restart graceful?
Hi @zhixinwen 👋
This is the expected and documented behaviour of this command. From the docs:

> The command waits until the new allocations have client status `ready` before proceeding with the remaining batches. Services health checks are not taken into account.

I thought there was an open issue to track this feature request, but I can't seem to find it. If you don't mind, it could be useful to create one. But implementing it would be a little tricky, since this command runs purely on the machine that called the command, which may not have access to Consul.
> Is there any way for me to make the restart graceful?

The `-batch-wait` flag can help you. If you have a sense of how long it usually takes for the service to become healthy, you could set a fixed wait. A more complex solution could involve using `-batch-wait=ask` and monitoring the service health out-of-band, sending a response to the `job restart` command's stdin once it's healthy.
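For example, with a hypothetical job name and wait value, just to illustrate the flags discussed above:

```sh
# Restart one allocation at a time, waiting 30s between batches.
nomad job restart -batch-size=1 -batch-wait=30s example

# Or pause between batches and continue only after confirming health yourself.
nomad job restart -batch-size=1 -batch-wait=ask example
```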