
Nomad job restart

shishir-a412ed opened this issue 3 years ago · 4 comments

@schmichael Thank you for writing the nomad-job-restart-rfc. This was super helpful!

This PR is not ready for review. I am just opening it so I can write down my understanding of the flow and ask some questions. We can use this PR as a brainstorming ground to make progress on the RFC and eventually get it into shape to be merged.

Based on nomad/checklist-command, this PR only has changes from the Code section; no docs changes are included right now. The PR does not yet implement the monitor (and detach) modes, where job restart starts in monitor mode by default and the user can pass -detach to run the command in detach mode.

In Phase 1, we'll just implement:

  • nomad job restart -batch-size n -batch-wait d <job_id>
  • nomad job restart -attach <restart_id>
  • nomad job restart -cancel <restart_id>

Pause/resume can come in Phase 2.
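
To make that flag surface concrete, here is a minimal, self-contained sketch using the standard library flag package. The real command would use Nomad's CLI plumbing rather than a bare FlagSet, so all names and defaults here are illustrative assumptions, not the actual implementation:

    package main

    import (
        "flag"
        "fmt"
    )

    func main() {
        // Hypothetical Phase 1 flag surface; defaults are assumptions.
        fs := flag.NewFlagSet("job restart", flag.ExitOnError)
        batchSize := fs.Int("batch-size", 1, "allocations to restart per batch")
        batchWait := fs.Duration("batch-wait", 0, "time to wait between batches")
        attach := fs.String("attach", "", "restart ID to attach to")
        cancel := fs.String("cancel", "", "restart ID to cancel")

        // Simulated invocation: nomad job restart -batch-size 2 -batch-wait 10s count
        fs.Parse([]string{"-batch-size", "2", "-batch-wait", "10s", "count"})
        fmt.Printf("job=%s batch-size=%d batch-wait=%s attach=%q cancel=%q\n",
            fs.Arg(0), *batchSize, *batchWait, *attach, *cancel)
    }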

My understanding of the changes is as follows:

  • New CLI Command: nomad job restart

  • From the CLI, you jump to HTTP API client code in api/jobs.go.

  • From the HTTP API client, you call Restart, which calls into the actual HTTP API defined in command/agent/job_endpoint.go.

  • The jobRestart function in command/agent/job_endpoint.go decodes the HTTP request into the API struct api.JobRestartRequest, which is then mapped into the RPC struct structs.JobRestartRequest before sending the RPC (see the endpoint sketch after this list).

  • The Restart RPC is called. This checks permissions on the auth token, gets a snapshot from the state store, gets the allocations for the job, and restarts all the allocations (restarting of allocations is not coded yet). Every time we restart an allocation, we update the modify index and commit this into Raft (see the RPC and FSM sketch after this list).

  • New Raft message type defined: JobRestartRequestType

  • Raft apply here

  • Raft log message applied by the FSM in nomad/fsm.go

  • This calls applyRestartJob, which decodes the Raft message into the RPC struct and calls into the state store: n.state.UpdateJobRestart

  • State store logic for committing the transaction to boltDB: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/state_store.go#L969-L989

  • New boltDB table schema defined: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/schema.go#L101-L120
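
To make the flow above concrete, here is a rough sketch of how the agent endpoint could map the API struct to the RPC, loosely following the other handlers in command/agent/job_endpoint.go. The helpers (decodeBody, CodedError, parseWriteRequest) follow that file's patterns, but the exact shape of this handler is illustrative, not the exact diff:

    // Hypothetical handler sketch; mirrors the other job endpoints.
    func (s *HTTPServer) jobRestart(resp http.ResponseWriter, req *http.Request,
        jobID string) (interface{}, error) {

        // Decode the HTTP body into the api-level struct.
        var restartRequest api.JobRestartRequest
        if err := decodeBody(req, &restartRequest); err != nil {
            return nil, CodedError(400, err.Error())
        }

        // Map the api struct onto the RPC struct.
        args := structs.JobRestartRequest{
            JobID:     jobID,
            BatchSize: restartRequest.BatchSize,
            BatchWait: restartRequest.BatchWait,
        }
        s.parseWriteRequest(req, &args.WriteRequest)

        // Forward to the server over RPC.
        var out structs.JobRestartResponse
        if err := s.agent.RPC("Job.Restart", &args, &out); err != nil {
            return nil, err
        }
        return out, nil
    }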
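
And on the server side, a rough sketch of how the Restart RPC could commit to Raft and how the FSM applies the log entry, modeled on the existing handlers in nomad/job_endpoint.go and nomad/fsm.go. The ACL check and the per-allocation restart logic that still need to be written are elided, and everything here is illustrative:

    // Hypothetical server-side RPC sketch.
    func (j *Job) Restart(args *structs.JobRestartRequest,
        reply *structs.JobRestartResponse) error {

        if done, err := j.srv.forward("Job.Restart", args, args, reply); done {
            return err
        }
        // ACL check, state snapshot, and per-alloc restarts go here.

        // Commit the restart record to Raft; the FSM applies it below.
        _, index, err := j.srv.raftApply(structs.JobRestartRequestType, args)
        if err != nil {
            return err
        }
        reply.Index = index
        return nil
    }

    // FSM side (nomad/fsm.go): decode the log entry and hand it to the
    // state store, mirroring the other apply* handlers.
    func (n *nomadFSM) applyRestartJob(buf []byte, index uint64) interface{} {
        var req structs.JobRestartRequest
        if err := structs.Decode(buf, &req); err != nil {
            panic(fmt.Errorf("failed to decode request: %v", err))
        }
        if err := n.state.UpdateJobRestart(index, &req); err != nil {
            n.logger.Error("UpdateJobRestart failed", "error", err)
            return err
        }
        return nil
    }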

Like I mentioned above, this patch might not be ready for review yet. It has a bunch of print statements left in; I added those just to test the flow and see if the data is being passed around correctly. E.g., when I run this on a test job with 5 allocations, I can print some info like:

Nov 18 20:08:10 linux nomad[1955]:     2021-11-18T20:08:10.489Z [INFO]  client: started client: node_id=3757b276-95bb-1ed4-4fd7-c3169f6d919d
Nov 18 20:08:18 linux nomad[1955]:     2021-11-18T20:08:18.984Z [INFO]  client: node registration complete
Nov 18 20:10:06 linux nomad[1955]: HELLO HELLO: nomad/job_endpoint.go JobRestartRequest: &{ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC m=+122.230232254 UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC m=+122.230232321 CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 4089ce63-50e3-c370-54fe-6ab8a90d647e
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 910ed5bc-20dc-c027-1a40-e3d9c5d1eb2a
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: c8d77bff-95bb-b4cb-b1d7-1dc2e8f294ef
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: df98f5ce-b716-8e6e-cd48-55c37cd2d1f1
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: fd9661e3-b167-6916-0350-84da97cae0a2
Nov 18 20:10:06 linux nomad[1955]: HELLO nomad/fsm.go: applyRestartJob
Nov 18 20:10:06 linux nomad[1955]: Hello JobRestartRequest object: {ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: HELLO state/state_store.go: UpdateJobRestart
Nov 18 20:10:06 linux nomad[1955]: HELLO: state/state_store.go: updateJobRestartImpl

Questions:

  1. What would be the best way to restart allocations in the RPC code? (Any existing RPC code I could refer to?) Should we signal the restart right before raftApply, i.e. for each alloc that we restart, update the modify index and apply a new log message to Raft? My understanding of the indexes is that there are two, CreateIndex and ModifyIndex: the first time we create the JobRestartRequest object on the server, we'll set the CreateIndex, and from that point on, every time we restart an alloc, the ModifyIndex will be updated.

  2. I am still not 100% clear on the boltDB table schema, since my table only has a single key (id). Does that mean we can just store the entire JobRestartRequest object against that key? Could you take a look at the state store code where I am committing the transaction and let me know if I am missing something?
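
(For reference on question 2: the table in schema.go is a go-memdb schema, the in-memory store that gets snapshotted into BoltDB, and go-memdb stores whole objects against the index key; the indexes only describe how objects are looked up. So a table like the following hedged sketch would hold the entire JobRestartRequest; the names are illustrative, not necessarily what is in the diff.)

    // Hypothetical memdb table schema, following the pattern of the
    // other tables in nomad/state/schema.go.
    func jobRestartTableSchema() *memdb.TableSchema {
        return &memdb.TableSchema{
            Name: "job_restarts",
            Indexes: map[string]*memdb.IndexSchema{
                // Primary index on the restart ID; the full
                // JobRestartRequest struct is stored as the value.
                "id": {
                    Name:         "id",
                    AllowMissing: false,
                    Unique:       true,
                    Indexer:      &memdb.UUIDFieldIndex{Field: "ID"},
                },
            },
        }
    }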

This is probably a lot to read, and if it doesn't make a lot of sense, let me know. We can also carry some of this conversation back to our internal Slack. I just wanted to put the patch out so it's easier to look at some code and go from there.

References: Issue #698

shishir-a412ed avatar Nov 18 '21 22:11 shishir-a412ed

Thanks @shishir-a412ed! As you mentioned, this is still a work in progress, so I will convert it to a draft until it's ready for review. I think you will be able to switch it back yourself, but if you can't, just let us know 🙂

lgfa29 avatar Dec 21 '21 18:12 lgfa29

Hey @shishir-a412ed, just wanted to check in and see how this is going. Anything blocking?

mikenomitch avatar Feb 01 '22 19:02 mikenomitch


FYI, I spoke with @shishir-a412ed, who has a great start on this PR but probably won't be able to get around to finishing it.

If anybody internal or external to the Nomad team wants to take a swing at finishing it off, feel free. Please let us know if you do, so multiple people don't attempt it at the same time!

mikenomitch avatar Oct 19 '22 18:10 mikenomitch

Closing this in favour of #16278.

Thank you so much for the great start @shishir-a412ed!

lgfa29 avatar Mar 01 '23 01:03 lgfa29

This command surprised me. When I ran nomad job restart $JOB_ID, I expected each alloc in the job to restart one by one without causing issues in our running service. I expected restart to work similarly to update.

But the service was brought down quickly; it looks like the command restarts allocs without waiting for the restarted ones to come back online, and users see errors during the restart. Is there any way for me to make restart graceful?

zhixinwen avatar Jun 21 '24 21:06 zhixinwen

Hi @zhixinwen 👋

This is the expected and documented behaviour of this command. From the docs:

The command waits until the new allocations have client status ready before proceeding with the remaining batches. Services health checks are not taken into account.

I thought there was an open issue tracking this feature request, but I can't seem to find it. If you don't mind, it could be useful to create one. Implementing it would be a little tricky, though, since this command runs purely on the machine that invoked it, which may not have access to Consul.

Is there any way for me to make restart graceful?

The -batch-wait flag can help you. If you have a sense of how long it usually takes for the service to become healthy, you could set a fixed wait. A more involved approach could use -batch-wait=ask and monitor the service's health out-of-band, sending a response to the job restart command's stdin once it's healthy.
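
For example (values illustrative):

    nomad job restart -batch-size=1 -batch-wait=30s <job_id>

or, to gate each batch manually:

    nomad job restart -batch-size=1 -batch-wait=ask <job_id>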

lgfa29 avatar Jun 21 '24 23:06 lgfa29