Add ReconnectModifyIndex to handle reconnect lifecycle

DerekStrickland opened this pull request 2 years ago • 4 comments

Closes #14925

This PR fixes a bug where, if an allocation with max_client_disconnect configured is on a node that disconnects and then reconnects, future jobspec changes for that job are ignored until the max_client_disconnect interval expires. Prior to this change, Allocation.Reconnected naively checked only the time of the last reconnect event against the expiry.
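Roughly, the pre-change check behaved like the following sketch (paraphrased, not the exact upstream code; the field names and signature here are illustrative assumptions):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-in for the allocation's reconnect bookkeeping; these
// field names are assumptions, not the exact upstream struct.
type Allocation struct {
	LastReconnect time.Time // time of the most recent reconnect event
	Expiry        time.Time // when the max_client_disconnect window runs out
}

// Reconnected mirrors the pre-change, time-based check: the alloc reads as
// "reconnected" from the last reconnect event until the expiry passes, so a
// jobspec update landing inside that window was effectively ignored.
func (a *Allocation) Reconnected(now time.Time) (reconnected, expired bool) {
	if a.LastReconnect.IsZero() {
		return false, false
	}
	return true, now.After(a.Expiry)
}

func main() {
	now := time.Now()
	a := &Allocation{
		LastReconnect: now.Add(-time.Minute),
		Expiry:        now.Add(10 * time.Minute),
	}
	fmt.Println(a.Reconnected(now)) // true false: "reconnected" for 10 more minutes
}
```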

This PR:

  • Adds a ReconnectModifyIndex field to the Allocation struct.
  • Updates the alloc runner to set the allocation's ReconnectModifyIndex when a reconnect is processed by the client.
  • Modifies Client.allocSync to send the ReconnectModifyIndex when syncing client-managed attributes.
  • Modifies Node.UpdateAlloc to persist the incoming ReconnectModifyIndex when generating reconnect evals.
  • Renames Allocation.Reconnected to Allocation.IsReconnecting.
  • Refactors Allocation.IsReconnecting to compare the ReconnectModifyIndex to the AllocModifyIndex to determine whether an allocation is reconnecting (see the sketch after this list).
  • Updates all related code to match the new name and test the new logic.
  • Updates GenericScheduler.computeJobAllocs to reset the ReconnectModifyIndex to 0 when processing reconnectUpdates, and appends them to Plan.NodeAllocation so that the updates get persisted.
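The core of the change is that the check becomes an index comparison rather than a clock comparison: the client stamps the allocation with the index at which the reconnect was processed, and the allocation reads as reconnecting only until the servers modify it again. A minimal sketch of that intent (simplified types, not the full PR code):

```go
package main

import "fmt"

// Simplified Allocation carrying only the two indexes the new check compares;
// a minimal sketch of the PR's intent, not the full upstream struct.
type Allocation struct {
	// AllocModifyIndex advances whenever the servers update the allocation,
	// e.g. when a new jobspec version is applied.
	AllocModifyIndex uint64
	// ReconnectModifyIndex is stamped by the client when it processes a
	// reconnect and synced to the servers via Client.allocSync.
	ReconnectModifyIndex uint64
}

// IsReconnecting replaces the old time-based Reconnected check: the alloc is
// reconnecting only while the reconnect is the latest modification recorded.
func (a *Allocation) IsReconnecting() bool {
	return a.ReconnectModifyIndex != 0 && a.ReconnectModifyIndex >= a.AllocModifyIndex
}

func main() {
	a := &Allocation{AllocModifyIndex: 100, ReconnectModifyIndex: 100}
	fmt.Println(a.IsReconnecting()) // true: the reconnect is the latest change

	// Once the scheduler processes the reconnectUpdate it resets the field
	// to 0 (and any later server-side update bumps AllocModifyIndex), so
	// subsequent jobspec changes are evaluated normally again.
	a.ReconnectModifyIndex = 0
	fmt.Println(a.IsReconnecting()) // false
}
```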

DerekStrickland avatar Oct 18 '22 20:10 DerekStrickland

Per our discussion, moving this out of 1.4.2 so that we don't risk rushing it out.

tgross avatar Oct 24 '22 14:10 tgross

Passing in some feedback from a customer. I think it might be related to this underlying issue since it is max_client_disconnect-related, but I am not sure.

I found a scenario where I see duplicated ALLOC_INDEXes within one job. Below are the reproduction steps (Nomad 1.3.5):

We have a job with count=2 and the max_client_disconnect option set (a jobspec sketch follows below), running as:

  • NOMAD_ALLOC_INDEX=0 - NodeA
  • NOMAD_ALLOC_INDEX=1 - NodeB
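For reference, a jobspec of the shape being described might look like this (a minimal sketch; the job name, driver, image, and disconnect duration are illustrative, not from the report):

```hcl
job "example" {
  group "app" {
    # Two instances, so the scheduler hands out NOMAD_ALLOC_INDEX 0 and 1.
    count = 2

    # Keep allocations on a disconnected node resumable if the node
    # reconnects within this window (group-level option, Nomad 1.3+).
    max_client_disconnect = "10m"

    task "app" {
      driver = "docker"

      config {
        image = "nginx:1.23"
      }
    }
  }
}
```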

We stop the Nomad agent on NodeA; after that we temporarily have 3 allocations:

  • NOMAD_ALLOC_INDEX=0 - NodeA (unknown state)
  • NOMAD_ALLOC_INDEX=1 - NodeB
  • NOMAD_ALLOC_INDEX=0 - NodeC

When we start the agent on NodeA, we again have 2 allocations; the first was recovered:

  • NOMAD_ALLOC_INDEX=0 - NodeA
  • NOMAD_ALLOC_INDEX=1 - NodeB
  • NodeC - allocation terminated here, as expected

I change count from 2 to 3, and the 3rd allocation appears with the same ALLOC_INDEX as the first one!

  • NOMAD_ALLOC_INDEX=0 - NodeA
  • NOMAD_ALLOC_INDEX=1 - NodeB
  • NOMAD_ALLOC_INDEX=0 - NodeD

Apart from that, while in the state with duplicated ALLOC_INDEXes, the exec feature in the Nomad UI stopped working properly (for the affected job only).

Does this seem related or should I make a new issue?

mikenomitch avatar Oct 27 '22 18:10 mikenomitch

I think I understand the problem now 😅

I have an alternative approach in https://github.com/hashicorp/nomad/pull/15068 that I think makes the disconnect/reconnect flows more similar and so easier to understand, but it's still early work. I will keep investigating the problem to see which solution would be better.

lgfa29 avatar Oct 27 '22 21:10 lgfa29

@mikenomitch I think it may be related to this problem. From our docs on NOMAD_ALLOC_INDEX:

The index is unique within a given version of a job

I think #14925 may prevent the job version from changing, which means you could end up with reused indexes.

But it may be better to open a separate issue just in case. If it's the same problem we can close both issues.

lgfa29 avatar Oct 27 '22 22:10 lgfa29

Closing this in favour of #15068.

Thanks for all the work and guidance on this issue @DerekStrickland!

lgfa29 avatar Nov 02 '22 02:11 lgfa29

@lgfa29 I'm glad you found a good solution!

DerekStrickland avatar Nov 02 '22 09:11 DerekStrickland

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Mar 03 '23 02:03 github-actions[bot]