Add ReconnectModifyIndex to handle reconnect lifecycle
Closes #14925
This PR fixes a bug where, if an allocation with `max_client_disconnect` configured is on a node that disconnects and then reconnects, future jobspec changes for that job are ignored until the `max_client_disconnect` interval expires. Prior to this change, `Allocation.Reconnected` naively checked only the last reconnect event time and the expiry.
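For context, here is a rough sketch of that kind of time-based check; the function and parameter names are hypothetical and do not reflect the actual Nomad code:

```go
package structs // illustrative placeholder package, not the real Nomad package

import "time"

// reconnectedNaive is a hypothetical sketch of the pre-change, time-based
// check: an allocation counts as "reconnected" for the entire
// max_client_disconnect window, which is why later jobspec changes were
// effectively ignored until the window expired.
func reconnectedNaive(lastReconnect time.Time, maxClientDisconnect time.Duration, now time.Time) bool {
	if lastReconnect.IsZero() {
		return false
	}
	expiry := lastReconnect.Add(maxClientDisconnect)
	return now.Before(expiry)
}
```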
This PR:
- Adds a `ReconnectModifyIndex` field to the `Allocation` struct
- Updates the alloc runner to update the alloc `ReconnectModifyIndex` when a reconnect is processed by the client
- Modifies `Client.allocSync` to send the `ReconnectModifyIndex` when syncing client-managed attributes
- Modifies `Node.UpdateAlloc` to persist the incoming `ReconnectModifyIndex` when generating reconnect evals
- Renames `Allocation.Reconnected` to `Allocation.IsReconnecting`
- Refactors `Allocation.IsReconnecting` to compare the `ReconnectModifyIndex` to the `AllocModifyIndex` to determine if an allocation is reconnecting (see the sketch after this list)
- Updates all related code to match the new name and test the new logic
- Updates `GenericScheduler.computeJobAllocs` to reset the `ReconnectModifyIndex` to `0` when processing `reconnectUpdates` and appends them to `Plan.NodeAllocation` so that the updates get persisted
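As a rough illustration of the new lifecycle, below is a minimal sketch of the index comparison and the scheduler-side reset described above. The field and method names follow the PR description, but the types and bodies are simplified assumptions, not the actual Nomad implementation:

```go
package structs // illustrative placeholder, not the real Nomad package

// Allocation carries only the two fields relevant to this sketch.
type Allocation struct {
	// AllocModifyIndex is bumped whenever the servers modify the allocation.
	AllocModifyIndex uint64
	// ReconnectModifyIndex is set by the client when it processes a reconnect
	// and synced back to the servers (via allocSync / Node.UpdateAlloc in the PR).
	ReconnectModifyIndex uint64
}

// IsReconnecting reports whether the client recorded a reconnect that the
// scheduler has not yet acknowledged, instead of relying on event
// timestamps and the max_client_disconnect expiry.
func (a *Allocation) IsReconnecting() bool {
	return a.ReconnectModifyIndex != 0 &&
		a.ReconnectModifyIndex >= a.AllocModifyIndex
}

// acknowledgeReconnect mimics the scheduler step from the PR description:
// reset ReconnectModifyIndex to 0 while processing reconnectUpdates so the
// updated allocation can be appended to the plan and persisted.
func acknowledgeReconnect(a Allocation) Allocation {
	a.ReconnectModifyIndex = 0
	return a
}
```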
Per our discussion, moving this out of 1.4.2 so that we don't risk rushing it out.
Passing along some feedback from a customer. I think it might be related to this underlying issue since it is `max_client_disconnect` related, but I am not sure.
I found a scenario where I see duplicated ALLOC_INDEXes in one JOB. Below are the steps to reproduce (Nomad 1.3.5):

1. We have a job with `count = 2` and `max_client_disconnect` set, running as below:
   - NOMAD_ALLOC_INDEX=0 - NodeA
   - NOMAD_ALLOC_INDEX=1 - NodeB
2. We stop the Nomad agent on NodeA; after that we temporarily have 3 allocations:
   - NOMAD_ALLOC_INDEX=0 - NodeA (Unknown state)
   - NOMAD_ALLOC_INDEX=1 - NodeB
   - NOMAD_ALLOC_INDEX=0 - NodeC
3. When we start the agent on NodeA, we again have 2 allocations; the first one was recovered:
   - NOMAD_ALLOC_INDEX=0 - NodeA
   - NOMAD_ALLOC_INDEX=1 - NodeB
   - NodeC - allocation terminated here, as expected
4. I change `count` from 2 to 3, and the 3rd allocation appears with the same ALLOC_INDEX as the first one!
   - NOMAD_ALLOC_INDEX=0 - NodeA
   - NOMAD_ALLOC_INDEX=1 - NodeB
   - NOMAD_ALLOC_INDEX=0 - NodeD

Apart from that, in the state with duplicated ALLOC_INDEXes, the EXEC feature in the Nomad UI stopped working properly (for the affected JOB only).
Does this seem related or should I make a new issue?
I think I understand the problem now 😅
I have an alternative approach in https://github.com/hashicorp/nomad/pull/15068 that I think makes the disconnect/reconnect flows more similar and so easier to understand, but it's still early work. I will keep investigating the problem to see which solution would be better.
@mikenomitch I think this may be related to this problem. From our docs on `NOMAD_ALLOC_INDEX`:

> The index is unique within a given version of a job
I think #14925 may prevent the job version from changing, which means you could end up with reused indexes.
But it may be better to open a separate issue just in case. If it's the same problem we can close both issues.
Closing this in favour of #15068.
Thanks for all the work and guidance on this issue @DerekStrickland!
@lgfa29 I'm glad you found a good solution!
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.