
Fix: libpacemaker: Don't shuffle anonymous clone instances unnecessarily

Open nrwahl2 opened this issue 3 years ago • 22 comments

Fix: libpacemaker: Don't shuffle anonymous clone instances

Currently, anonymous clone instances may be shuffled under certain conditions, causing unnecessary resource downtime when an instance is moved away from the node where it is already running.

For example, this can happen when a stopped promotable instance is scheduled to promote and the stickiness is lower than the promotion score (see the promotable-anon-recover-promoted test). Instance 0 gets allocated first and goes to the node that will be promoted, causing it to relocate if it's already running somewhere else.

There are also some other corner cases that can trigger shuffling, like the one in the clone-anon-no-shuffle-constraints test.

My solution is to wait until all instance allocations are finished, and then to swap allocations to keep as many instances as possible on their current running nodes.
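
Roughly, the idea looks like the following sketch (not the patch itself -- swap_allocations() is the function from the patch, but the surrounding pairing logic here is illustrative only): after allocation, pair up any instance that was allocated away from its current node with whichever instance got allocated to that node, and swap the two.

    #include <crm/pengine/status.h>     // pe_resource_t, pe_node_t

    // Sketch only: after all instances are allocated, let an instance that was
    // allocated away from its current node trade allocations with whichever
    // instance was allocated to that node.
    static void swap_allocations(pe_resource_t *rsc1, pe_resource_t *rsc2);   // in the patch

    static void
    keep_instances_on_current_nodes(GList *instances)
    {
        for (GList *iter1 = instances; iter1 != NULL; iter1 = iter1->next) {
            pe_resource_t *rsc1 = iter1->data;
            pe_node_t *current = (rsc1->running_on != NULL)?
                                 (pe_node_t *) rsc1->running_on->data : NULL;

            if ((current == NULL) || (rsc1->allocated_to == NULL)
                || (rsc1->allocated_to->details == current->details)) {
                continue;   // not running, not allocated, or already staying put
            }
            for (GList *iter2 = instances; iter2 != NULL; iter2 = iter2->next) {
                pe_resource_t *rsc2 = iter2->data;

                // rsc2 was allocated to rsc1's current node: swap, so rsc1
                // stays where it is and rsc2 takes rsc1's old target
                if ((rsc2 != rsc1) && (rsc2->allocated_to != NULL)
                    && (rsc2->allocated_to->details == current->details)) {
                    swap_allocations(rsc1, rsc2);
                    break;
                }
            }
        }
    }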

This commit also updates several tests that contain unnecessary instance moves.

I'm opening this up as a draft first for feedback. Details in the comments.

Resolves: RHBZ#1931023

nrwahl2 avatar Feb 26 '21 09:02 nrwahl2

@kgaillot

Regression tests pass, including the new ones I created (but note that a couple of existing ones with buggy behavior had to be updated). This patch does what it needs to do in the cases I've tested.

I've got a few uncertainties though:

  • Do we need to swap anything else in swap_allocations()? I listed some attributes that may need to be swapped but that didn't cause any problems in testing. I'm particularly concerned about rsc_cons_lhs, rsc_cons, known_on, and (for clones of groups) children. Can you think of any cases for which any of these would need to be swapped?
  • Is rsc->allowed_nodes guaranteed to contain the same list of nodes (not identical pe_node_t objects, but a list whose members have the same details)? For example, if instance:0 has node1->node2->node3 as its allowed_nodes list, must instance:1 have node1->node2->node3? I assumed the answer is "yes," and that simplified the swap implementation (see the sketch after this list). I swapped allowed_nodes to swap all the weights in one fell swoop, in addition to anything else that might be needed from the objects in those lists.
  • Let me know if I missed any opportunities to use existing helper functions.
  • Do any special considerations for bundles come to mind? I only tested bundles insofar as they're covered by the existing regression tests.
  • Same question but for clones of groups.
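
To make the allowed_nodes question above concrete: assuming every instance's allowed_nodes is a hash table of its own node copies keyed by node ID, the swap itself is just exchanging the two tables. A minimal sketch (not the exact patch code):

    #include <crm/pengine/status.h>

    // Sketch: exchange two anonymous instances' node-weight tables so the
    // weights follow the swapped allocations (e.g., so promotion scoring in
    // pcmk__set_instance_roles() still lands on the intended node).
    static void
    swap_allowed_nodes(pe_resource_t *rsc1, pe_resource_t *rsc2)
    {
        GHashTable *tmp = rsc1->allowed_nodes;

        rsc1->allowed_nodes = rsc2->allowed_nodes;
        rsc2->allowed_nodes = tmp;
    }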

Not directly related to this patch: I also noticed that native_assign_node() increments count while native_deallocate() doesn't decrement it. Wonder if that's caused any problems that we haven't identified.

EDIT: I did do some testing with promotable clones of groups, and they behaved as expected even before the fix. Can't promise that would be true in all situations.

nrwahl2 avatar Feb 26 '21 09:02 nrwahl2

The reason Jenkins failed is that there are now too many tests in cts/scheduler, and make distdir-am fails with "Argument list too long". I'm not sure how to approach that. I guess the make targets would require some tuning.

nrwahl2 avatar Feb 26 '21 09:02 nrwahl2

The scope of this grew beyond what I had planned. The "Fix: libpacemaker: Don't shuffle anonymous clone instances" commit (currently 7b0737e) is my main goal.

The "Enclose dup_file_path() in SUPPORT_NAGIOS guard" commit (currently 82ac086) addresses a Jenkins build failure on Fedora 32.

The huge "Build, Test: Scheduler: Split up cts/scheduler directory" commit (currently 4becd5d) addresses an "Argument list too long" issue that caused a bunch of builds to fail, due to the increased number of tests in the cts/scheduler directory after I added the new ones. Most of the changes are just from moving files to new directories.

I'm not at all tied to this approach. If you'd rather fix the argument issue in a different way, I'm all ears :)

nrwahl2 avatar Mar 01 '21 06:03 nrwahl2

Everything through splitting up the scheduler directory looks good; let's do those in a separate PR to get those fixes out quickly and be able to focus this one better.

Don't bother with the output formatting change, that will be taken care of at release time.

kgaillot avatar Mar 01 '21 17:03 kgaillot

I opened #2314 for those first ones. For the time being, I kept those commits in this PR too so that I don't have to update the changes to the test outputs. I can remove them later.

I removed the output formatting change and re-pushed.

nrwahl2 avatar Mar 01 '21 20:03 nrwahl2

Excellent analysis. The problem arises because:

  1. A clone's rsc->children are numbered sequentially when they are created (by clone_unpack()).
  2. Anonymous clone instances that are active (or inactive with a pending action) are assigned one of the rsc->children instances (and thus an instance number) and their current node according to the order the node history is encountered in the status section (by unpack_lrm_resource() -> unpack_find_resource() -> find_anonymous_clone()).
  3. Anonymous clone instances (and thus instance numbers) are allocated to their next node according to node scores (lowest instance number to highest node score) (by pcmk__clone_allocate() -> distribute_children() -> allocate_instance()).
  4. The first N instances of a promotable clone are chosen for promotion (by pcmk__set_instance_roles()).

1&2 are more or less a necessity, and 3&4 work together to ensure that the priorities defined by scores are respected.

I don't think we can swap allocations after 3 without risking all sorts of breakage.

This is a half-baked idea at this point, but I'm wondering if we can reorder rsc->children for some steps. Currently the order of rsc->children always matches the numeric order of instances, but I'm not sure that's important to anything -- maybe we could sort rsc->children at some step, and then use the first N children rather than the first N instance numbers. In other words, currently stateful:0 is allocated to node2 because it is the first member of rsc->children and node2 has the highest node score; but what if we could sort rsc->children so that stateful:2 is first, so that it gets allocated to node2 instead? The questions are how the sort should be done, and whether changing the order would affect anything else.
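
To illustrate what I mean (purely a sketch -- the right criteria are exactly the open question, and current_node_score() here is a hypothetical helper): something like a comparator that puts instances whose current node carries a higher weight at the front of rsc->children.

    #include <crm/crm.h>                // INFINITY score macro
    #include <crm/pengine/status.h>     // pe_resource_t, pe_node_t

    // Hypothetical helper: an instance's weight on its current node, per its
    // own allowed_nodes table (-INFINITY if it isn't running anywhere)
    static int
    current_node_score(const pe_resource_t *instance)
    {
        pe_node_t *current = NULL;
        pe_node_t *allowed = NULL;

        if (instance->running_on == NULL) {
            return -INFINITY;
        }
        current = instance->running_on->data;
        allowed = g_hash_table_lookup(instance->allowed_nodes,
                                      current->details->id);
        return (allowed != NULL)? allowed->weight : -INFINITY;
    }

    // Illustrative sort for rsc->children: instances running on higher-weight
    // nodes come first, so they would be allocated (and promoted) first
    static gint
    cmp_instance_by_current_score(gconstpointer a, gconstpointer b)
    {
        return current_node_score(b) - current_node_score(a);   // descending
    }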

kgaillot avatar Mar 11 '21 22:03 kgaillot

Pardon the length:

The first N instances of a promotable clone are chosen for promotion (by pcmk__set_instance_roles()).

Yep, although I do want to emphasize that this issue doesn't occur only with promotable clones. It's just easier to reproduce with promotable clones. I'm not sure about the conditions to reproduce it with non-promotable clones, aside from the clone-anon-no-shuffle-constraints test that I provided.

So any solution that's based in pcmk_sched_promotable.c won't solve the entire problem.


I don't think we can swap allocations after 3 without risking all sorts of breakage.

I'm concerned about that also. Everything that's been tested works fine, which is a good sign but not conclusive -- e.g., nesting of resources or a lot of interwoven constraints could still produce unexpected behavior after a swap.

With that being said, only anonymous clone instances are under discussion, so it shouldn't make any difference which instance ultimately goes to which node, as long as each node has the same number of instances allocated to it both before and after all the swapping. The swap takes place after allocation is finished, and we're not messing with anything that hasn't been allocated. We only consider swapping two instances if both of them have already been allocated, and rsc1 has been allocated to rsc2's current node (or vice versa).

Swapping allowed_nodes swaps the node weights for each instance, and I would expect that each allocated instance must have a copy of the same set of nodes in its allowed_nodes member. If not, then we could iterate over allowed_nodes for each instance and swap if the nodes match -- but that's more complicated than swapping the whole allowed_nodes object. Swapping the node weights ought to keep the scores in alignment with the node assignments when we next call assign_node(), which wipes out the old allocation from both the instance and the node and then sets the new one. To your point ("3&4 work together to ensure that the priorities defined by scores are respected"), swapping the node weights is what prevents step 4 from breaking (in pcmk__set_instance_roles()). Before I swapped node weights, the wrong node was getting the promoted instance.

My line of thinking is that since these instances are anonymous, if there is any risk in swapping their allocations after step 3, we can avoid that risk by swapping some set of their member variables. We're currently swapping only allowed_nodes, but I hope that any other issues could be remedied by swapping other objects too. Still, this is the main thing that worries me -- are we not swapping some member that we should be swapping? Conversely, it may be dangerous to swap members that don't clearly need to be swapped.

With all of that said, I was originally interested in doing this in allocate_instance() so that no swapping is required later. I can take another look at that if you think it's worthwhile and possible. As far as I could tell it wasn't.

It seems impossible to proactively allocate instances to their current nodes, if it can't already be done with the existing pre-allocation logic -- at least not without a major change. We can only do it retroactively, after all the instances have been allocated and we know which nodes are going to get how many instances. We don't know in advance whether an instance's current node is going to be allocated an instance when all is said and done.


This is a half-baked idea at this point, but I'm wondering if we can reorder rsc->children for some steps. ...

Hmm, it's worth considering. It could be simpler. I'm not sure right away exactly how we would do that (what sort criteria get us the order we need, etc.), whether it could solve the whole problem, or (like you said) what side effects it would have either.

I also wonder if there could be any negative interactions from that if there are fewer instances allocated post-transition than pre-transition. Maybe that part could be solved by ensuring all the allocated instances are contiguous at the front of the list before sorting further.

sort_clone_instance(), which gets called before distribute_children(), already tries to do something like this:

    /* allocation order:
     *  - active instances
     *  - instances running on nodes with the least copies
     *  - active instances on nodes that can't support them or are to be fenced
     *  - failed instances
     *  - inactive instances
     */

But that ordering is effectively negated by the time we reach the situation that's described in this issue (i.e., if pre-allocation can't give the instances to their current nodes). And for non-promotable clones, I think we're finished after distribute_children(). So I'm not sure we can fix the non-promotable case by re-ordering children.

Even for the promotable case, where we still have pcmk__set_instance_roles() to work with, I'm concerned that we would ultimately face a "node weight" issue again despite the re-ordering, unless we swap the node weights like I'm doing in swap_allocations().

nrwahl2 avatar Mar 21 '21 05:03 nrwahl2

Edit to the below: Nah, not looking super promising due to side effects. Scrap that.

Another half-baked idea: Fool Pacemaker into thinking that an anonymous instance that's been allocated to a node is already running on that node. That way it doesn't schedule a move to place it there. This would require manipulating rsc->running_on and node->details->running_rsc.

Take the instance off the running list of a node where it's currently running (and take the node off the instance's running_on), and push it onto the running list of the node it's been allocated to. The trick would be ensuring that every node ends up with the same number of instances "running" there both before and after the swaps.

I don't know if this would even work or not, and IMO it's harder to reason about than swapping the allocations. But perhaps it would end up with fewer variables needing reassignment.


Either way, if we swap anything at all, we ought to be careful with nested children (e.g., a clone where each instance is a group, whose children in turn are the natives). I re-pushed with a slightly modified swap_allocations() function that crawls through the tree of children, if any exist, and swaps the allowed_nodes for each pair of children.
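
In sketch form (not the patch itself, and assuming the two instances' children lists pair up positionally, e.g. group member N of one instance with group member N of the other):

    #include <crm/pengine/status.h>

    // Sketch: swap allowed_nodes between two instances and, recursively,
    // between each corresponding pair of their children (clones of groups,
    // bundles, and so on)
    static void
    swap_allowed_nodes_recursive(pe_resource_t *rsc1, pe_resource_t *rsc2)
    {
        GHashTable *tmp = rsc1->allowed_nodes;

        rsc1->allowed_nodes = rsc2->allowed_nodes;
        rsc2->allowed_nodes = tmp;

        for (GList *iter1 = rsc1->children, *iter2 = rsc2->children;
             (iter1 != NULL) && (iter2 != NULL);
             iter1 = iter1->next, iter2 = iter2->next) {

            swap_allowed_nodes_recursive(iter1->data, iter2->data);
        }
    }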

nrwahl2 avatar Mar 21 '21 22:03 nrwahl2

I don't know how I forgot about sort_clone_instance(); that's the idea I had in mind, so it would be changing the criteria there. Whether that's feasible or not I don't know.

My gut feeling is it has to be possible to handle this at the sort and/or distribute steps when allocating, so the right instance number gets allocated to the right node. Even if we have to track new info in the instances' pe_resource_t (like we already do with clone_name).

kgaillot avatar Mar 22 '21 20:03 kgaillot

My gut feeling is it has to be possible to handle this at the sort and/or distribute steps when allocating, so the right instance number gets allocated to the right node. Even if we have to track new info in the instances' pe_resource_t (like we already do with clone_name).

I want it to be possible, and it very well may be. It just seems like the correct decision about which instance should go where, depends on the final result of the allocation (which nodes get how many instances).

For example, in the promotable-anon-recover-promoted test, I don't think we know that stateful:2 should be the instance to go to node2 (where stateful is currently stopped) until we've determined that node1 and node3-rem are each going to receive one instance. Then we know that stateful:0 and stateful:1, which are already running on those two nodes, should stay put. We shouldn't send stateful:0 to node2 based on promotion score. But in the absence of that knowledge, stateful:0 will go to node2 because it gets allocated first and node2 has the highest score.

It seems that determining the ultimately correct sort order in advance would require doing a dry-run of most of the allocation process. Which might work but seems like a non-trivial amount of extra work for the scheduler.

I'll see what I can figure out. There may be a way to handle some or all of this via sort order, and it will come to me later. It might involve tracking new info like you mentioned, and/or using a different sort scheme for pre-allocation versus the main allocation. In the meantime I'm doing a little bit of refactoring of sort_clone_instance for readability, for my own benefit. I'm also not totally convinced that it's 100% adhering to the allocation order that it advertises in the comments...

nrwahl2 avatar Mar 23 '21 04:03 nrwahl2

BTW what is node->count meant to keep track of? It seems to be used similarly to node->details->num_resources, but not quite.

nrwahl2 avatar Mar 23 '21 04:03 nrwahl2

It seems that determining the ultimately correct sort order in advance would require doing a dry-run of most of the allocation process. Which might work but seems like a non-trivial amount of extra work for the scheduler.

That might not be out of the question. We do something similar when taking constraint dependencies' preferences into account (the "Rolling back" stuff). It would only be extra work in the rare cases where the first allocation is suboptimal.

I'll see what I can figure out. There may be a way to handle some or all of this via sort order, and it will come to me later. It might involve tracking new info like you mentioned, and/or using a different sort scheme for pre-allocation versus the main allocation. In the meantime I'm doing a little bit of refactoring of sort_clone_instance for readability, for my own benefit. I'm also not totally convinced that it's 100% adhering to the allocation order that it advertises in the comments...

I did #2327 while investigating. I hadn't finished what I wanted to do with it but if you want it, grab it. Makes tracing less unpleasant.

kgaillot avatar Mar 23 '21 14:03 kgaillot

BTW what is node->count meant to keep track of? It seems to be used similarly to node->details->num_resources, but not quite.

->count is only used for clones, and is the number of instances of the particular clone being allocated (it's reset in distribute_children()). Notice count is not in ->details, which means it's not shared between all uses of the node object. When we keep node lists (data_set->nodes, rsc->allowed_nodes, etc.), everything in ->details is shared across all lists, and only the outer pe_node_t object is unique. The outer members can be different for each resource and are used during allocation.
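
For reference, a trimmed-down sketch of the two layers (see include/crm/pengine/pe_types.h for the real definitions):

    #include <glib.h>

    // Simplified sketch: the outer object is unique to each node list (and so
    // can carry per-resource allocation state), while everything reachable
    // through ->details is shared by every copy of the node
    struct pe_node_s {
        int weight;                         // this resource's score on the node
        gboolean fixed;                     // whether the weight is fixed
        int count;                          // clones only: instances allocated here
        struct pe_node_shared_s *details;   // shared: id, uname, attrs,
                                            // running_rsc, num_resources, ...
        int rsc_discover_mode;
    };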

One day we'll break backward compatibility so we can make bigger changes in the API. It would make more sense to have the ->details object be the node object (data_set->nodes could probably use just these), and call the current node object a node allocation object or something like that.

kgaillot avatar Mar 23 '21 14:03 kgaillot

The latest push contains a different approach. It doesn't modify the sort order but it does eliminate the swapping. I considered various ways to modify the sort order to achieve what we want, and all of it seemed more complicated than this current approach.

Basically, during pre-allocation, if an instance prefers a different node than its current node, we make that preferred node unavailable and try pre-allocating the instance to its current node again. We keep a counter of how many instances we need to reserve for nodes with higher weights: every time an instance wants to go to a node other than its current node, we increment the reserved counter.

The reserved counter gets reset for each instance that we pre-allocate. We stop trying to pre-allocate an instance to its current node if allocated plus reserved reaches max -- allocating that instance to its current node is of too low a priority.

We have to do some bookkeeping of allowed_nodes along the way, since we have to modify an instance's allowed_nodes to tell the allocator which nodes should be unavailable. I try to explain in the comments as needed.
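
In very rough sketch form (heavily simplified; try_allocation() and ban_node_for_instance() stand in for the real allocate_instance() call and the allowed_nodes manipulation, and the backup/restore of allowed_nodes is elided into comments):

    #include <crm/pengine/status.h>

    // Hypothetical stand-ins for the real calls (bodies omitted)
    static pe_node_t *try_allocation(pe_resource_t *instance,
                                     pe_working_set_t *data_set);
    static void ban_node_for_instance(pe_resource_t *instance,
                                      const pe_node_t *node);

    // Sketch of the pre-allocation retry loop described above
    static void
    preallocate_to_current(pe_resource_t *instance, int allocated, int max,
                           pe_working_set_t *data_set)
    {
        pe_node_t *current = (instance->running_on != NULL)?
                             (pe_node_t *) instance->running_on->data : NULL;
        int reserved = 0;   // instances owed to higher-weight nodes

        // ... back up instance->allowed_nodes here ...

        while ((current != NULL) && ((allocated + reserved) < max)) {
            pe_node_t *chosen = try_allocation(instance, data_set);

            if ((chosen == NULL) || (chosen->details == current->details)) {
                break;   // allocation failed, or the instance is staying put
            }

            // The instance preferred some other node: reserve an instance for
            // that node, make it unavailable here, and retry so the current
            // node can win (the real code also undoes the unwanted allocation
            // before retrying)
            reserved++;
            ban_node_for_instance(instance, chosen);
        }

        // ... restore instance->allowed_nodes here ...
    }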

I also added three refactor commits that helped make sort_clone_instance() more readable for me when I was familiarizing myself with it.


I just noticed this conflicts with your WIP #2327, so one of our PRs would have to be updated. It wouldn't be hard to check rsc->allocated_to after an allocation instead of getting the chosen node from the return value.

Also note that a block from allocate_instance() has been moved to distribute_children() as part of my current approach, so that we have direct access to allocated and max.

nrwahl2 avatar Mar 29 '21 08:03 nrwahl2

Opened #2333 with refactor commits. Minor changes compared to the refactors in this PR:

  • Renamed sort_instance_by_colocation to order_instance_by_colocation, since we're not sorting a list.
  • Updated comparison order comments to be declarative ("X sorts first") instead of solely questions.
  • Compared nnodes > 0 explicitly.
  • Compared node1 and node2 != NULL explicitly in the "current location is available" section. This was already being done elsewhere.

nrwahl2 avatar Mar 29 '21 18:03 nrwahl2

Latest push is simply a rebase onto current master, to incorporate #2333.

nrwahl2 avatar Mar 29 '21 19:03 nrwahl2

FYI, I just merged a commit that changed crm_simulate's output formatting, so the scheduler regression test outputs here will need updating

kgaillot avatar Apr 05 '21 15:04 kgaillot

I've pushed with some changes that address problems and sources of confusion that arose along the way. Let me know if you want some of these to go into a separate PR for faster merging.

  • Rebased on current master and updated regression test outputs.
  • Removed the extraneous, whitespace-only regression test output changes that were accidentally added on the last push.
  • There's a new pcmk__location_e bitfield enum (see the sketch after this list). It's currently private and contains values for allocated and running. The idea is to make it public at a compatibility break and use it in native_rsc_location() with the addition of a flag for pending.
  • pe_node_attribute_calculated() accepts a pcmk__location_e enum to determine whether it looks up the container's allocated host or current host. promotion_score() gets called once in apply_master_prefs() (pre-allocation) and once in pcmk__set_instance_roles(), so lookup_promotion_score() checks the provisional state to determine which location enum to use.
  • The apply_master_prefs() function has been renamed to pcmk__apply_promotion_prefs(). It's internal API.
  • promotion_order() is broken into helper functions for readability. This is especially important considering other changes that needed to be introduced, which add more ordering code.
  • Fixed an issue where positive bundle colocations (in both directions) were being ignored when determining promotion order.
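
A sketch of the enum as it stands in this push (the value names here are illustrative and the real declaration may differ):

    // Sketch: private bitfield enum saying which of a resource's nodes we
    // mean -- the node it has been allocated to, or the node it is currently
    // running on. A "pending" flag would be added when this becomes public.
    enum pcmk__location_e {
        pcmk__location_allocated = (1 << 0),
        pcmk__location_running   = (1 << 1),
    };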

I'm still thinking about the best way to let negative colocations influence the promotion order. Remember the case that started all this: an IPaddr2 is colocated with a bundle's promoted instance and also banned from a particular host (controller-2). We have to tell the scheduler to prefer not to promote the bundle on controller-2. (I haven't checked whether the same problem affects regular promotable clones right now.)

However, the bundle shouldn't be prevented from promoting on controller-2, since the IPaddr2 is the dependent resource. It should only be discouraged from promoting there. It should be allowed to start on controller-2 if there is no better node with respect to dependent colocated resources' preferences.

Through all this, we don't want to overwrite the sort_index, because it holds the "final" preference based on promotion score, promoted role, and positive colocations. I think we want to be able to fall back on that. So we may need to store an additional piece of data related to the negative colocations, or use the node weights directly for an initial check and then fall back on sort_index if the weights are equal (or something similar).
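
As a rough sketch of the fallback idea (how the negative-colocation-adjusted weight would be computed and stored is exactly the open question, so it's just a parameter here):

    #include <crm/pengine/status.h>

    // Sketch: compare two promotion candidates by a colocation-adjusted node
    // weight first, and only fall back on the existing sort_index when the
    // adjusted weights tie
    static gint
    cmp_promotion_preference(const pe_resource_t *rsc1, int adjusted_weight1,
                             const pe_resource_t *rsc2, int adjusted_weight2)
    {
        if (adjusted_weight1 != adjusted_weight2) {
            return adjusted_weight2 - adjusted_weight1;   // higher weight first
        }
        return rsc2->sort_index - rsc1->sort_index;       // then sort_index
    }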

On a related note, we also have to determine how the negative dependent colocations should weigh against the positive colocations -- e.g., whether they should be given the same precedence or not.

Anyway, this should all be easier to reason about now with the refactors and with the positive colocations fixed.

nrwahl2 avatar Apr 06 '21 08:04 nrwahl2

I'll wait on a full review until you're done with whatever you're planning (no rush)

kgaillot avatar Apr 19 '21 20:04 kgaillot

Thanks, it's been pretty busy for me lately :)

New push to rebase on current master and update according to feedback:

  • Incorporated all the master -> promoted changes.
  • Changed pcmk__location_e to pcmk__rsc_node_e and updated the value names accordingly. Split the addition of the enum, and its use in pe_node_attribute_calculated(), into their own commits.
  • Moved score array declaration outside the for loop in apply_sort_indices_to_wrapper_weights().
  • Introduced PCMK__SCORE_MAX_LEN (defined as 9, via "-INFINITY") and used it in place of arbitrary-length arrays for score2char_stack() (see the sketch after this list). I wasn't sure whether to put this in crm.h (with the INFINITY macros), util.h (with score2char_stack()), or internal.h. A case could be made for either of the public headers, but I put it in internal.h for now because we don't need it externally. I'm happy to move it though.
  • Made pointer and integer variables const where appropriate in newly added functions.
  • Fixed the work_nodes scheme in apply_promoted_dependent_colocations() and apply_promoted_primary_colocations() (I hope ;) ) by instead making a backup table and restoring from it, in the case of bundles.
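
For illustration, roughly how the new macro is meant to be used with score2char_stack() (a sketch; the definition in the PR is the authoritative one):

    #include <stdio.h>
    #include <crm/common/util.h>    // score2char_stack()

    // Longest score string is "-INFINITY" (9 characters), so size stack
    // buffers from the macro instead of picking arbitrary lengths
    #define PCMK__SCORE_MAX_LEN 9

    static void
    print_score(int score)
    {
        char score_s[PCMK__SCORE_MAX_LEN + 1];

        score2char_stack(score, score_s, sizeof(score_s));
        printf("score=%s\n", score_s);
    }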

I'm hoping to get a draft of the negative colocation stuff for promotion order taken care of by end of week. That's the last piece of the puzzle. I believe we need to consider negative preference scores of resources that depend on the promotable clone, without those dependent scores downright preventing a primary resource from promoting on a particular node.

nrwahl2 avatar Apr 21 '21 09:04 nrwahl2

FYI, I just started the 2.1.0 release cycle. If you get back to this, feel free to keep it against master or resubmit against 2.1 as desired.

EDIT: 2.1.0 has been released, so keep this against master.

kgaillot avatar Apr 30 '21 21:04 kgaillot