
Bug: NMP statuses not removed from the exchange after node no longer matches

Open MaxMcAdam opened this issue 2 years ago • 3 comments

Describe the bug.

No response

Describe the steps to reproduce the behavior.

No response

Expected behavior.

No response

Screenshots.

No response

Operating Environment

any

Additional Information

No response

MaxMcAdam avatar May 19 '22 16:05 MaxMcAdam

@MaxMcAdam - Tried to verify the defect today on hzn version 913.

root@tbsK3agent1:~/444-agentfiles# hzn version
Horizon CLI version: 2.30.0-913
Horizon Agent version: 2.30.0-913

Node being used is a k3s edge cluster agent: tbsk3agent

We were using this nmp file:

cat tbs-cert-upgrade.json
{
  "label": "tbs test in kmb-org",
  "description": "tbs cert upgrade",
  "enabled": true,
  "constraints": [
    "openhorizon.example == \"operator\""
  ],
  "start": "now",
  "startWindow": 0,
  "agentUpgradePolicy": {
    "manifest": "cert102",
    "allowDowngrade": false
  }
}

Karen ran an upgrade on the tbsk3agent node and checked the nmp status. The node was already at the latest level, so no action was required, and the nmp status reflects that at this point:

hzn ex nmp status tbsupgrade -u root/root:glkPRbwwFbvGZThtlnHJOKgVMJMOax
{
  "kmb-org/tbsK3agent1": "no action required"
}

hzn eventlog list from the node:

New node management policy status created for policy kmb-org/tbsupgrade.",
"2022-06-23 19:04:46: Node management status for kmb-org/tbsupgrade changed to download started.",
"2022-06-23 19:04:46: Node management status for kmb-org/tbsupgrade changed to no action required.",

We then removed a node property from the tbsk3agent node: I removed the node property openhorizon.example=operator. The active agreement was taken down, as the eventlog below shows:

2022-06-23 19:12:45: Node policy updated with the Exchange copy: map[deployment:map[properties: constraints:] management:map[properties: constraints:] properties:[map[name:openhorizon.arch value:amd64] map[name:openhorizon.cpu value:2] map[name:openhorizon.allowPrivileged value:true] map[value:v1.21.12-rc1+k3s1 name:openhorizon.kubernetesVersion] map[value:2,079 name:openhorizon.memory]] constraints:]",
"2022-06-23 19:12:55: Start terminating agreement for ibm.nginx-operator. Termination reason: node policy changed",
"2022-06-23 19:12:55: Complete terminating agreement for ibm.nginx-operator. Termination reason: node policy changed",
"2022-06-23 19:12:55: Workload destroyed for ibm.nginx-operator",

So - at this point, the nmp for this node no longer matches. The nmp status should be removed.

We checked the nmp with dryrun to confirm the nmp no longer matched the node:

root@kmbt21:~# hzn ex nmp add -f tbs-cert-upgrade.json tbsupgrade --dry-run --applies-to
[]
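The empty dry-run result is what we would expect once the property is gone. As a rough illustration only (this is not the anax implementation), here is a minimal Python sketch of the check. The helper names `constraint_matches` and `nmp_applies` are hypothetical, and the sketch assumes only simple `name == "value"` constraints like the one in our nmp file:

```python
import re

# Hypothetical helper, not anax code: only understands `name == "value"`.
_CONSTRAINT = re.compile(r'^\s*([\w.]+)\s*==\s*"?([^"]*?)"?\s*$')


def constraint_matches(constraint, properties):
    """Return True if the node properties satisfy one equality constraint."""
    m = _CONSTRAINT.match(constraint)
    if m is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    name, expected = m.group(1), m.group(2)
    # Index the node properties by name, comparing values as strings.
    values = {p["name"]: str(p["value"]) for p in properties}
    return values.get(name) == expected


def nmp_applies(constraints, properties):
    """Assumption: an nmp applies to a node only if every constraint matches."""
    return all(constraint_matches(c, properties) for c in constraints)


constraints = ['openhorizon.example == "operator"']
with_property = [
    {"name": "openhorizon.arch", "value": "amd64"},
    {"name": "openhorizon.example", "value": "operator"},
]
# Same node after openhorizon.example has been removed.
without_property = [p for p in with_property if p["name"] != "openhorizon.example"]

print(nmp_applies(constraints, with_property))     # True
print(nmp_applies(constraints, without_property))  # False
```

With the property removed, the constraint has nothing to compare against, so the node drops out of the applies-to list.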

This cmd still shows that the nmp status is present for the node (we waited approx 20 mins, checking periodically):

root@kmbt21:~# hzn ex nmp status tbsupgrade
{
  "kmb-org/tbsK3agent1": "no action required"
}

I then added the node property openhorizon.example==operator back into the node properties via the UI.

"2022-06-23 19:34:36: Node policy updated with the Exchange copy: map[constraints: deployment:map[properties:[map[name:openhorizon.example value:operator type:string]] constraints:] management:map[properties: constraints:] properties:[map[name:openhorizon.arch value:amd64] map[name:openhorizon.cpu value:2] map[name:openhorizon.allowPrivileged value:true] map[name:openhorizon.kubernetesVersion value:v1.21.12-rc1+k3s1] map[value:2,079 name:openhorizon.memory]]]",
"2022-06-23 19:34:37: Node received Proposal message using agreement 31cd9ddfa8c9af582326ee51dc1caba9b77df28acbed933deb780bccae84ffa8 for service IBM/ibm.nginx-operator from the agbot IBM/agbot.",
"2022-06-23 19:34:47: Agreement reached for service ibm.nginx-operator. The agreement id is 31cd9ddfa8c9af582326ee51dc1caba9b77df28acbed933deb780bccae84ffa8.",
"2022-06-23 19:34:47: Start workload service for IBM/ibm.nginx-operator.",
"2022-06-23 19:34:52: Workload service containers for IBM/ibm.nginx-operator are up and running."

So - now the nmp policy matches the node again - as shown below by the nmp add --dry-run cmd:

hzn ex nmp add -f tbs-cert-upgrade.json tbsupgrade --dry-run --applies-to
[
  "kmb-org/tbsK3agent1"
]

And.... now the nmp status has been removed:

root@kmbt21:~# hzn ex nmp status tbsupgrade
Error: Status for NMP tbsupgrade not found in org kmb-org

Seems like the nmp status is not being keyed on the correct action to remove the status when the existing agreement is torn down and the properties no longer match the nmp.

It appears nmp status is not getting cleared until the node comes back up and forms a new agreement.
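To make the expected behavior concrete, here is a small Python sketch of the cleanup we were expecting. It is only an illustration of the idea, not how the agent or exchange does it. The function names `stale_statuses` and `prune` are made up for this example; the inputs mirror the `hzn ex nmp status` output and the `--applies-to` dry-run result from the steps above:

```python
def stale_statuses(statuses, applies_to):
    """Return nodes that still have an nmp status but no longer match the nmp."""
    matching = set(applies_to)
    return sorted(node for node in statuses if node not in matching)


def prune(statuses, applies_to):
    """Return the statuses we would expect to remain after cleanup."""
    matching = set(applies_to)
    return {node: s for node, s in statuses.items() if node in matching}


# Status as reported while the node no longer matched the nmp.
statuses = {"kmb-org/tbsK3agent1": "no action required"}

# The dry run returned [] at that point, so the status is stale.
print(stale_statuses(statuses, applies_to=[]))  # ['kmb-org/tbsK3agent1']
print(prune(statuses, applies_to=[]))           # {}
```

In our test the status stayed in place while the applies-to list was empty, which is the mismatch this issue is about.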

Karen is familiar with this setup, and still has the node connected to her org, in case you need more info or another recreate.

tbsloan avatar Jun 23 '22 22:06 tbsloan

@tbsloan it takes a minute or two to have the effect. Can you just do one change and wait for the result and see if it is correct?

linggao avatar Jun 24 '22 20:06 linggao

@linggao @MaxMcAdam Karen and I retested this nmp status removal scenario again on Wed.

It appears that removal of the nmp status for a node is triggered when 'management' properties are defined, such as in the nmp below:

{
  "management": {
    "properties": [
      {
        "name": "tbsnode",
        "value": "manageme"
      }
    ]
  }
}

Using the nmp above, we saw that the nmp status was removed for that node after a short time, once the nmp policy no longer matched for that node.

The original issue we noticed still exists, if a higher-level node property is defined in the nmp, like in this nmp policy file:

{
  "label": "tbs test in kmb-org",
  "description": "tbs cert upgrade",
  "enabled": true,
  "constraints": [
    "openhorizon.example == \"operator\""
  ],
  "start": "now",
  "startWindow": 0,
  "agentUpgradePolicy": {
    "manifest": "cert102",
    "allowDowngrade": false
  }
}

Using the nmp above (which does not contain explicit 'management' properties), the nmp status is still not cleared once the nmp no longer matches the target node. This is likely working as designed and is probably OK for now.

For future function in this area: if an nmp matches a node on the --applies-to dry-run test, it would be nice if the nmp status could be removed for that node once a node property is removed and the nmp no longer matches that node.

tbsloan avatar Jul 08 '22 13:07 tbsloan