
BIG-IP clustering failures

Open ghost opened this issue 4 years ago • 8 comments

Environment

  • Declarative Onboarding Version: 1.23.0
  • BIG-IP Version: 16.1.0 build 0.0.19

Summary

I'm seeing failures to cluster when using DO via bigip-runtime-init in AWS.

Steps To Reproduce

I'm using Terraform to deploy BIG-IP instances via bigip-runtime-init (DO and AS3). This is my project's repository: https://github.com/grf5/f5-bigip-aws-tgw-lambda-failover

Steps to reproduce the behavior:

  1. Submit the following declaration (translated from the bigip-runtime-init YAML):
{
  "schemaVersion": "1.23.0",
  "class": "Device",
  "async": true,
  "label": "BIG-IP Onboarding",
  "Common": {
    "class": "Tenant",
    "systemConfig": {
      "class": "System",
      "autoCheck": false,
      "autoPhonehome": false,
      "cliInactivityTimeout": 3600,
      "consoleInactivityTimeout": 3600,
      "hostname": "mybigip.lab.local"
    },
    "sshdConfig": {
      "class": "SSHD",
      "inactivityTimeout": 3600,
      "protocol": 2
    },
    "customDbVars": {
      "class": "DbVariables",
      "provision.extramb": 500,
      "restjavad.useextramb": true,
      "ui.system.preferences.recordsperscreen": 250,
      "ui.system.preferences.advancedselection": "advanced",
      "ui.advisory.enabled": true,
      "ui.advisory.color": "green",
      "ui.advisory.text": "AWS Instance"
    },
    "ntpConfiguration": {
      "class": "NTP",
      "servers": [
        "169.254.169.123",
        "0.pool.ntp.org",
        "1.pool.ntp.org",
        "2.pool.ntp.org"
      ],
      "timezone": "EST"
    },
    "Provisioning": {
      "class": "Provision",
      "ltm": "nominal",
      "asm": "nominal"
    },
    "admin": {
      "class": "User",
      "userType": "regular",
      "password": "${bigipAdminPassword}",
      "shell": "bash"
    },
    "data-vlan": {
      "class": "VLAN",
      "interfaces": [{
        "name": "1.1",
        "tagged": false
      }],
      "mtu": 9001
    },
    "data-self": {
      "class": "SelfIp",
      "address": "{{{ DATAPLANE_IP }}}",
      "vlan": "data-vlan",
      "allowService": "default",
      "trafficGroup": "traffic-group-local-only"
    },
    "data-default-route": {
      "class": "Route",
      "gw": "{{{ DATAPLANE_GATEWAY }}}",
      "network": "default",
      "mtu": 1500
    },
    "cmConfigSync": {
      "class": "ConfigSync",
      "configsyncIp": "/Common/data-self/address"
    },
    "cmFailoverAddress": {
      "class": "FailoverUnicast",
      "address": "/Common/data-self/address"
    },
    "cmTrust": {
      "class": "DeviceTrust",
      "localUsername": "admin",
      "localPassword": "${bigipAdminPassword}",
      "remoteHost": "${cm_peer_mgmt_ip}",
      "remoteUsername": "admin",
      "remotePassword": "${bigipAdminPassword}"
    },
    "trafficGroup": {
      "class": "TrafficGroup",
      "autoFailbackEnabled": false,
      "failoverMethod": "ha-order",
      "haOrder": [
        "${cm_self_hostname}",
        "${cm_peer_hostname}"
      ]
    },
    "failoverGroup": {
      "class": "DeviceGroup",
      "type": "sync-failover",
      "members": [
        "${cm_self_hostname}",
        "${cm_peer_hostname}"
      ],
      "owner": "${cm_failover_group_owner}",
      "autoSync": true,
      "saveOnAutoSync": true,
      "networkFailover": true,
      "fullLoadOnSync": false,
      "asmSync": false
    }
  }
}
  2. Observe the following error response in /var/log/restnoded/restnoded.log:
Wed, 13 Oct 2021 02:54:39 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: got error {"code":404}
Wed, 13 Oct 2021 02:54:39 GMT - finest: [f5-declarative-onboarding: restWorker.js] typeof err object
Wed, 13 Oct 2021 02:54:39 GMT - finer: [f5-declarative-onboarding: restWorker.js] tryUntil error: 01020036:3: The requested device group (/Common/failoverGroup) was not found. tries left: 0
Wed, 13 Oct 2021 02:54:39 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: retryOrReject: numRemaining: 0 , code: 404 , message: 01020036:3: The requested device group (/Common/failoverGroup) was not found.
Wed, 13 Oct 2021 02:54:39 GMT - finer: [f5-declarative-onboarding: restWorker.js] Max tries reached.
Wed, 13 Oct 2021 02:54:39 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: got error {"code":404,"message":"tryUntil: max tries reached: 01020036:3: The requested device group (/Common/failoverGroup) was not found.","name":"Error"}
Wed, 13 Oct 2021 02:54:39 GMT - finest: [f5-declarative-onboarding: restWorker.js] typeof err object
Wed, 13 Oct 2021 02:54:39 GMT - finer: [f5-declarative-onboarding: restWorker.js] tryUntil error: tryUntil: max tries reached: 01020036:3: The requested device group (/Common/failoverGroup) was not found. tries left: 231
Wed, 13 Oct 2021 02:54:39 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: retryOrReject: numRemaining: 231 , code: 404 , message: tryUntil: max tries reached: 01020036:3: The requested device group (/Common/failoverGroup) was not found.
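The log above appears to show one retry loop running out of tries ("tries left: 0") while an enclosing loop still has tries remaining ("tries left: 231"). A minimal sketch for summarizing such entries, assuming only the message format visible in the lines above; the function name and the sample list are illustrative and not part of DO:

```python
import re
from collections import Counter

# Patterns taken from the message text in the restnoded.log excerpt above.
NOT_FOUND = re.compile(r"The requested device group \((?P<group>[^)]+)\) was not found")
TRIES_LEFT = re.compile(r"tries left: (?P<n>\d+)")


def summarize_device_group_errors(lines):
    """Count 'device group not found' log lines per device group and
    collect the 'tries left' values reported on those lines."""
    counts = Counter()
    tries = []
    for line in lines:
        match = NOT_FOUND.search(line)
        if not match:
            continue
        counts[match.group("group")] += 1
        left = TRIES_LEFT.search(line)
        if left:
            tries.append(int(left.group("n")))
    return counts, tries


# Three lines copied from the log excerpt above.
sample = [
    "Wed, 13 Oct 2021 02:54:39 GMT - finer: [f5-declarative-onboarding: restWorker.js] "
    "tryUntil error: 01020036:3: The requested device group (/Common/failoverGroup) "
    "was not found. tries left: 0",
    "Wed, 13 Oct 2021 02:54:39 GMT - finer: [f5-declarative-onboarding: restWorker.js] "
    "Max tries reached.",
    "Wed, 13 Oct 2021 02:54:39 GMT - finer: [f5-declarative-onboarding: restWorker.js] "
    "tryUntil error: tryUntil: max tries reached: 01020036:3: The requested device group "
    "(/Common/failoverGroup) was not found. tries left: 231",
]

counts, tries = summarize_device_group_errors(sample)
print(dict(counts), tries)
```

Running the same function over the full log would show how many outer retries elapse before DO gives up and rolls back.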

Expected Behavior

I would expect clustering to complete and the group to become in-sync/active/standby.

Actual Behavior

The DO never completes. Qkviews from both nodes available on iHealth:

  • Primary: https://ihealth.f5.com/qkview-analyzer/qv/17033514/status/overview
  • Secondary: https://ihealth.f5.com/qkview-analyzer/qv/17033506/status/overview

ghost avatar Oct 13 '21 03:10 ghost

I appear to be having similar problems when provisioning is added to my DO declaration. The first device in the cluster succeeds, but the second device fails. When I remove provisioning from the declaration, I can stand up the cluster without problems.

I have tried different variations, but generally I am attempting to provision just LTM and AVR:

            "myProvision": {
                "ltm": "nominal",
                "class": "Provision",
                "avr": "minimum"
            },

This is using the BIG-IQ-based DO, running BIG-IQ and BIG-IP instances on a local KVM hypervisor (lab). It appears to be DO version 1.21.0 (as seen in the BIG-IQ DO show API):


    "class": "DO",
    "declaration": {
        "schemaVersion": "1.21.0",
Sys::Version
Main Package
  Product     BIG-IQ
  Version     8.1.0.2
  Build       0.0.36
  Edition     Point Release 2
  Date        Sat Oct  2 21:52:10 PDT 2021
Sys::Version
Main Package
  Product     BIG-IP
  Version     16.1.0
  Build       0.0.19
  Edition     Final
  Date        Tue Jun 22 23:52:22 PDT 2021

samualblair avatar Dec 03 '21 22:12 samualblair

@samualblair Are you still experiencing this issue with the latest version of DO (1.27.0)?

dstokesf5 avatar Jan 28 '22 21:01 dstokesf5

@dstokesf5

I think my overall understanding of the situation, and of how DO works, is better now. When I posted the previous comment I had not yet done a DO push directly to the BIG-IP, so no DO was actually installed on the BIG-IP.

The BIG-IQ DO version was, and still is, 1.21.0.

Now that I have performed more tests, I am seeing successful DO pushes both directly to BIG-IP (with DO installed) and to BIG-IP (without DO installed) through the BIG-IQ (which still has DO 1.21.0).

It seems that the only failures I was having were when using the BIG-IQ built-in onboarding page. I will perform a few more tests to see whether those issues still occur. I don't see why they would have gone away, since the failures were reproducible previously even with my misunderstanding of how and where DO executes. Then again, I am not sure why that would fail if everything else is working, so I will retest to be sure.

My question is: when you ask whether I am seeing the issue with DO 1.27.0, do you know how I can upgrade the DO instance on the BIG-IQ? The URL/RPM install/update process I used on the BIG-IP doesn't appear to work. Or can DO on BIG-IQ only be upgraded with system upgrades? If so, I didn't see any newer versions of BIG-IQ out yet, so I'm not sure that I could.

samualblair avatar Feb 03 '22 21:02 samualblair

@dstokesf5

UPDATE: A correction to my last comment. I am not crazy, just a little careless in my assumptions: DO is failing through the BIG-IQ GUI as well as with BIG-IQ DO REST calls from my workstation.

It looks like my DO pushes through the BIG-IQ (using REST calls) were actually failing. After spending too much time on this I got excited and thought it was all working, but it wasn't. I noticed after a while that the cm status was still 'standalone' (I know I should have seen this earlier). It was hanging and working through a very, very slow timeout before rolling back.

So when I push directly to 2x BIG-IP running DO 1.27.0, this works as expected.

When I push to BIG-IQ DO (to configure 2x BIG-IP) this fails, similar to when I craft/push through the BIG-IQ GUI management. This at least makes a lot more sense in my mind.

Back to my question, then: is there a process for upgrading the BIG-IQ DO version that you can point me to? If so, I will gladly upgrade BIG-IQ and test with DO 1.27.0.

samualblair avatar Feb 03 '22 21:02 samualblair

@dstokesf5

The specific failure that I am seeing logged, after all the timeouts, is:

[ "tryUntil: max tries reached: tryUntil: max tries reached: 01020036:3: The requested device group (/Common/myDeviceGroup-HA) was not found.", "unicastAddresses.map is not a function" ]

Looking through the existing bugs, this seems most likely to be caused by issue 199: https://github.com/F5Networks/f5-declarative-onboarding/issues/199

Possibly even issue 201 (although from my perspective it looks like the same underlying issue): https://github.com/F5Networks/f5-declarative-onboarding/issues/201

To be clear, I am not using any IPv6 in my declarations (now at least), but the device does have both v4 and v6 on mgmt. My management interface has IPv4 and IPv6 addresses (via DHCP and SLAAC) prior to the initial DO push. In either case the problem should have been resolved in DO 1.22, but the BIG-IQ is still running 1.21.

I will try to statically assign only v4, and then see if pushes are working as expected.
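Before retesting, it can help to confirm which management addresses on each device are IPv6. A minimal sketch using Python's standard `ipaddress` module, assuming the addresses have been copied by hand from the device; the function name and the example addresses (documentation ranges) are illustrative only:

```python
import ipaddress


def split_by_ip_version(addresses):
    """Split address strings (with or without a /prefix) into IPv4 and IPv6 lists."""
    v4, v6 = [], []
    for addr in addresses:
        # ip_interface accepts both "192.0.2.10" and "192.0.2.10/24".
        ip = ipaddress.ip_interface(addr).ip
        (v6 if ip.version == 6 else v4).append(addr)
    return v4, v6


# Hypothetical management addresses, not taken from the devices in this thread.
mgmt_addresses = ["192.0.2.10/24", "2001:db8::10/64"]
v4, v6 = split_by_ip_version(mgmt_addresses)
print("IPv4:", v4)
print("IPv6:", v6)
```

If the IPv6 list is non-empty on a device before the DO push, that device matches the condition described above.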

samualblair avatar Feb 03 '22 22:02 samualblair

@dstokesf5

I disabled DHCP and removed the IPv6 management IP but left the IPv4 management IP on the BIG-IPs, then pushed my DO declarations through BIG-IQ, still running DO v1.21.0. The pushes are successful. The BIG-IP devices fully load the configuration, form a cluster, and show IN SYNC status.

This is really looking like issue: https://github.com/F5Networks/f5-declarative-onboarding/issues/199

samualblair avatar Feb 03 '22 23:02 samualblair

@dstokesf5

Alright I found the details on upgrading BIG-IQ DO and have done so. Now I have attempted with BIG-IQ DO running 1.27.0.
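A minimal sketch for checking the reported DO version against the 1.22 release mentioned earlier in the thread, assuming the info response is available as a JSON object with a `version` field like the fragment quoted in this comment; the function name is illustrative:

```python
import json


def do_version_at_least(info, minimum):
    """Return True if info['version'] is greater than or equal to minimum.

    Both values are dotted numeric strings such as "1.27.0".
    """
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))

    return as_tuple(info["version"]) >= as_tuple(minimum)


# Fields copied from the info fragment quoted in this comment.
info = json.loads(
    '{"schemaCurrent": "1.27.0", "schemaMinimum": "1.0.0", '
    '"selfLink": "https://localhost/mgmt/shared/declarative-onboarding/info", '
    '"version": "1.27.0"}'
)
print(do_version_at_least(info, "1.22.0"))
```

Comparing integer tuples avoids the string-comparison pitfall where "1.9.0" would sort after "1.27.0".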

"schemaCurrent": "1.27.0", "schemaMinimum": "1.0.0", "selfLink": "https://localhost/mgmt/shared/declarative-onboarding/info", "version": "1.27.0"

To be clear I have tried again without modification (so management has IPv4 and IPv6 configured via DHCP and SLAAC).

It appears to be failing just as before, failing to create the HA group.

Mon, 07 Feb 2022 21:22:34 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: got error {"code":404}
Mon, 07 Feb 2022 21:22:34 GMT - finest: [f5-declarative-onboarding: restWorker.js] typeof err object
Mon, 07 Feb 2022 21:22:34 GMT - finer: [f5-declarative-onboarding: restWorker.js] tryUntil error: 01020036:3: The requested device group (/Common/myDeviceGroup-UT-HA-INT) was not found. tries left: 0
Mon, 07 Feb 2022 21:22:34 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: retryOrReject: numRemaining: 0 , code: 404 , message: 01020036:3: The requested device group (/Common/myDeviceGroup-UT-HA-INT) was not found.
Mon, 07 Feb 2022 21:22:34 GMT - finer: [f5-declarative-onboarding: restWorker.js] Max tries reached.
Mon, 07 Feb 2022 21:22:34 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: got error {"code":404,"message":"tryUntil: max tries reached: 01020036:3: The requested device group (/Common/myDeviceGroup-UT-HA-INT) was not found.","name":"Error"}
Mon, 07 Feb 2022 21:22:34 GMT - finest: [f5-declarative-onboarding: restWorker.js] typeof err object
Mon, 07 Feb 2022 21:22:34 GMT - finer: [f5-declarative-onboarding: restWorker.js] tryUntil error: tryUntil: max tries reached: 01020036:3: The requested device group (/Common/myDeviceGroup-UT-HA-INT) was not found. tries left: 13
Mon, 07 Feb 2022 21:22:34 GMT - finest: [f5-declarative-onboarding: restWorker.js] tryUntil: retryOrReject: numRemaining: 13 , code: 404 , message: tryUntil: max tries reached: 01020036:3: The requested device group (/Common/myDeviceGroup-UT-HA-INT) was not found.

Once all retries have finished the systems successfully rolled back (to initial blank configuration).

I will remove the IPv6 management address and try again to confirm behavior, posting the results later today.

samualblair avatar Feb 07 '22 21:02 samualblair

@dstokesf5

I can confirm that with DO 1.27.0 running on BIG-IQ, the push completes as expected after removing the IPv6 management IP (on the BIG-IP).

Since this doesn't appear to be fixed in DO 1.27.0, I wonder whether it is related to one of these issues: https://github.com/F5Networks/f5-declarative-onboarding/issues/272 https://github.com/F5Networks/f5-declarative-onboarding/issues/260

To be clear, this seems to be happening with otherwise fresh BIG-IPs with just the root/admin password changed prior to the DO push through BIG-IQ. The management IPs are not being configured with DO at this time (no class ManagementIp or class ManagementRoute included in the declaration).

samualblair avatar Feb 08 '22 00:02 samualblair