
mas upgrade command for one instance requires manage on a different instance to be up and running.

Open creyeshs opened this issue 8 months ago • 2 comments

MAS CLI version

quay.io/ibmmas/cli:13.15.0

CLI function used

upgrade

What happened?

First of all, this is an air-gapped environment. I ran the "mas upgrade" command, and the instance selection gave me 3 options. I entered the second instance ID (inst2) and proceeded with the upgrade.
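(For context, the CLI was run from the container image noted above; one common way to launch it looks roughly like the sketch below. This report does not show the exact invocation, and an air-gapped setup also involves a mirrored registry and extra options not shown here.)

```bash
# Start an interactive MAS CLI container (image version taken from this report)
docker run -ti --rm quay.io/ibmmas/cli:13.15.0

# Inside the container, start the interactive upgrade flow
mas upgrade
```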

  1. Instance Selection
     Select a MAS instance to upgrade from the list below:
     • inst1 v8.11.7
     • inst2 v8.11.7
     • inst3 v8.11.7

     Enter MAS instance ID: inst2

  2. License Terms
     To continue with the upgrade, you must accept the license terms:
     • https://ibm.biz/MAS90-License
     • https://ibm.biz/MaximoIT90-License
     • https://ibm.biz/MAXArcGIS90-License

     Do you accept the license terms? [y/n] y

  3. Review Settings
     Instance ID ..................... inst2
     Current MAS Channel ............. 8.11.x
     Next MAS Channel ................ 9.0.x
     Skip Pre-Upgrade Checks ......... False

     Proceed with these settings?? [y/n] y

  4. Launch Upgrade
     ✅️ OpenShift Pipelines Operator is installed and ready to use
     ✅️ Namespace is ready (mas-inst2-pipelines)
     ✅️ Latest Tekton definitions are installed (v13.15.0)
     ✅️ PipelineRun for inst2 upgrade submitted

Now the upgrade cannot continue because Manage on instance inst3 is not up. Every instance should be totally independent of the others. Why would I need one instance to be up in order to upgrade a different instance?

The pre-upgrade-check step fails here:

TASK [ibm.mas_devops.ocp_verify : Check Deployment & StatefulSet Status] *******
Checking Deployments are healthy (1/40 retries with a 300 second delay)
[NOTREADY] mas-inst3-manage/inst3-masgolf-all = 1 replicas/None ready/1 updated/None available
Finished check: Delaying 300 seconds before next check
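(For reference, the condition this check is tripping on can be inspected directly at the Kubernetes level. The sketch below is a rough equivalent of the replica comparison, not the role's exact logic.)

```bash
# List every Deployment's desired vs. ready replica counts, cluster-wide.
# Note: readyReplicas is omitted entirely when no replicas are ready, which
# is what appears as "None ready" in the pre-check output above.
oc get deployments --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.replicas}{"\t"}{.status.readyReplicas}{"\n"}{end}'
```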

Relevant log output

step-ocp-verify-workloads

Export all env vars defined in /workspace/settings
Using /opt/app-root/src/ansible.cfg as config file
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'
running playbook inside collection ibm.mas_devops
[DEPRECATION WARNING]: community.general.yaml has been deprecated. The plugin 
has been superseded by the the option `result_format=yaml` in callback plugin 
ansible.builtin.default from ansible-core 2.13 onwards. This feature will be 
removed from community.general in version 13.0.0. Deprecation warnings can be 
disabled by setting deprecation_warnings=False in ansible.cfg.

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [ibm.mas_devops.ansible_version_check : Verify minimum Ansible version is 2.10.3] ***
ok: [localhost] => changed=false 
  msg: All assertions passed

TASK [ibm.mas_devops.ocp_verify : Check if cluster is ready] *******************
skipping: [localhost] => changed=false 
  false_condition: verify_cluster
  skip_reason: Conditional result was False

TASK [ibm.mas_devops.ocp_verify : Check CatalogSource Status] ******************
skipping: [localhost] => changed=false 
  false_condition: verify_catalogsources
  skip_reason: Conditional result was False

TASK [ibm.mas_devops.ocp_verify : Check Subscription Status] *******************
skipping: [localhost] => changed=false 
  false_condition: verify_subscriptions
  skip_reason: Conditional result was False

TASK [ibm.mas_devops.ocp_verify : Check Deployment & StatefulSet Status] *******
Checking Deployments are healthy (1/40 retries with a 300 second delay)
[NOTREADY] mas-inst3-manage/inst3-masgolf-all = 1 replicas/None ready/1 updated/None available
Finished check: Delaying 300 seconds before next check
Checking Deployments are healthy (2/40 retries with a 300 second delay)
[NOTREADY] mas-inst3-manage/inst3-masgolf-all = 1 replicas/None ready/1 updated/None available
Finished check: Delaying 300 seconds before next check
Checking Deployments are healthy (3/40 retries with a 300 second delay)
[NOTREADY] mas-inst3-manage/inst3-masgolf-all = 1 replicas/None ready/1 updated/None available
Finished check: Delaying 300 seconds before next check
Checking Deployments are healthy (4/40 retries with a 300 second delay)
[NOTREADY] mas-inst3-manage/inst3-masgolf-all = 1 replicas/None ready/1 updated/None available
Finished check: Delaying 300 seconds before next check
Checking Deployments are healthy (5/40 retries with a 300 second delay)
[NOTREADY] mas-inst3-manage/inst3-masgolf-all = 1 replicas/None ready/1 updated/None available
Finished check: Delaying 300 seconds before next check

creyeshs · Apr 03 '25 22:04

Please look at IBM case number [TS018940690]. I fully understand that at the cluster level things should be working, but for several reasons the instances should be independent. One Manage deployment should not depend on another.

creyeshs · Apr 07 '25 19:04

Our install, upgrade, and update pre-checks are designed to prioritize safety and to be extra pessimistic if they see anything unhealthy in the cluster. If we detect any unhealthy catalogs/operators/deployments/statefulsets in the cluster, we abort the operation before it makes any changes, on the assumption that the problem found may indicate a wider issue on the cluster which could impact the upgrade/install/update that you are about to start.

This safety check can be disabled by adding --skip-pre-check to the command. The actual health of the other MAS instances doesn't really factor into this, as long as the basic Kubernetes resources are healthy. The deployment that is unhealthy here happens to be a MAS/Manage one, but the failure is being assessed purely at the Kubernetes resource level: there are one or more unhealthy deployments in this cluster, which may indicate a problem in the cluster, so we will not proceed with this action (which places a reasonable amount of load on the Kubernetes API server).
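In practice, that means the check can be bypassed once you are confident the unrelated deployment is the only problem. A sketch based on the flag mentioned above (the interactive prompts are unchanged):

```bash
# Re-run the upgrade, skipping the cluster-wide pre-upgrade health check
mas upgrade --skip-pre-check
```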

durera · Apr 10 '25 19:04