operator-sdk
run bundle and run bundle-upgrade do not work when the bundle is not added to its default channel
Bug Report
The run bundle and run bundle-upgrade commands fail when the given bundle's channels do not include its default channel:
operators.operatorframework.io.bundle.channel.default.v1: alpha
operators.operatorframework.io.bundle.channels.v1: mce-2.0
By default, the commands create a new index and try to add the bundle to it. So, when the SDK calls OPM, it fails at:
https://github.com/operator-framework/operator-registry/blob/v1.22.0/pkg/sqlite/load.go#L426-L428
https://github.com/operator-framework/operator-registry/blob/fd85a98cd00fdd70e30ce6e7076ea37e2583e724/pkg/sqlite/loadprocs.go#L118-L131
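To illustrate why this surfaces as a FOREIGN KEY error, here is a minimal Python sketch. It assumes a heavily simplified, hypothetical schema in which a package's default channel must reference an existing channel row. It is not the actual operator-registry schema or loading code, and `load_bundle` is a made-up helper name.

```python
import sqlite3

# Simplified, hypothetical schema: only the package -> channel reference is modelled.
SCHEMA = """
CREATE TABLE channel (
    package_name TEXT,
    name TEXT,
    PRIMARY KEY (package_name, name)
);
CREATE TABLE package (
    name TEXT PRIMARY KEY,
    default_channel TEXT,
    FOREIGN KEY (name, default_channel) REFERENCES channel (package_name, name)
);
"""


def load_bundle(package, channels, default_channel):
    """Return None on success, or the SQLite error message on failure."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    try:
        conn.execute("INSERT INTO package (name) VALUES (?)", (package,))
        # Channels are created first ...
        conn.executemany(
            "INSERT INTO channel (package_name, name) VALUES (?, ?)",
            [(package, channel) for channel in channels],
        )
        # ... and only then is the default channel set, which must point at one of them.
        conn.execute(
            "UPDATE package SET default_channel = ? WHERE name = ?",
            (default_channel, package),
        )
        conn.commit()
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)
    finally:
        conn.close()


# Default channel "alpha" is not among the bundle's channels, so the update is rejected.
print(load_bundle("hive-operator", ["mce-2.0"], "alpha"))
# Once "alpha" is also listed as a channel, the same update goes through.
print(load_bundle("hive-operator", ["alpha", "mce-2.0"], "alpha"))
```

Under these assumptions the first call should report a foreign key violation and the second should succeed, which mirrors the failure in the registry pod logs.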
What did you do?
Following the steps
$ operator-sdk run bundle quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6
INFO[0014] Successfully created registry pod: quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6
INFO[0014] Created CatalogSource: hive-operator-catalog
INFO[0014] OperatorGroup "operator-sdk-og" created
INFO[0014] Created Subscription: hive-operator-v2-5-3508-6cb94c6-sub
FATA[0120] Failed to run bundle: install plan is not available for the subscription hive-operator-v2-5-3508-6cb94c6-sub: timed out waiting for the condition
Then, checking the registry pod logs:
$ kubectl logs pod/quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6
time="2022-05-11T00:46:00Z" level=warning msg="\x1b[1;33mDEPRECATION NOTICE:\nSqlite-based catalogs and their related subcommands are deprecated. Support for\nthem will be removed in a future release. Please migrate your catalog workflows\nto the new file-based catalog format.\x1b[0m"
time="2022-05-11T00:46:00Z" level=info msg="adding to the registry" bundles="[quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6]"
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional dependencies file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional properties file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional dependencies file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional properties file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=error msg="permissive mode disabled" bundles="[quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6]" error="error loading bundle into db: FOREIGN KEY constraint failed"
Error: error loading bundle into db: FOREIGN KEY constraint failed
Usage:
opm registry add [flags]
We also hit the same issue with operator-sdk run bundle-upgrade, see: https://github.com/k8s-operatorhub/community-operators/runs/6364587418?check_suite_focus=true#step:3:7120 (More info: https://github.com/k8s-operatorhub/community-operators/issues/1195 )
What did you expect to see?
The run bundle and run bundle-upgrade commands working.
What did you see instead? Under which circumstances?
The commands fail when the bundle is not shipped in its default channel. (The following issues were closed in favor of this one so that we can centralize the info.)
- https://github.com/operator-framework/operator-sdk/issues/5410
- https://github.com/operator-framework/operator-sdk/issues/5616
- https://github.com/operator-framework/operator-sdk/issues/5413
- https://github.com/operator-framework/operator-sdk/issues/4100
- https://github.com/k8s-operatorhub/community-operators/issues/1195
Possible Solution
The SDK commands could replace the default channel value with an empty string (``) when no index is provided, so that OPM does not try to update it. Unless a user provides an index to the commands, the purpose of the commands would not be affected:
- The goal of run bundle is only to check whether the bundle can be deployed with OLM, so the default channel is not relevant.
- The goal of run bundle-upgrade is to check whether an upgrade from the installed bundle to the new one is possible, so unless someone provides an index, the default channel is irrelevant.
Workarounds:
For SDK users that are using it to test the bundle locally
operators.operatorframework.io.bundle.channel.default.v1: alpha
operators.operatorframework.io.bundle.channels.v1: alpha, mce-2.0 // add the default channel to the bundle's channels
For CI/pipelines:
The workaround would be to generate a separate temporary bundle that adds the default channel to its channels, so that this channel is created in the operator registry. Note that all channels are created or updated before we try to set the default channel: https://github.com/operator-framework/operator-registry/blob/v1.22.0/pkg/sqlite/load.go#L419-L429.
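A pipeline could apply this workaround with a small helper along these lines. This is only a sketch: it assumes the bundle annotations have already been read into a plain dict (reading and writing metadata/annotations.yaml is left out), and `ensure_default_in_channels` is a hypothetical name, not an SDK function.

```python
DEFAULT_KEY = "operators.operatorframework.io.bundle.channel.default.v1"
CHANNELS_KEY = "operators.operatorframework.io.bundle.channels.v1"


def ensure_default_in_channels(annotations):
    """Return a copy of the annotations whose channel list includes the default channel."""
    patched = dict(annotations)
    default = patched.get(DEFAULT_KEY, "").strip()
    # The channels annotation is a comma-separated list; tolerate stray whitespace.
    channels = [c.strip() for c in patched.get(CHANNELS_KEY, "").split(",") if c.strip()]
    if default and default not in channels:
        channels.append(default)
    patched[CHANNELS_KEY] = ",".join(channels)
    return patched


original = {DEFAULT_KEY: "alpha", CHANNELS_KEY: "mce-2.0"}
temporary = ensure_default_in_channels(original)
print(temporary[CHANNELS_KEY])
```

The original annotations are left untouched, so the patched copy can be written into the temporary bundle only.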
Additional context
- In OPM, we propose providing context to the errors faced in the SQL statements so we can better clarify scenarios like this one https://github.com/operator-framework/operator-registry/pull/953
Hi @rashmigottipati and @jmrodri, I tried to centralise everything into a single task after checking this scenario.
This shows that the changes required in the commands to support FBC might also be an option to solve this scenario, at least when the user does not pass an index in the SQLite format as an argument.
c/c @VenkatRamaraju
@asmacdo @rashmigottipati this is a bug. Could we add the bug label here? Also, could you please clarify why it needs discussion? What discussion is required about this?
Thanks @camilamacedo86 for steering it
@J0zi @camilamacedo86 PR https://github.com/operator-framework/operator-sdk/pull/5809 was merged into master. This adds support for FBC images and I believe this addition should resolve your issue as well.
I ran the bundle provided in the description against latest master and it successfully installed the CSV.
Below are the logs:
▶ ./build/operator-sdk run bundle quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6
INFO[0007] Creating a File-Based Catalog of the bundle "quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6"
INFO[0009] Generated a valid File-Based Catalog
INFO[0012] Created registry pod: quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6
INFO[0012] Created CatalogSource: hive-operator-catalog
INFO[0012] OperatorGroup "operator-sdk-og" created
INFO[0012] Created Subscription: hive-operator-v2-5-3508-6cb94c6-sub
INFO[0014] Approved InstallPlan install-qzp6x for the Subscription: hive-operator-v2-5-3508-6cb94c6-sub
INFO[0014] Waiting for ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" to reach 'Succeeded' phase
INFO[0014] Waiting for ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" to appear
INFO[0025] Found ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" phase: Pending
INFO[0026] Found ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" phase: Installing
INFO[0059] Found ClusterServiceVersion "default/hive-operator.v2.5.3508-6cb94c6" phase: Succeeded
INFO[0060] OLM has successfully installed "hive-operator.v2.5.3508-6cb94c6"
@rashmigottipati https://github.com/operator-framework/operator-sdk/issues/5616 was not fixed. Still struggling to upgrade
operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2
operator-sdk run bundle-upgrade quay.io/operator_testing/flux:testing0.25.3
or
operator-sdk run bundle quay.io/community-operators-pipeline/apicurito:v1.0.2
operator-sdk run bundle-upgrade quay.io/operator_testing/apicurito:testing-apicurito.v1.0.3
or even production
operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3
operator-sdk run bundle-upgrade quay.io/operatorhubio/aqua:v2022.4.4
/assign
So I did some digging and here are my findings:
When using:
operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2
It was able to successfully install the operator on the cluster.
When going to upgrade the bundle with:
operator-sdk run bundle-upgrade quay.io/operator_testing/flux:testing0.25.3
The command stalls and never completes. The bundle is not properly upgraded.
Doing some further debugging I was able to determine that the point in the source that stalls when running operator-sdk run bundle-upgrade ... is when rendering the refs when attempting to upgrade the FBC here:
https://github.com/operator-framework/operator-sdk/blob/76ac4ba5d7d77590ee0e973c5b97595fcf447495/internal/olm/operator/registry/index_image.go#L239
This filters down to the fbcutil.RenderRefs() function stalling when it calls containerdregistry.NewRegistry() here:
https://github.com/operator-framework/operator-sdk/blob/76ac4ba5d7d77590ee0e973c5b97595fcf447495/internal/olm/fbcutil/util.go#L136-L139
Debugging even further resulted in noticing that when using the operator-framework/operator-registry library to create the new registry it uses the package https://pkg.go.dev/go.etcd.io/bbolt and attempts to Open() a database and stalls here:
https://github.com/operator-framework/operator-registry/blob/a3c883e9beee343bd55fd73c1447ea5e98459951/pkg/image/containerdregistry/options.go#L71-L75
Debugging even deeper down the dependency tree I found that the point in which everything is stalling is due to the attempt to lock the .db file in the bbolt library here: https://github.com/etcd-io/bbolt/blob/fd5535f71f488dda0915f610b6ca8c77c9ca2c59/db.go#L223-L233
We can see that the flock() function accepts a timeout; if none is specified, it loops indefinitely at a 50ms interval trying to acquire a lock on the file. This can be seen in the flock() function's implementation: https://github.com/etcd-io/bbolt/blob/fd5535f71f488dda0915f610b6ca8c77c9ca2c59/bolt_unix.go#L15-L45
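For anyone who wants to reproduce the behaviour outside of Go, here is a rough Python sketch of that retry loop. It is an approximation written from the description above, not the bbolt code; `lock_with_timeout` and the demo file name are hypothetical, and it assumes a Unix system where `fcntl.flock` is available.

```python
import errno
import fcntl
import os
import time

RETRY_INTERVAL = 0.05  # seconds, i.e. the 50ms interval mentioned above


def lock_with_timeout(fd, timeout=0.0):
    """Try to take an exclusive lock on fd.

    Returns True once the lock is held, or False if `timeout` seconds pass first.
    A timeout of 0 means "retry forever", which is the stalling case.
    """
    start = time.monotonic()
    while True:
        try:
            # Non-blocking attempt: fails right away if someone else holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError as exc:
            if exc.errno not in (errno.EWOULDBLOCK, errno.EAGAIN):
                raise
        # Give up only when a timeout was requested and has (almost) elapsed.
        if timeout and time.monotonic() - start > timeout - RETRY_INTERVAL:
            return False
        time.sleep(RETRY_INTERVAL)


# Demo: two independent opens of the same file compete for the lock.
path = "flock_demo_metadata.db"
holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
waiter = os.open(path, os.O_RDWR)

first = lock_with_timeout(holder, timeout=0.2)       # lock is free
blocked = lock_with_timeout(waiter, timeout=0.2)     # still held by `holder`
fcntl.flock(holder, fcntl.LOCK_UN)
after_release = lock_with_timeout(waiter, timeout=0.2)

os.close(holder)
os.close(waiter)
os.remove(path)
```

With `timeout=0` the second attempt would never return while the first descriptor holds the lock, which matches the hang seen when two registries open the same .db file.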
As far as I could tell, the problem is resolved when the name of the .db file used to create the registry is different (right now it is always the same default value of cache/metadata.db). This makes me think that we will need to make some updates to operator-framework/operator-registry to allow for the option to modify:
- The timeout that is used when attempting to create a new registry
- The DB file path to be different than the default to prevent file locking issues
Once those are done we can attempt to make modifications to the FBC upgrade logic that takes advantage of the new functionality.
@jmrodri since you were also looking a bit into this, WDYT?
So the above comment is definitely an issue at the moment because when a fix is applied I am able to use operator-sdk run bundle and operator-sdk run bundle-upgrade to successfully run and upgrade a valid bundle.
@J0zi I suspect for the production operators you mentioned running:
operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3
operator-sdk run bundle-upgrade quay.io/operatorhubio/aqua:v2022.4.4
That you encountered the issue of the command just hanging. Is this correct?
After fixing the command it no longer stalls and is able to run and upgrade that production operator no problem.
As for the other commands, I noticed something else (after fixing the command stalling problem): the images from quay.io/operator_testing/... use an entirely different default channel than the images from quay.io/community-operators-pipeline/.... Because of this, operator-sdk run bundle-upgrade successfully upgrades the bundle and registry pod, but it times out waiting for an InstallPlan that it can approve. The original Subscription created by operator-sdk run bundle watches the stable channel, but the FBC generated by operator-sdk run bundle-upgrade sets the upgrade path in the optest channel. The Subscription is not updated and keeps looking for upgrade paths in the stable channel, so nothing happens and the command times out.
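The mismatch can be pictured with a tiny Python model. The data layout below is a hypothetical stand-in for the channel entries of a File-Based Catalog, the CSV name flux.v0.25.3 is assumed for illustration, and `upgrade_visible` is not an OLM or SDK function.

```python
def upgrade_visible(subscription_channel, catalog_channels, installed, candidate):
    """Return True if `candidate` replaces `installed` in the channel the Subscription follows."""
    entries = catalog_channels.get(subscription_channel, [])
    return any(
        entry["name"] == candidate and entry.get("replaces") == installed
        for entry in entries
    )


# The upgraded catalog only declares the upgrade edge in the "optest" channel.
catalog = {
    "optest": [{"name": "flux.v0.25.3", "replaces": "flux.v0.25.2"}],
}

# The Subscription created by `run bundle` still follows "stable".
seen_from_stable = upgrade_visible("stable", catalog, "flux.v0.25.2", "flux.v0.25.3")
seen_from_optest = upgrade_visible("optest", catalog, "flux.v0.25.2", "flux.v0.25.3")
print(seen_from_stable, seen_from_optest)
```

In this model the edge exists, but not in the channel the Subscription is watching, so no InstallPlan would be expected.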
I suspect that you aren't meant to be able to start with the community released version of the operator (i.e. quay.io/community-operators-pipeline/flux:v0.25.2) and upgrade to the version used for testing (i.e. quay.io/operator_testing/flux:testing0.25.3), because they deliberately use different default channels (stable and optest respectively).
I hope this makes sense and clears up any confusion - if not, please let me know and I can put together a more detailed response with examples. I will work on fixing the issue where the operator-sdk run bundle-upgrade command stalls indefinitely, but once that is fixed it should be working as expected.
So we are waiting for another release of operator-sdk to test it out and we could be unblocked then.
@everettraven @rashmigottipati please reopen
I have tested it and cannot say if update issue was solved because even run bundle is now broken. So we cannot test operator upgrade and this remains a blocker. We cannot run any bundle at all.
➜ operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2
INFO[0013] Creating a File-Based Catalog of the bundle "quay.io/community-operators-pipeline/flux:v0.25.2"
INFO[0014] Generated a valid File-Based Catalog
INFO[0032] Created registry pod: quay-io-community-operators-pipeline-flux-v0-25-2
INFO[0032] Created CatalogSource: flux-catalog
INFO[0032] OperatorGroup "operator-sdk-og" created
INFO[0032] Created Subscription: flux-v0-25-2-sub
INFO[0034] Approved InstallPlan install-92w77 for the Subscription: flux-v0-25-2-sub
INFO[0034] Waiting for ClusterServiceVersion "default/flux.v0.25.2" to reach 'Succeeded' phase
INFO[0034] Waiting for ClusterServiceVersion "default/flux.v0.25.2" to appear
INFO[0082] Found ClusterServiceVersion "default/flux.v0.25.2" phase: Pending
INFO[0085] Found ClusterServiceVersion "default/flux.v0.25.2" phase: InstallReady
INFO[0086] Found ClusterServiceVersion "default/flux.v0.25.2" phase: Installing
FATA[0120] Failed to run bundle: error waiting for CSV to install: timed out waiting for the condition
➜ operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3
INFO[0013] Creating a File-Based Catalog of the bundle "quay.io/operatorhubio/aqua:v2022.4.3"
INFO[0014] Generated a valid File-Based Catalog
INFO[0021] Created registry pod: quay-io-operatorhubio-aqua-v2022-4-3
INFO[0021] Created CatalogSource: aqua-catalog
INFO[0021] Created Subscription: aqua-operator-v2022-4-3-sub
INFO[0024] Approved InstallPlan install-25nf7 for the Subscription: aqua-operator-v2022-4-3-sub
INFO[0024] Waiting for ClusterServiceVersion "default/aqua-operator.v2022.4.3" to reach 'Succeeded' phase
INFO[0024] Waiting for ClusterServiceVersion "default/aqua-operator.v2022.4.3" to appear
INFO[0056] Found ClusterServiceVersion "default/aqua-operator.v2022.4.3" phase: Failed
FATA[0056] Failed to run bundle: error waiting for CSV to install: csv failed: reason: "UnsupportedOperatorGroup", message: "AllNamespaces InstallModeType not supported, cannot configure to watch all namespaces"
➜ operator-sdk run bundle quay.io/community-operators-pipeline/apicurito:v1.0.2
INFO[0020] Creating a File-Based Catalog of the bundle "quay.io/community-operators-pipeline/apicurito:v1.0.2"
INFO[0021] Generated a valid File-Based Catalog
INFO[0028] Created registry pod: quay-io-community-operators-pipeline-apicurito-v1-0-2
INFO[0028] Created CatalogSource: apicurito-catalog
INFO[0028] Created Subscription: apicurito-v1-0-2-sub
INFO[0031] Approved InstallPlan install-wkgf2 for the Subscription: apicurito-v1-0-2-sub
INFO[0031] Waiting for ClusterServiceVersion "default/apicurito.v1.0.2" to reach 'Succeeded' phase
INFO[0031] Waiting for ClusterServiceVersion "default/apicurito.v1.0.2" to appear
INFO[0063] Found ClusterServiceVersion "default/apicurito.v1.0.2" phase: Failed
FATA[0063] Failed to run bundle: error waiting for CSV to install: csv failed: reason: "UnsupportedOperatorGroup", message: "AllNamespaces InstallModeType not supported, cannot configure to watch all namespaces"
operator-sdk version
operator-sdk version: "v1.25.0", commit: "3d4eb4b2de4b68519c8828f2289c2014979ccf2a", kubernetes version: "1.25.0", go version: "go1.19.2", GOOS: "linux", GOARCH: "amd64"
To fix all of these issues, the following should work: https://github.com/operator-framework/operator-sdk/issues/5773#issuecomment-1243412368
Reopening as per @J0zi
@J0zi I will take another look at this and report back what I find
@J0zi So I took a look at this and ran all the same commands you did and for the most part ran into the same errors. The only exception I found was that I was able to successfully install flux with operator-sdk run bundle quay.io/community-operators-pipeline/flux:v0.25.2 with a fresh KinD cluster.
My suspicion is that the subsequent operator-sdk run bundle ... commands are failing because a single OperatorGroup resource is created and used for running all operators in the same namespace. The error FATA[0056] Failed to run bundle: error waiting for CSV to install: csv failed: reason: "UnsupportedOperatorGroup", message: "AllNamespaces InstallModeType not supported, cannot configure to watch all namespaces" makes me think that, for whatever reason, using the same OperatorGroup every time is causing an issue.
I'm planning to investigate this further by changing the default behavior to create a new OperatorGroup for each operator and see if that resolves it.
To confirm that the aqua operator was able to be run successfully in a namespace without that OperatorGroup I created a new namespace and then used run bundle to install it:
bpalmer@bpalmer ~ kubectl create ns aqua-operator
namespace/aqua-operator created
bpalmer@bpalmer ~ operator-sdk run bundle quay.io/operatorhubio/aqua:v2022.4.3 -n aqua-operator
INFO[0003] Creating a File-Based Catalog of the bundle "quay.io/operatorhubio/aqua:v2022.4.3"
INFO[0004] Generated a valid File-Based Catalog
INFO[0006] Created registry pod: quay-io-operatorhubio-aqua-v2022-4-3
INFO[0006] Created CatalogSource: aqua-catalog
INFO[0006] OperatorGroup "operator-sdk-og" created
INFO[0007] Created Subscription: aqua-operator-v2022-4-3-sub
INFO[0010] Approved InstallPlan install-5khks for the Subscription: aqua-operator-v2022-4-3-sub
INFO[0010] Waiting for ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" to reach 'Succeeded' phase
INFO[0010] Waiting for ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" to appear
INFO[0019] Found ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" phase: Pending
INFO[0023] Found ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" phase: Installing
INFO[0029] Found ClusterServiceVersion "aqua-operator/aqua-operator.v2022.4.3" phase: Succeeded
INFO[0029] OLM has successfully installed "aqua-operator.v2022.4.3"
@J0zi So doing another bit of investigation, it seems it is not possible to have multiple OperatorGroups in a single namespace or else all ClusterServiceVersions will enter a failed state as mentioned in the OLM OperatorGroup Documentation.
This makes me think that installing all of these particular operators in the same namespace would have failed anyway, because the flux operator supports the AllNamespaces install mode while both aqua and apicurito do not support that install mode.
I think the solution in this case is to install both the aqua and apicurito operators into a different namespace so that the OperatorGroup created by operator-sdk run bundle is configured to work with their supported install modes.
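As a rough illustration of the rule at play, here is a simplified Python model of how an OperatorGroup's target namespaces map to the install mode an operator must support. It is a sketch of the OLM rule, not OLM code; the helper names are hypothetical and the supported install modes listed for each operator are assumptions based on this thread.

```python
def required_install_mode(og_namespace, target_namespaces):
    """Return the install mode an operator needs for the given OperatorGroup targets."""
    if not target_namespaces:
        # An OperatorGroup with no targets selects every namespace.
        return "AllNamespaces"
    if len(target_namespaces) > 1:
        return "MultiNamespace"
    if target_namespaces[0] == og_namespace:
        return "OwnNamespace"
    return "SingleNamespace"


def csv_compatible(supported_modes, og_namespace, target_namespaces):
    """Check whether an operator's supported install modes fit the OperatorGroup."""
    return required_install_mode(og_namespace, target_namespaces) in supported_modes


# Assumed install modes, for illustration only.
flux_modes = {"AllNamespaces"}
aqua_modes = {"OwnNamespace", "SingleNamespace"}

# The OperatorGroup left behind in "default" by the flux install targets all namespaces.
flux_in_default = csv_compatible(flux_modes, "default", [])
aqua_in_default = csv_compatible(aqua_modes, "default", [])
# A fresh namespace gets its own OperatorGroup targeting just that namespace.
aqua_in_own_ns = csv_compatible(aqua_modes, "aqua-operator", ["aqua-operator"])
print(flux_in_default, aqua_in_default, aqua_in_own_ns)
```

Under these assumptions, aqua only fits once it is installed into a namespace whose OperatorGroup does not target all namespaces, which lines up with the successful run in the aqua-operator namespace.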
I hope this helps!
@everettraven thank you very much for your investigation. I tested multiple operators and upgrade is working. So we can implement it to our pipelines :) Thank you again, you can close the issue.
Closing as per https://github.com/operator-framework/operator-sdk/issues/5773#issuecomment-1296962584
@everettraven we encountered following issue with upgrade:
operator-sdk run bundle quay.io/operatorhubio/strimzi-kafka-operator:v0.31.1 -n testupgrade --skip-tls-verify
...
INFO[0119] OLM has successfully installed "strimzi-cluster-operator.v0.31.1"
operator-sdk run bundle-upgrade quay.io/operatorhubio/strimzi-kafka-operator:v0.32.0 -n testupgrade --skip-tls-verify
INFO[0001] Found existing subscription with name strimzi-cluster-operator-v0-31-1-sub and namespace testupgrade
INFO[0001] Found existing catalog source with name strimzi-kafka-operator-catalog and namespace testupgrade
INFO[0014] Generated a valid Upgraded File-Based Catalog
FATA[0014] Failed to run bundle upgrade: update catalog error: error creating registry: error building registry pod definition: configMap error: error updating ConfigMap: ConfigMap "operator-sdk-run-bundle-config" is invalid: []: Too long: must have at most 1048576 bytes
@J0zi This issue specifically seems to be related to the bundle being too large to fit into a ConfigMap. With the new FBC format we have made some changes to operator-sdk run bundle-upgrade that attempt to mount the FBC for the upgrade as a ConfigMap.
This is something that we have seen before. IIRC we did implement a change that helps alleviate this slightly but it isn't perfect and is still prone to this problem.
Would you mind opening a new issue with this problem so we can track it separately and have it show up in our next issue triage meeting? This will help it get some more visibility and allow us to have some further discussion on how we can attempt to resolve this.
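For a quick pre-check in a pipeline, something like the following Python sketch could flag catalogs that will not fit. The 1048576-byte limit comes from the error message above; the helper names are hypothetical, the check ignores the rest of the ConfigMap object, and the gzip measurement is only shown as one possible mitigation to evaluate, not as what operator-sdk does.

```python
import gzip

CONFIGMAP_MAX_BYTES = 1048576  # limit quoted in the error message


def fits_in_configmap(fbc_text):
    """Return True if the rendered catalog alone stays within the ConfigMap size limit."""
    return len(fbc_text.encode("utf-8")) <= CONFIGMAP_MAX_BYTES


def gzipped_size(fbc_text):
    """Return the size in bytes of the catalog after gzip compression."""
    return len(gzip.compress(fbc_text.encode("utf-8")))


# Synthetic stand-in for a large rendered catalog; real FBC content compresses differently.
large_fbc = "schema: olm.bundle\n" * 100000
print(fits_in_configmap(large_fbc), gzipped_size(large_fbc))
```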
@everettraven we will continue here https://github.com/operator-framework/operator-sdk/issues/6144