operator-sdk
operator-sdk copied to clipboard
Don't override relatedImages in config CSV.
Bug Report
Looks like https://github.com/operator-framework/operator-sdk/issues/5000 has reappeared after related image discovery features were added. This has caused us to change how we specify relatedImages in our operator https://github.com/openshift/oadp-operator/pull/694
Seems to be sideeffect from related image discovery feature. Not sure which was the first PR for that.
Related PRs: https://github.com/operator-framework/operator-sdk/pull/5567
What did you do?
operatorsdk generate bundle.
What did you expect to see?
relatedImages in config CSV preserved
What did you see instead? Under which circumstances?
relatedImages in config CSV removed from generated bundle
Environment
Operator type:
/language go
$ operator-sdk version
operator-sdk version: "v1.20.0", commit: "deb3531ae20a5805b7ee30b71f13792b80bd49b1", kubernetes version: "v1.23", go version: "go1.18.1", GOOS: "darwin", GOARCH: "amd64"
$ go version
(if language is Go)
go version go1.18.1 darwin/amd64
$ kubectl version
Possible Solution
Don't override but append to existing relatedImages.
Additional context
/kind bug
Same issue in v1.19.1
. Can confirm this issue does NOT happen in v1.9.0
.
Thanks @blindenvy for the investigation. Do you know if 1.9 is the latest that worked? or just what you tried.
Looks like my fix at #5765 maybe put on pause until they discuss use case further. If you could add a comment or two there that would be awesome.
So I confirmed that the related image discovery feature does overwrite the base CSV related images. This was an unintended breaking change that I did not notice when implementing that feature so was not documented correctly. The current official way to define related Images is with environment variables that start with RELATED_IMAGE_
in the manager.yaml
file. These will be discovered and added to the CSV spec then the application can access the image reference by reading RELATED_IMAGE_MYIMAGE
from the environment. If possible, anyone experiencing this issue should move their related image definitions to the manager.yaml
file or any other container environment that needs access to related images. More details are in the latest documentation.
Since this is a breaking change, we will be doing further investigation on a resolution. As I said in the PR comments, we will at the very least get a documentation change up with migration instructions. Ideally this should be unbroken, but figuring out how to do it correctly without breaking the de-dup logic will require further investigation/discussion, which we can use this issue for.
The root issue is that the de-dup logic handles the following kinds of duplicates:
- Same name and image: Just collapsed into one entry. This could happen if you have the same related image in two containers. Both containers could need the environment variable, but you only need one entry in the related images.
- Same name and different image: Prefix the image name with its source container/deployment. This happens if two containers use the same name to represent a different image. This one specifically breaks with the fix proposed in #5765 because related images defined in the baseCSV lose the container/deployment name.
- Different name and same image: We only want each image to be listed once so if the same image is referenced by multiple names we cannot determine what name to use so we notify the user with a warning message and use an empty name.
The path of least resistence to me seems to be writing entirely separate logic for bringing over related images from the base CSV and only adding them to the final CSV if the images don't exist. So that way if you have no related images defined in manager.yaml
it will copy them all over, if you have only related images defined in manager.yaml
it will continue to work, and if you have a mix, it will output the union of both sets.
The only outstanding question I have is what to do with conflicts across base CSV and manager environment. For example, if we have a base CSV with:
relatedImages:
- name: myImage
image: some-image:latest
and a manager.yaml with:
env:
- name: RELATED_IMAGE_NOTMYIMAGE
value: some-image:latest
What name would we want for some-image:latest
, maintain myImage
from the baseCSV, overwrite it with notmyimage
from manager.yaml
, or allow duplicate images in this one specific case?
Similar issue if the image
overlaps but the name is different. Do we empty string out the name? Do we also allow duplicates here?
My gut says we can view definitions in the base CSV as "overrides" so they always take precedent over what we discover. So for each related image in the base CSV:
- If a related image with the same name/image was discovered, do nothing. Its already there.
- If a related image with the same name was discovered, prefix the discovered name with its container name, leave the base CSV definition as is and add it to the final CSV.
- If a related image with the same image was discovered, just add it in. Maybe warn the user an image is used twice.
- If there's no collision, just append it.
What do you think?
CC: @joelanford
I don't mind either. I err on the preference of preferring notmyimage
from RELATED_IMAGE_
overriding the baseCSV since you're probably in the process of migrating and are aware that ENV RELAED_IMAGE_
affects your final relatedImage
.
Even if you don't know about RELATED_IMAGE_
in ENV, it would be hard to miss scrolling through CSV.
Otherwise leave baseCSV relatedImage
as is if not specified in ENV.
I think there's a good case for both. It really comes down to if we want to say definitions in the base CSV are overrides or a deprecated way to define relatd images that we still support. If its the former, the base CSV definitions should take precedence and vise versa for the latter.
overrides or a deprecated
I think sticking with overrides will be best. Which overrides which is up for discussion.
Deprecation should be introduced slowly.
By deprecated, I just meant that we don't officially recommend defining related images in the base CSV, not marking it for removal or anything.
What I'm asking is if we say that base CSV related images are treated as overrides then that becomes a happy path that we expect users to use when defining related images via env variables does not work for them. If we say that discovered related images are the official way to do it, but are backwards compatible with base CSV related images then we would want discoverd related images to override base CSV related images.
So it comes down to whether or not a situation exists where a user using the latest version would need to explicitly use base CSV related images instead of defining them in the environment. If such a situation exists then we should implement/document base CSV related images as overrides that the user should use in that situation. Otherwise, we should just add backwards compatibility so users who defined related images the old way are not interrupted until they switch to the new way.
I think we should specifically recommend only the env-based related image detection. If there are csv.spec.relatedImages
in the base CSV manifest, I think we should:
- Give them the lowest precedence when choosing names or de-duping.
- Output a warning along the lines of "we found related images in your base CSV. we preserved them in the generated CSV, but your operator is at risk for supporting disconnected environments"
So I think that logic would look like the following when comparing discovered vs base:
- same image, regardless of whether the names differ: use the discovered name/image pair in the output.
- same name with a different image: keep discovered name, prefix base name (with
static-
orbase-
or something along those lines). - name/image pair not found in discovered set: add base name/image pair to output.
I need @tlwu2013 to check my math on item 2, but it seems like you need a RELATED_IMAGE_* environment variable for each other image you expect to use so that when your bundle is built, the pipeline can correctly inject digest-based image references for every image your operator uses.
is at risk for supporting disconnected environments
How so? planned removal soon after?
@joelanford I am in favor of this approach in general. The only thing that I disagree with is giving precedence to discovered related image when its name conflicts with a base CSV related image with a different image. Since related image name/image collision detection is a new feature, applying it to the base CSV related image would break backwards compatibility. Take for example the following base CSV snippet:
relatedImages:
- name: myimage
value: memcached:latest
And this environment in the manager deployment in manager.yaml
:
env:
- name: RELATED_IMAGE_MYIMAGE
value: memcached:alpine.
If we run an old version without this feature, the output CSV will have only the base CSV's related image verbatim:
relatedImages:
- name: myimage
value: memcached:latest
Then if we run a new version with this feature, the output will have the base CSV's related image with a different name along with the discovered related image:
relatedImages:
- name: base-myimage
value: memcached:latest
- name: myimage
value: memcached:alpine
This would break anything that depends on that name value in the output CSV. This is one of a few ways to get the name of base CSV related image to be different between versions. In order to ensure total backwards compatiblity, the base CSV related images need to be preserved entirely so they should take precedence over discovered related images when resolving name conflicts.
Does this make sense to you?
@kaovilai
is at risk for supporting disconnected environments
Maybe what I'm thinking about is orthogonal. This is why I'm asking for folks to check my math. Anyways, here goes.
If you manually specify a relatedImage in the CSV, there's no guarantee that you've also correctly created a RELATED_IMAGE_
environment variable for it in your deployment. When a pipeline goes and builds your operand images, it needs to inform your controller what those images are by digest. The way it does that is via the RELATED_IMAGE_
environment variables.
Hence you might list a relatedImage in your CSV statically, but when your operator tries to use that image in a workload in a disconnected environment, that upstream image name might not work because there may not be an ICSP that maps it to the disconnected mirror.
@ryantking - Yeah makes sense to me. I hadn't considered backcompat issue with the name changing. In all likelihood, there's nothing that actually depends on that name since I know OLM and mirroring tooling ignore it. But there could be some operator author tooling that uses it.
So I'm plus +1 to that tweak, though I still think it makes sense to issue a warning about the statically defined related images. If we want to be more vague about why you shouldn't do that, we could just say that specifying static related images is deprecated, that you should start using RELATED_IMAGE_
envs, and we'll remove persistence of static related images is OSDK v2.
Hence you might list a relatedImage in your CSV statically
Only in upstream builds. for downstream @rayfordj @weshayutin has scripts that replace upstream CSV relatedImages with downstream image by digest.
Looks like we are in the process of moving towards RELATED_IMAGE_ long term as well per (RH-only) https://issues.redhat.com/browse/OADP-296
@joelanford Let me know if https://github.com/operator-framework/operator-sdk/pull/5765 is currently ok. I may have some confusion around below.
If you manually specify a relatedImage in the CSV, there's no guarantee that you've also correctly created a RELATED_IMAGE_ environment variable for it in your deployment. When a pipeline goes and builds your operand images, it needs to inform your controller what those images are by digest. The way it does that is via the RELATED_IMAGE_ environment variables.
Hence you might list a relatedImage in your CSV statically, but when your operator tries to use that image in a workload in a disconnected environment, that upstream image name might not work because there may not be an ICSP that maps it to the disconnected mirror.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen
.
If this issue is safe to close now please do so with /close
.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen
.
If this issue is safe to close now please do so with /close
.
/lifecycle rotten /remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen
.
Mark the issue as fresh by commenting /remove-lifecycle rotten
.
Exclude this issue from closing again by commenting /lifecycle frozen
.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting
/reopen
. Mark the issue as fresh by commenting/remove-lifecycle rotten
. Exclude this issue from closing again by commenting/lifecycle frozen
./close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten