controller
Reduce the number of times we tag an image
According to @lavalamp, Workflow clusters in GKE are triggering this bug. This issue occurs because Workflow will re-tag an image even if no changes were made. We used to do this because we would inject environment metadata into the image when deis config:set was called. Now, we just inject that environment metadata directly into the k8s manifest, so no image modification (and therefore no image re-tagging) is required.
While the upstream bug will eventually get fixed, in the meantime we should investigate how we trigger this bug, and see if there is a way to mitigate this issue.
Doing kubectl get node -o yaml on one of the offending clusters shows a long list of re-tagged app images:
- names:
  - gcr.io/google_containers/pause:2.0
  sizeBytes: 350164
- names:
  - localhost:5555/test-57368379:v3
  - localhost:5555/test-225171282:v3
  - localhost:5555/test-349910451:v2
  - localhost:5555/test-765152779:v2
  - localhost:5555/test-531894577:v2
  - localhost:5555/test-531894577:v3
  - localhost:5555/test-957102600:v2
  - localhost:5555/test-957102600:v3
  - localhost:5555/test-101875602:v2
  - localhost:5555/test-736008182:v2
  - localhost:5555/test-312165996:v2
  - localhost:5555/test-312165996:v3
  - localhost:5555/test-966490384:v2
  - localhost:5555/test-92413502:v2
  - localhost:5555/test-953282707:v2
  - localhost:5555/test-578673730:v4
  - localhost:5555/test-846984163:v2
  - localhost:5555/test-19736339:v3
  - localhost:5555/test-539512357:v2
  - localhost:5555/test-316118051:v2
  - localhost:5555/test-457125150:v2
  - localhost:5555/test-270332264:v2
  - localhost:5555/test-270332264:v3
  - localhost:5555/test-84933414:v2
  - localhost:5555/test-84933414:v3
  - localhost:5555/test-241757411:v2
  - localhost:5555/test-241757411:v3
  - localhost:5555/test-778712360:v2
  - localhost:5555/test-778712360:v3
  - localhost:5555/test-778712360:v4
  - localhost:5555/test-848623511:v3
  - localhost:5555/test-159686861:v2
  ... [couple thousand lines omitted]
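To see how many tags each app has piled up on a node, something like the following could help. It is only a rough sketch: it assumes the node object has been dumped as JSON rather than YAML and loaded into a dict, and the helper names `split_image_name` and `tags_by_repo` are made up for illustration.

```python
from collections import defaultdict


def split_image_name(name):
    """Split 'registry:port/repo:tag' into (repo, tag); tag is None if absent."""
    if "@" in name:
        # Digest reference such as repo@sha256:...; there is no tag to report.
        return name.split("@", 1)[0], None
    head, sep, tail = name.rpartition(":")
    # A colon followed by a path segment is a registry port, not a tag.
    if not sep or "/" in tail:
        return name, None
    return head, tail


def tags_by_repo(node):
    """Map each image repository listed on a node to its sorted tags."""
    result = defaultdict(set)
    for image in node.get("status", {}).get("images", []):
        for name in image.get("names") or []:
            repo, tag = split_image_name(name)
            if tag is not None:
                result[repo].add(tag)
    return {repo: sorted(tags) for repo, tags in result.items()}
```

Feeding it the parsed output of kubectl get node <name> -o json would show which test apps carry more than one vN tag.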
So I'd argue this isn't an immediate problem with Deis Workflow itself; rather it's an artifact of CI running many times and re-tagging similar images.
One approach to try here is garbage-collecting old images and tags when e2e runs. If that works out, we could investigate doing something similar within Workflow itself, since it is conceivable a similar problem could happen through normal usage over long periods of time.
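A minimal sketch of the selection step for that kind of garbage collection, assuming tags follow the vN pattern seen in the node output. The function names and the `keep_last` parameter are hypothetical, and actually deleting the selected tags from the node or registry is left out.

```python
def version_number(tag):
    """Return the integer in a 'vN' tag, or None for any other tag format."""
    if tag.startswith("v") and tag[1:].isdecimal():
        return int(tag[1:])
    return None


def tags_to_collect(tags, keep_last=2):
    """Return the 'vN' tags that fall outside the newest `keep_last` releases.

    Tags that do not look like 'vN' are never selected.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    versioned = sorted(
        (t for t in tags if version_number(t) is not None),
        key=version_number,
    )
    if keep_last == 0:
        return versioned
    return versioned[:-keep_last]
```

Keeping a few recent versions per app, rather than collecting everything, leaves some room for rollbacks to recent releases.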
We could be smarter in Workflow about when we tag a new image: check whether the pulled-down image matches the last available one and, if it does, skip the increment. This changes the Release::image logic quite a bit, though.
It also only reduces the problem; it doesn't do any sort of cleanup.
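Roughly, the check could look like the sketch below. This is not the actual Release::image code: `next_image_version` is a hypothetical helper, and it assumes we can get a content identifier (such as a digest) for both the previously tagged image and the freshly pulled one.

```python
def next_image_version(last_version, last_image_id, pulled_image_id):
    """Decide whether a pulled image needs a new tag.

    Returns (version, retag). The version only moves forward when the pulled
    image differs from the one tagged last; otherwise the existing tag is
    reused and no re-tag is needed.
    """
    if last_image_id is not None and pulled_image_id == last_image_id:
        return last_version, False
    return last_version + 1, True
```

With this, repeated deploys of an unchanged image would keep pointing at the same tag instead of producing v3, v4 and so on.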
If we started doing some GC of images, that means that rolling back to releases where the image has been GC'ed is now out of bounds.
I think we fixed the original issue in deis/charts#352 so this ticket is very low priority at this point.
If we started doing some GC of images, that means that rolling back to releases where the image has been GC'ed is now out of bounds.
Isn't the image pulled into the local registry for the most part, though? Aside from private images using deis registry and deis pull, if we GC the local Docker cache on the worker nodes, the image should still be present in the local private registry, which the worker nodes can fetch from.
My train of thought is to stop tagging buildpack images as v2/v3/v4 as we make changes to the image through deis config:set, and instead push :git-sha as the tag on a buildpack push. That tag would never be updated until a new build is pushed to the registry via git push deis master.
In either case, this is still a micro-optimization, so it's really low priority and no fix is necessary in the immediate future. The main issue was what @mboersma mentioned: our e2e clusters weren't cleaning up app images from old e2e runs.
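As a sketch of that tagging scheme: the image name depends only on the app and the pushed commit, so config changes never produce a new tag. The function name, the default registry address (taken from the node output above), and the 8-character short SHA are all assumptions for illustration.

```python
def buildpack_image_name(app, git_sha, registry="localhost:5555"):
    """Build an image name tagged by commit rather than by release number."""
    if not git_sha:
        raise ValueError("git_sha is required")
    return f"{registry}/{app}:git-{git_sha[:8]}"
```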
Yeah, it should be in the local registry, good point.
I think the bigger issue would be deis pull rather than git push, as the latter has the SHA available, as you pointed out. Buildpack apps are already downloaded as slugs using the slugrunner image, and Dockerfile apps already use a :git-<sha> tag on images as it stands.
Just removing this from the milestone since the root issue was fixed in deis/charts#352, which was merged for v2.7.
This issue was moved to teamhephy/controller#50