HelmRelease file not updated

Open derrickburns opened this issue 4 years ago • 10 comments

@stefanprodan As you know, I have been using your tools for a long time. My general experience is that the tools are rock solid.

Today a colleague reported that an image was not being updated. I inspected our cluster. The Image Automation controller had been running for 4 days. It had made updates as recently as last night. There were no error messages in the logs. In other words, from the perspective of the Image Automation controller, everything was running fine.

I looked for an ImagePolicy for the image in question. I found one. The policy had been updated to the new image hours earlier. The annotation to instruct the Image Automation controller was attached to a HelmRelease. That file had not been modified with the new image tag. However, that file had been updated by the image automation controller in the last 4 days.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: mock-metadata-service
spec:
  interval: 5m
  chart:
    spec:
      chart: .
      version: "0.0.4"
      sourceRef:
        kind: GitRepository
        name: mock-metadata-service
        namespace: shared
      interval: 1m
  values:
    name: mock-metadata-service
    imagePullSecrets:
      - name: ecr-credentials-sync
    images:
      tag: 21.0811.1547.38-70e9909c-master # {"$imagepolicy": "shared:mock-metadata-service:tag"}

I restarted the Image Automation controller, and it quickly updated the GitLab repo to reflect the proper version of the image tag.

So, I am left to conclude that either an error occurred that was not properly logged or there is another bug.

derrickburns avatar Aug 11 '21 16:08 derrickburns

Could this be related to https://github.com/fluxcd/image-automation-controller/issues/209? Have you seen any timeout logs?

stefanprodan avatar Aug 12 '21 08:08 stefanprodan

@stefanprodan I think this just happened again. Below is the data I was able to collect.

derrickburns avatar Aug 12 '21 22:08 derrickburns

I see no timeout in the logs.

Here is the complete log: image-automation-log.txt

Here is the current date: date.txt

Here is the image policy: imagepolicy.txt

derrickburns avatar Aug 12 '21 22:08 derrickburns

Pod description showing the image that is deployed:

mock-metadata-service.txt

derrickburns avatar Aug 12 '21 22:08 derrickburns

Here is the source of the HelmRelease:

kind: HelmRelease
metadata:
  name: mock-metadata-service
spec:
  interval: 5m
  chart:
    spec:
      chart: .
      version: "0.0.4"
      sourceRef:
        kind: GitRepository
        name: mock-metadata-service
        namespace: shared
      interval: 1m
  values:
    name: mock-metadata-service
    imagePullSecrets:
      - name: ecr-credentials-sync
    images:
      tag: 21.0810.1833.45-01134759-master # {"$imagepolicy": "shared:mock-metadata-service:tag"}

Here is the in-cluster representation of the HelmRelease:

helmrelease.txt

derrickburns avatar Aug 12 '21 22:08 derrickburns

Here is the image automation pod:

image-automation-pod.txt

derrickburns avatar Aug 12 '21 22:08 derrickburns

What would happen if the controller ran on a node that was very low on memory? Could that cause this failure?

derrickburns avatar Aug 17 '21 02:08 derrickburns

:wave: @derrickburns

If you set --log-level=debug on the controller deployment, the controller (in recent versions) will record much more about why it does or doesn't make any update. That might reveal if there's some subtle, or mistaken, reason it declines to commit the change you expected.
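
A minimal sketch of one way to set that flag, assuming Flux was installed with flux bootstrap and that the generated flux-system/kustomization.yaml can be edited (the file path, the resource file names, and the container index 0 are assumptions, not details from this thread):

```yaml
# flux-system/kustomization.yaml (assumed location from a flux bootstrap install)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: image-automation-controller
    # JSON 6902 patch: append the flag to the first container's args.
    # Assumes the controller is the container at index 0.
    patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --log-level=debug
```

If the Deployment already passes a --log-level argument, check the rendered manifest to confirm which value ends up in effect; replacing the existing argument instead of appending is the safer option when in doubt.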

squaremo avatar Sep 15 '21 13:09 squaremo

Hi, we are experiencing similar problems. After working seamlessly for quite some time, applications stop being updated automatically by Flux. A new image is detected, but no changes are committed to the workload repo and the application is not updated on the cluster.

After a forced restart of the image-automation-controller, the changes are committed and pushed to the repo, and the application is updated.

We've found nothing in the logs that tells us anything about the cause or about the problem itself.

bondido avatar Dec 13 '21 12:12 bondido

Version v0.21.0 of the image-automation controller introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.

The experimental transport is opt-in: enable it by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller's Deployment.

This will require a redeploy of all components, so I would recommend doing it via flux bootstrap using flux CLI version v0.28.0, which will be released tomorrow.
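
As a rough illustration of the opt-in, the environment variable could be set with a kustomize patch in the bootstrap repository. This is a sketch under assumptions: the flux-system/kustomization.yaml path, the resource file names, and the container name manager are not stated in this thread and should be checked against the actual Deployment.

```yaml
# flux-system/kustomization.yaml (assumed location from a flux bootstrap install)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: image-automation-controller
    # Strategic merge patch: adds the env var to the controller container.
    # The container name "manager" is an assumption; verify it in your cluster.
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: image-automation-controller
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: EXPERIMENTAL_GIT_TRANSPORT
                    value: "true"
```

The value is quoted because Kubernetes requires env values to be strings; an unquoted true would be parsed as a boolean and rejected.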

Could you please test again with the experimental transport enabled and let us know how you get on?

pjbgf avatar Mar 22 '22 17:03 pjbgf

Closing this issue due to inactivity, but happy to reopen it if the problem recurs while using the latest versions of the image automation controller.

pjbgf avatar Sep 03 '22 19:09 pjbgf