argo-rollouts icon indicating copy to clipboard operation
argo-rollouts copied to clipboard

experiments: recovered from panic, on v1.4

Open Lykathia opened this issue 1 year ago • 10 comments

When running an experiment, a number of nil pointer exceptions are thrown - before the experiment eventually passes and moseys on its merry way.

We use istio in our environment.

It seems to resolve itself after a dozen+ failures.

To Reproduce

spec:
  analysis:
    successfulRunHistoryLimit: 1
    unsuccessfulRunHistoryLimit: 1
  progressDeadlineAbort: true
  replicas: 2
  revisionHistoryLimit: 5
  strategy:
    canary:
      steps:
        - experiment:
            analyses:
              - args:
                  - name: api-root-url
                    value: >-
                      http://example-preview.default.svc.cluster.local:8080
                  - name: request-timeout-milliseconds
                    value: '1000'
                name: example-smoke-test-analysis
                requiredForCompletion: true
                templateName: example-smoke-test-analysis
            duration: 5m
            templates:
              - metadata:
                  labels:
                    app.kubernetes.io/name: example-preview
                name: preview
                replicas: 1
                selector:
                  matchLabels:
                    app.kubernetes.io/name: example-preview
                specRef: canary
        - setWeight: 5
        - pause:
            duration: 10m
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: example-mesh-destination
            stableSubsetName: stable
          virtualService:
            name: example-mesh-virtualservice
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-svc-deployment

Expected behavior

The logs to not raise errors of nil pointer exceptions

Version

1.4

Logs

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 325 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c1260, 0x3c0ff10})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:333 +0x1f7
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:176 +0x7c5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:56 +0x166
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc00198ac00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0000e48c0, {0x29d81b0, 0xc000806c00}, {0xc0023f39b0, 0x27})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:415 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e4220?, 0xc000522520}, {0x25881fc, 0x7}, 0xc0009a7e70, {0x29d81b0, 0xc000806c00}, 0xc0005812c0?, {0x2093520, 0xc001bdde70})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d81b0, 0xc000806c00}, {0x29e4220, 0xc000522520}, {0x25881fc, 0x7}, 0x0?, 0x0?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x29b80e0, 0xc00150ccf0}, 0x1, 0xc00017d0e0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7

Lykathia avatar Feb 19 '23 00:02 Lykathia

@Lykathia Can you try v1.4.1

zachaller avatar Feb 25 '23 15:02 zachaller

@zachaller yep, PR in - will get it reviewed and tested on Monday and report back. Thanks!

Lykathia avatar Feb 25 '23 18:02 Lykathia

1.4.1 doesn't not appear to have addressed the issue.

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 337 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c24a0, 0x3c11f30})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:333 +0x1f7
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:176 +0x7c5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:56 +0x166
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc0019b8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc000210700, {0x29d99f0, 0xc0002ea100}, {0xc002159dd0, 0x27})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:415 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e5a60?, 0xc000512c20}, {0x25899bc, 0x7}, 0xc000e4de70, {0x29d99f0, 0xc0002ea100}, 0xc000317080?, {0x2094760, 0xc002dc11d0})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d99f0, 0xc0002ea100}, {0x29e5a60, 0xc000512c20}, {0x25899bc, 0x7}, 0x2094760?, 0x29b0250?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x3332303622?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x3a22736c6562616c?, {0x29b9900, 0xc000d63590}, 0x1, 0xc00055d740)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x706d65742d646f70?, 0x3b9aca00, 0x0, 0x37?, 0x61746f6e6e61222c?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x6a6f72706f677261?, 0x636e79732f6f692e?, 0x223a22657661772d?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7

Lykathia avatar Feb 27 '23 15:02 Lykathia

@Lykathia I would love to see your rollout object with the status field intact if you could share that it would be awesome. I also have a question if you are trying to use a weight on your experiment which I should be able to gather from the sharing of your ro object.

zachaller avatar Feb 27 '23 17:02 zachaller

Sure thing! Some nouns edited and some annotations / labels removed

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: '10'
    rollout.argoproj.io/revision: '46'
    rollout.argoproj.io/workload-generation: '52'
  creationTimestamp: '2022-06-03T17:50:51Z'
  generation: 39
  labels:
    app.kubernetes.io/instance: production-eks-prod-examplens-example
  name: example-rollout-c89ac2e8
  namespace: examplens
  resourceVersion: '614921626'
  uid: 86e21dfc-7adf-40f5-a9de-b5d368b0a843
spec:
  analysis:
    successfulRunHistoryLimit: 1
    unsuccessfulRunHistoryLimit: 1
  progressDeadlineAbort: true
  replicas: 2
  restartAt: '2022-11-09T19:40:00Z'
  revisionHistoryLimit: 5
  strategy:
    canary:
      steps:
        - experiment:
            analyses:
              - args:
                  - name: api-root-url
                    value: >-
                      http://example-rollout-preview.examplens.svc.cluster.local:8080
                  - name: request-timeout-milliseconds
                    value: '1000'
                name: example-smoke-test-analysis
                requiredForCompletion: true
                templateName: example-smoke-test-analysis
            duration: 5m
            templates:
              - metadata:
                  labels:
                    app.kubernetes.io/name: example-rollout-preview
                name: preview
                replicas: 1
                selector:
                  matchLabels:
                    app.kubernetes.io/name: example-rollout-preview
                specRef: canary
        - setWeight: 5
        - pause:
            duration: 10m
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: example-mesh-destination-c833f030
            stableSubsetName: stable
          virtualService:
            name: example-mesh-virtualservice-c8e2aa97
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-svc-deployment-c8e30830
status:
  HPAReplicas: 2
  availableReplicas: 2
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 5ccdfd574c
        weight: 0
      stable:
        podTemplateHash: 5ccdfd574c
        weight: 100
  conditions:
    - lastTransitionTime: '2023-02-27T14:45:05Z'
      lastUpdateTime: '2023-02-27T14:45:05Z'
      message: Rollout has minimum availability
      reason: AvailableReason
      status: 'True'
      type: Available
    - lastTransitionTime: '2023-02-27T15:12:21Z'
      lastUpdateTime: '2023-02-27T15:12:21Z'
      message: Rollout is paused
      reason: RolloutPaused
      status: 'False'
      type: Paused
    - lastTransitionTime: '2023-02-27T15:12:32Z'
      lastUpdateTime: '2023-02-27T15:12:32Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'True'
      type: Completed
    - lastTransitionTime: '2023-02-27T15:13:02Z'
      lastUpdateTime: '2023-02-27T15:13:02Z'
      message: Rollout is healthy
      reason: RolloutHealthy
      status: 'True'
      type: Healthy
    - lastTransitionTime: '2023-02-27T15:12:21Z'
      lastUpdateTime: '2023-02-27T15:13:02Z'
      message: >-
        ReplicaSet "example-rollout-c89ac2e8-5ccdfd574c" has successfully
        progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  currentPodHash: 5ccdfd574c
  currentStepHash: 75dcbd8f68
  currentStepIndex: 3
  observedGeneration: '39'
  phase: Healthy
  readyReplicas: 2
  replicas: 2
  restartedAt: '2022-11-09T19:40:00Z'
  selector: app.kubernetes.io/name=example-svc-c8333369
  stableRS: 5ccdfd574c
  updatedReplicas: 2
  workloadObservedGeneration: '52'

everything seems to complete fine and as expected, just the logs are exploding w/ the NPEs.

Lykathia avatar Feb 27 '23 18:02 Lykathia

This issue is stale because it has awaiting-response label for 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 05 '23 02:03 github-actions[bot]

This issue is stale because it has been open 60 days with no activity.

github-actions[bot] avatar May 05 '23 02:05 github-actions[bot]

Maybe related https://github.com/argoproj/argo-rollouts/pull/2734

zachaller avatar May 05 '23 02:05 zachaller

This issue is stale because it has been open 60 days with no activity.

github-actions[bot] avatar Jul 05 '23 02:07 github-actions[bot]

I saw similar exceptions occurring when calculateWeightDestinationsFromExperiment was called. My guess was that Argo was calling this function to determine the weight the experiment should have and it was returning NULL or an unexpected object or something. Then Argo would try and adjust the Virtual Service weight to this value, and Istio's admission controller would reject the non-zero/positive integer value and Argo would fail to catch the exception gracefully.

I solved it by adding a weight: # to my experiment.

zimmertr avatar Jan 12 '24 00:01 zimmertr