
[v0.10] Backport of Updating Ports with correctDrift enabled using multiple-paths repo triggers an error

Open weyfonk opened this issue 1 year ago • 1 comment

This is a backport of #2609 to v0.10.

weyfonk avatar Sep 11 '24 15:09 weyfonk

(copied from #2609)

Additional QA

Problem

When failing to correct drift on a resource (e.g. a modified ports array on a Service), Fleet would leave the GitRepo in a Modified state, with no error in the corresponding bundle deployment status.
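As an illustration of the scenario (a hypothetical manifest, not the actual rancher/fleet-test-data chart), drift here means the live Service's ports diverging from what the bundle declares:

```yaml
# Hypothetical Service; the real manifest lives in rancher/fleet-test-data.
apiVersion: v1
kind: Service
metadata:
  name: mp-app-service
spec:
  selector:
    app: mp-app
  ports:
    - name: required-name
      port: 80
      targetPort: 8080
    # A manual edit to this array (a changed port, a renamed entry, etc.)
    # creates drift that correctDrift then tries to patch away.
```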

Solution

  • The drift correction error is now reflected in the bundle deployment status, which should in turn propagate it to the GitRepo status.
  • Setting force: true on the GitRepo resolves the error by deleting and recreating the Helm release for the bundle deployment, thereby recreating the Service in this case.
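For reference, a minimal GitRepo sketch with forced drift correction enabled (the resource name, namespace and branch defaults are illustrative):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: drift-demo
  namespace: fleet-default
spec:
  repo: https://github.com/rancher/fleet-test-data
  paths:
    - multiple-paths
  correctDrift:
    enabled: true
    force: true   # deletes and recreates the Helm release on drift
```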

Testing

See reproduction steps above, in the issue description.

Engineering Testing

Manual Testing

  1. Created a GitRepo with drift correction enabled (but not set to force mode) pointing to rancher/fleet-test-data's multiple-paths
  2. Edited the created service ports
  3. Checked status of the GitRepo and bundle deployment
  4. Set the GitRepo's drift correction force option to true
  5. Saw the GitRepo and bundle deployment status error disappear once the service had been recreated
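Step 4 above can be applied as a merge patch; a sketch of the patch body follows (the target GitRepo name and namespace are illustrative, e.g. via `kubectl patch gitrepo drift-demo -n fleet-default --type merge --patch-file patch.yaml`):

```yaml
# patch.yaml — flips the existing GitRepo into forced drift correction
spec:
  correctDrift:
    enabled: true
    force: true
```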

Automated Testing

  • Integration tests cover propagation of the drift correction error to the bundle deployment status
  • End-to-end tests verify that when a bundle is marked as modified, patching the GitRepo to set its correctDrift.force option to true eventually updates the bundle status so that the bundle no longer appears as modified.

QA Testing Considerations

  • Test how the GitRepo status is reflected in the Rancher UI

Regression Considerations

N/A

weyfonk avatar Sep 16 '24 15:09 weyfonk

I am still observing this in Rancher v2.9-2d10b66bb2e1e5fc7568591ba41648002cf29b20-head with fleet:104.1.0+up0.10.4-rc.1, following the reproduction steps on the original ticket.

@weyfonk, am I perhaps missing something, or has this fix not been propagated into the above-mentioned Fleet version?


Fleet reconciliation error log
{
  "level": "error",
  "ts": "2024-10-04T14:28:04Z",
  "logger": "bundledeployment",
  "msg": "Failed to deploy bundle",
  "controller": "bundledeployment",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "BundleDeployment",
  "BundleDeployment": {
    "name": "ds-cluster-correct-79-multiple-paths-service",
    "namespace": "cluster-fleet-default-imported-0-f1158df57e01"
  },
  "namespace": "cluster-fleet-default-imported-0-f1158df57e01",
  "name": "ds-cluster-correct-79-multiple-paths-service",
  "reconcileID": "0b51e68c-b581-44cb-8a9d-4a00bc09c8ae",
  "status": {
    "conditions": [
      {
        "type": "Installed",
        "status": "True",
        "lastUpdateTime": "2024-10-04T14:09:56Z"
      },
      {
        "type": "Deployed",
        "status": "False",
        "lastUpdateTime": "2024-10-04T14:11:18Z",
        "reason": "Error",
        "message": "cannot patch \"mp-app-service\" with kind Service: Service \"mp-app-service\" is invalid: spec.ports[1].name: Duplicate value: \"required-name2\""
      },
      {
        "type": "Ready",
        "status": "True",
        "lastUpdateTime": "2024-10-04T14:10:04Z"
      },
      {
        "type": "Monitored",
        "status": "True",
        "lastUpdateTime": "2024-10-04T14:09:56Z"
      }
    ],
    "appliedDeploymentID": "s-e900fb60b86d8593e95a733a0c0d1794f2d71a00910f794d19bcd4d57deca:aa73273923fd2b194b95dc51be330a7b1be92dafa689e0afb400abda8b37d8c0",
    "release": "test-fleet-mp-service/ds-cluster-correct-79-multiple-paths-service:1",
    "ready": true,
    "nonModified": true,
    "display": {
      "deployed": "Error: cannot patch \"mp-app-service\" with kind Service: Service \"mp-app-service\" is invalid: spec.ports[1].name: Duplicate value: \"required-name2\"",
      "monitored": "True",
      "state": "ErrApplied"
    },
    "syncGeneration": 0
  },
  "error": "cannot patch \"mp-app-service\" with kind Service: Service \"mp-app-service\" is invalid: spec.ports[1].name: Duplicate value: \"required-name2\"",
  "errorVerbose": "cannot patch \"mp-app-service\" with kind Service: Service \"mp-app-service\" is invalid: spec.ports[1].name: Duplicate value: \"required-name2\"\nhelm.sh/helm/v3/pkg/kube.(*Client).Update\n\t/home/runner/go/pkg/mod/github.com/rancher/helm/[email protected]/pkg/kube/client.go:438\nhelm.sh/helm/v3/pkg/action.(*Install).performInstall\n\t/home/runner/go/pkg/mod/github.com/rancher/helm/[email protected]/pkg/action/install.go:456\nhelm.sh/helm/v3/pkg/action.(*Install).performInstallCtx.func1\n\t/home/runner/go/pkg/mod/github.com/rancher/helm/[email protected]/pkg/action/install.go:421\nruntime.goexit\n\t/home/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695",
  "stacktrace": "github.com/rancher/fleet/internal/cmd/agent/controller.(*BundleDeploymentReconciler).Reconcile\n\t/home/runner/work/fleet/fleet/internal/cmd/agent/controller/bundledeployment_controller.go:131\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222"
}
{
  "level": "error",
  "ts": "2024-10-04T14:28:04Z",
  "msg": "Reconciler error",
  "controller": "bundledeployment",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "BundleDeployment",
  "BundleDeployment": {
    "name": "ds-cluster-correct-79-multiple-paths-service",
    "namespace": "cluster-fleet-default-imported-0-f1158df57e01"
  },
  "namespace": "cluster-fleet-default-imported-0-f1158df57e01",
  "name": "ds-cluster-correct-79-multiple-paths-service",
  "reconcileID": "0b51e68c-b581-44cb-8a9d-4a00bc09c8ae",
  "error": "failed deploying bundle: cannot patch \"mp-app-service\" with kind Service: Service \"mp-app-service\" is invalid: spec.ports[1].name: Duplicate value: \"required-name2\"",
  "errorCauses": [
    {
      "error": "failed deploying bundle: cannot patch \"mp-app-service\" with kind Service: Service \"mp-app-service\" is invalid: spec.ports[1].name: Duplicate value: \"required-name2\""
    }
  ],
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222"
}

mmartin24 avatar Oct 04 '24 14:10 mmartin24

Thanks @mmartin24 for raising this.

I am still able to reproduce issues with updating service ports on Rancher v2.9.3-alpha4 with Fleet v0.10.4-rc.1. After a few seconds, although the bundle deployment containing the service appears as modified, the corresponding bundle sees its status updated to Ready, as if the bundle deployment were ready too. This in turn is reflected in the GitRepo owning that bundle. This happens because the resources field is cleared from the bundle deployment's status. Why that happens is still unclear, although this code is a prime suspect.

weyfonk avatar Oct 11 '24 08:10 weyfonk

Confirmed: this line calls a DryRun on a Wrangler apply.Apply, which returns an empty set of objects. In turn, that set is used to populate resources in the bundle deployment status, which explains why those resources don't appear in the status from that point onwards.

Fixing this would require one of the following:

  • troubleshooting this call to Wrangler code
  • rewriting the logic where Fleet truncates the list of resources in the bundle deployment status before re-populating it
  • skipping that truncation logic when there is no data to re-populate the list with
    • Edit: I have tried this, and while resources are no longer cleared in the bundle deployment status, its nonModified field is set to true. I thought that this might be caused by a conflict between the Fleet controller and agent over bundle deployment status updates, but that has been fixed in v0.10 as well.
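The third option can be sketched as follows. This is a minimal Python sketch of the idea only (Fleet itself is written in Go, and the helper name is hypothetical); per the edit above, this alone was not sufficient, since the nonModified field still ended up set to true:

```python
def update_status_resources(status: dict, dry_run_result: list) -> dict:
    """Sketch of the attempted fix: only truncate and re-populate the
    bundle deployment status's resources list when the dry run actually
    returned objects; otherwise keep the previously reported list."""
    if not dry_run_result:
        # The dry run returned an empty set (the observed Wrangler
        # behaviour): skip truncation so previously reported resources
        # are not lost from the status.
        return status
    status["resources"] = list(dry_run_result)
    return status

# Previously reported resources survive an empty dry-run result:
prev = {"resources": [{"kind": "Service", "name": "mp-app-service"}]}
assert update_status_resources(prev, [])["resources"]
```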

weyfonk avatar Oct 28 '24 15:10 weyfonk

Tested in v2.9-1d1065cd5bf09c23834720420e1153712fa43439-head with fleet:104.1.1+up0.10.5-rc.2 and still erroring. Same issue as in https://github.com/rancher/fleet/issues/2609#issuecomment-2454409377.

Setting this back to the backlog.

mmartin24 avatar Nov 04 '24 11:11 mmartin24

This was never in an RC.

manno avatar Nov 15 '24 13:11 manno

Rechecked in 2.9 with fleet:104.1.1+up0.10.5-rc.2 and still occurring. Nevertheless, agreed to close as it is a corner case (same as in https://github.com/rancher/fleet/issues/2609#issuecomment-2504051729).

mmartin24 avatar Nov 27 '24 14:11 mmartin24