
Race between kubeadm and containerd service in Flatcar

kopiczko opened this issue

What steps did you take and what happened: When booting a cluster with Flatcar Linux (observed in CAPO), there can be a race between kubeadm.service and containerd.service:

ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

What did you expect to happen: containerd running before kubeadm starts.

Anything else you would like to add: Adding this to KubeadmConfigTemplate fixes it:

      format: ignition
      ignition:
        containerLinuxConfig:
          additionalConfig: |
            systemd:
              units:
              - name: kubeadm.service
                enabled: true
                dropins:
                - name: 10-flatcar.conf
                  contents: |
                    [Unit]
                    Requires=containerd.service
                    After=containerd.service
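
For reference, this snippet should render on the node as a regular systemd drop-in, roughly as follows (path per the usual systemd override convention; a sketch, not captured from a live machine):

# Rendered drop-in (sketch): /etc/systemd/system/kubeadm.service.d/10-flatcar.conf
[Unit]
Requires=containerd.service
After=containerd.service

Requires= makes kubeadm.service fail if containerd.service cannot be started, while After= defers kubeadm.service until containerd.service has finished starting; together they remove the race.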


/kind bug

/cc @invidian @pothos @dongsupark @johananl

kopiczko · Jul 22 '22 16:07

This is fixed in the latest Flatcar release where containerd is enabled by default: https://www.flatcar.org/releases#release-3227.2.0

Edit: I initially thought this was only about containerd being enabled, but it seems the After= ordering matters as well.

pothos · Jul 22 '22 16:07

I recall we've been discussing this with @jepio somewhere. Let me dig that up.

invidian · Jul 22 '22 16:07

> I recall we've been discussing this with @jepio somewhere. Let me dig that up.

Hmm, I can't find it, but the point was that kubeadm.service is provided by CABPK, while image-builder's CAPI images default to containerd. So image-builder should make no assumptions about which bootstrap provider will be used, and CABPK should make no assumptions that containerd will be used. The binding mentioned in this issue as a workaround belongs in the cluster templates, where the image reference is bound to the selected bootstrap provider. This is already done in the platform templates, e.g. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/templates/cluster-template-flatcar.yaml and https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/1729.

invidian · Jul 22 '22 17:07

The fix above no longer works.

The error states: unable to translate from 2.x to 3.x config has duplicate unit name kubeadm.service

ignition:
  containerLinuxConfig:
    additionalConfig: |
      storage:
        links:
        - path: /etc/systemd/system/kubeadm.service.wants/containerd.service
          target: /usr/lib/systemd/system/containerd.service

This worked instead.
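
For context, that storage.links entry creates the symlink /etc/systemd/system/kubeadm.service.wants/containerd.service, i.e. the on-disk equivalent of a Wants=containerd.service dependency, without declaring a second kubeadm.service unit entry on top of the one CABPK already provides (which is presumably what triggered the duplicate unit name error). Note that Wants= pulls containerd into the startup transaction but does not by itself order the two units. A quick way to inspect the result on a node:

# Prints the unit's Wants= and After= dependencies as systemd sees them
systemctl show kubeadm.service --property=Wants,After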

Jeremy-Boyle · Oct 04 '22 17:10

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 02 '23 18:01


/lifecycle rotten

k8s-triage-robot · Feb 01 '23 19:02

/remove-lifecycle rotten

johananl · Mar 01 '23 14:03


/lifecycle stale

k8s-triage-robot · May 30 '23 14:05

/remove-lifecycle stale

johananl · May 30 '23 21:05

I'll take this. I'll try to reproduce using current Flatcar and CAPI versions and check whether a fix is needed.

/assign

johananl · Jul 24 '23 15:07

I can't reproduce the issue (I tried on AWS, as I don't have an OpenStack environment available). I would imagine that if this issue exists, we should see it on all providers, since the race is infrastructure-agnostic.

Looks like a workaround has already been implemented for CAPO: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/47b3e15e91e40baadf2af676806a7d9fe1f005ef/templates/cluster-template-flatcar.yaml#L33-L34

Do you think we need to do anything else here @kopiczko or should we close this issue? If so, what changes would you suggest?

We can't do the change in CABPK because we want the same bootstrap process to be usable on distros which may use Ignition but don't use containerd. It may be worth thinking about a way to set distro-specific customizations such as this workaround project-wide, i.e. without having to add the same workaround to multiple templates across multiple infra provider repos. Something to think about in the context of https://github.com/kubernetes-sigs/cluster-api/issues/5294.

FWIW, I think it should be fine to include Requires= and After= directives in all cluster templates, since kubeadm.service indeed depends on the container runtime. I'm just trying to figure out whether it's necessary and worth the effort, given that the issue seems hard to reproduce and that we'd have to change things in all providers.

In case we do want to add the workaround everywhere, here are the places where we may have to make changes:

CAPA

  • https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/76e369a2692bdffadfb8218e394e364b053806fd/templates/cluster-template-flatcar.yaml#L67
  • https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/76e369a2692bdffadfb8218e394e364b053806fd/test/e2e/data/infrastructure-aws/withoutclusterclass/kustomize_sources/ignition/patches/worker-ignition.yaml#L19
  • https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/76e369a2692bdffadfb8218e394e364b053806fd/test/e2e/data/infrastructure-aws/withoutclusterclass/kustomize_sources/ignition/patches/control-plane-ignition.yaml#L23
  • https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/76e369a2692bdffadfb8218e394e364b053806fd/test/e2e/data/infrastructure-aws/withoutclusterclass/e2e_test_templates/cluster-template-ignition.yaml#L58

CAPZ

  • https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/79b63b505822b78c6f40f73a8ab09710a3385410/templates/cluster-template-flatcar.yaml#L88
  • https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/79b63b505822b78c6f40f73a8ab09710a3385410/templates/flavors/flatcar/machine-deployment.yaml#L64
  • https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/79b63b505822b78c6f40f73a8ab09710a3385410/templates/flavors/flatcar/patches/kubeadm-controlplane.yaml#L25
  • https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/79b63b505822b78c6f40f73a8ab09710a3385410/templates/test/ci/cluster-template-prow-flatcar.yaml#L93

CAPV

  • https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/ed46a4740628168d26f07c0a90150cd607d9e7f9/packaging/flavorgen/flavors/generators.go#L84
  • https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/ed46a4740628168d26f07c0a90150cd607d9e7f9/templates/cluster-template-ignition.yaml#L183
  • https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/ed46a4740628168d26f07c0a90150cd607d9e7f9/test/e2e/data/infrastructure-vsphere/main/ignition/ignition.yaml#L116

johananl · Sep 26 '23 13:09

> We can't do the change in CABPK because we want the same bootstrap process to be usable on distros which may use Ignition but don't use containerd.

I see. We don't run CAPO ATM so I can't tell now. Fine to close from my POV.

kopiczko · Sep 27 '23 09:09

Thanks @kopiczko.

@invidian @pothos @jepio - do any of you think we should implement the workaround anyway as a precaution? If not, let's close this.

johananl · Sep 27 '23 11:09

If we just include the After=containerd.service part, then that shouldn't interfere with distros that use Ignition + cri-o. So I think that would be safe to do and would allow us to close the issue.
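
A minimal sketch of that After=-only variant, reusing the additionalConfig shape from the original report (the drop-in name is illustrative):

ignition:
  containerLinuxConfig:
    additionalConfig: |
      systemd:
        units:
        - name: kubeadm.service
          dropins:
          # Drop-in name is illustrative
          - name: 10-flatcar.conf
            contents: |
              [Unit]
              After=containerd.service

Since After= only affects ordering, it is ignored on systems where containerd.service does not exist, so it is safe for cri-o based distros.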

jepio · Sep 27 '23 14:09

SGTM.

Tracking provider PRs here:

  • [x] CAPA - https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4529
  • [x] CAPZ - https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/4049
  • [x] CAPV - https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/2405

johananl · Sep 28 '23 09:09

Workaround is merged for all relevant providers.

/close

johananl · Oct 09 '23 08:10

@johananl: Closing this issue.

In response to this:

> Workaround is merged for all relevant providers.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Oct 09 '23 08:10

We also changed the (upstream) Flatcar containerd unit to use Type=notify to prevent any races.
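
For context, Type=notify means systemd considers the service started only once the daemon signals readiness via sd_notify(READY=1), so anything ordered After=containerd.service now waits for a ready containerd rather than merely a forked process. The relevant lines look roughly like this (a sketch, not the exact Flatcar unit file):

# Sketch of a notify-type service; the real Flatcar unit has more settings
[Service]
Type=notify
ExecStart=/usr/bin/containerd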

pothos · Oct 09 '23 09:10