
S3 config lost in openstack for worker nodes

Open pcmid opened this issue 2 years ago • 6 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

kops version
Client version: 1.27.0

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"archive", BuildDate:"2023-06-15T08:14:06Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.6", GitCommit:"11902a838028edef305dfe2f96be929bc4d114d8", GitTreeState:"clean", BuildDate:"2023-06-14T09:49:08Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using? openstack

4. What commands did you run? What is the simplest way to reproduce this issue?

S3_ENDPOINT=http://ceph_rgw_url S3_ACCESS_KEY_ID=XXX S3_SECRET_ACCESS_KEY=XXX kops --name name.k8s.local --state do://kops rolling-update cluster --yes

The --state parameter starts with do:// due to https://github.com/kubernetes/kops/issues/9926
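
For reference, the same invocation can also be written with the variables exported, using the KOPS_STATE_STORE environment variable in place of --state (placeholder values as above):

export KOPS_STATE_STORE=do://kops
export S3_ENDPOINT=http://ceph_rgw_url
export S3_ACCESS_KEY_ID=XXX
export S3_SECRET_ACCESS_KEY=XXX
kops rolling-update cluster --name name.k8s.local --yes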

5. What happened after the commands executed? Timed out waiting for the worker node to join the cluster.

6. What did you expect to happen? The worker node joins the cluster successfully.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-06-11T11:51:45Z"
  generation: 4
  name: name.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  certManager:
    defaultIssuer: letsencrypt-prod
    enabled: true
  channel: stable
  cloudConfig:
    openstack:
      blockStorage:
        bs-version: v3
        clusterName: name.k8s.local
        ignore-volume-az: false
      loadbalancer:
        floatingNetwork: floatingNetwork
        floatingNetworkID: floatingNetworkID
        method: ROUND_ROBIN
        provider: amphora
        useOctavia: true
      monitor:
        delay: 15s
        maxRetries: 3
        timeout: 10s
      router:
        externalNetwork: externalNetwork
  cloudControllerManager:
    clusterName: name.k8s.local
  cloudProvider: openstack
  configBase: do://kops/name.k8s.local
  containerd:
    configOverride: |
      version = 2
      [plugins."io.containerd.grpc.v1.cri"]
      sandbox_image = "registry.k8s.io/pause:3.6@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
      endpoint = [ "http://registry.mirrors.local" ]
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-nova-1
      name: etcd-1
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-2
      name: etcd-2
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-3
      name: etcd-3
      volumeType: __DEFAULT__
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-nova-1
      name: etcd-1
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-2
      name: etcd-2
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-3
      name: etcd-3
      volumeType: __DEFAULT__
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.26.6
  metricsServer:
    enabled: true
  networkCIDR: 10.100.0.0/16
  networking:
    flannel:
      backend: vxlan
  nodePortAccess:
  - 0.0.0.0/0
  nonMasqueradeCIDR: 100.64.0.0/10
  snapshotController:
    enabled: true
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  sshKeyName: sshKeyName
  subnets:
  - cidr: 10.100.32.0/19
    name: nova
    type: Private
    zone: nova
  - cidr: 10.100.0.0/22
    name: utility-nova
    type: Utility
    zone: nova
  topology:
    dns:
      type: Private
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

kops-configuration log

-- Logs begin at Thu 2023-07-20 11:28:34 UTC, end at Fri 2023-07-21 02:49:50 UTC. --
Jul 20 11:30:08 nodes-nova-asqybc systemd[1]: Starting Run kOps bootstrap (nodeup)...
Jul 20 11:30:08 nodes-nova-asqybc nodeup[1222]: nodeup version 1.27.0 (git-v1.27.0)
Jul 20 11:30:08 nodes-nova-asqybc nodeup[1222]: I0720 11:30:08.655647    1222 s3context.go:338] product_uuid is "c49d83ff-ceb1-43f5-bc9b-ca7cf67b9896", assuming not running on EC2
Jul 20 11:30:08 nodes-nova-asqybc nodeup[1222]: I0720 11:30:08.655709    1222 s3context.go:175] defaulting region to "us-east-1"
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: 2023/07/20 11:30:09 WARN: failed to get session token, falling back to IMDSv1: 404 Not Found: Not Found
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]:         status code: 404, request id:
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: caused by: EC2MetadataError: failed to make EC2Metadata request
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: 404 Not Found
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: The resource could not be found.
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]:
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]:         status code: 404, request id:
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: I0720 11:30:09.151317    1222 s3context.go:192] unable to get bucket location from region "us-east-1"; scanning all regions: NoCredentialProviders: no valid providers in chain
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: SharedCredsLoad: failed to load profile, .
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: EC2RoleRequestError: no EC2 instance role found

/etc/sysconfig/kops-configuration

When I checked this file, I found that the following settings were missing compared with a healthy node.

S3_ACCESS_KEY_ID=XXX
S3_ENDPOINT=http://ceph_rgw_url
S3_REGION=
S3_SECRET_ACCESS_KEY=XXX

cloud-init

When I checked the cloud-init user data on the node by running curl http://169.254.169.254/latest/user-data/, I found those environment variables were missing too.

pcmid avatar Jul 21 '23 03:07 pcmid

I added those S3 environment variables to /etc/sysconfig/kops-configuration manually, then restarted the kops-configuration service. nodeup worked fine and the node finally joined the cluster.
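
Roughly, the manual fix looked like this (just a sketch; the endpoint and keys are the same placeholders as above):

# append the missing state-store variables, then restart nodeup
cat >> /etc/sysconfig/kops-configuration <<'EOF'
S3_ENDPOINT=http://ceph_rgw_url
S3_REGION=
S3_ACCESS_KEY_ID=XXX
S3_SECRET_ACCESS_KEY=XXX
EOF
systemctl restart kops-configuration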

pcmid avatar Jul 21 '23 03:07 pcmid

As a workaround, could you try using --dns=none when creating the cluster?

hakman avatar Jul 21 '23 10:07 hakman

As a workaround, could you try using --dns=none when creating the cluster?

Thanks for the reply. I successfully created a new cluster with --dns=none. For an existing cluster, can I update the cluster spec to set the dns block like this and then roll it out as sketched below? One thing that still confuses me is why this problem is related to DNS.

topology:
  dns:
    type: None
  masters: private
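
The roll-out I have in mind is just the usual flow (a sketch, with the same state-store variables exported; I'm not sure this covers the whole migration):

kops edit cluster name.k8s.local             # set topology.dns.type: None
kops update cluster name.k8s.local --yes
kops rolling-update cluster name.k8s.local --yes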

pcmid avatar Jul 21 '23 18:07 pcmid

This was somehow lost as part of the mitigation for https://github.com/kubernetes/kops/issues/15539. See this comment for guidance on how to switch: https://github.com/kubernetes/kops/pull/15643#issuecomment-1637151077.

hakman avatar Jul 22 '23 06:07 hakman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 25 '24 09:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 24 '24 10:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 25 '24 10:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 25 '24 10:03 k8s-ci-robot