S3 config lost in openstack for worker nodes
/kind bug
1. What kops version are you running? The command `kops version` will display this information.

```
kops version
Client version: 1.27.0
```
2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

```
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"archive", BuildDate:"2023-06-15T08:14:06Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.6", GitCommit:"11902a838028edef305dfe2f96be929bc4d114d8", GitTreeState:"clean", BuildDate:"2023-06-14T09:49:08Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}
```
3. What cloud provider are you using?
openstack
4. What commands did you run? What is the simplest way to reproduce this issue?

```sh
S3_ENDPOINT=http://ceph_rgw_url S3_ACCESS_KEY_ID=XXX S3_SECRET_ACCESS_KEY=XXX kops --name name.k8s.local --state do://kops rolling-update cluster --yes
```

The `--state` parameter starts with `do://` due to https://github.com/kubernetes/kops/issues/9926.
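Equivalently, with the credentials exported once up front (same redacted values):

```sh
# Same reproduce command with the S3 credentials exported first
# (endpoint and keys redacted).
export S3_ENDPOINT=http://ceph_rgw_url
export S3_ACCESS_KEY_ID=XXX
export S3_SECRET_ACCESS_KEY=XXX
kops --name name.k8s.local --state do://kops rolling-update cluster --yes
```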
5. What happened after the commands executed?

Timed out waiting for the worker node to join the cluster.
6. What did you expect to happen?

The worker node successfully joins the cluster.
7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-06-11T11:51:45Z"
  generation: 4
  name: name.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  certManager:
    defaultIssuer: letsencrypt-prod
    enabled: true
  channel: stable
  cloudConfig:
    openstack:
      blockStorage:
        bs-version: v3
        clusterName: name.k8s.local
        ignore-volume-az: false
      loadbalancer:
        floatingNetwork: floatingNetwork
        floatingNetworkID: floatingNetworkID
        method: ROUND_ROBIN
        provider: amphora
        useOctavia: true
      monitor:
        delay: 15s
        maxRetries: 3
        timeout: 10s
      router:
        externalNetwork: externalNetwork
  cloudControllerManager:
    clusterName: name.k8s.local
  cloudProvider: openstack
  configBase: do://kops/name.k8s.local
  containerd:
    configOverride: |
      version = 2
      [plugins."io.containerd.grpc.v1.cri"]
        sandbox_image = "registry.k8s.io/pause:3.6@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
        runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
        endpoint = [ "http://registry.mirrors.local" ]
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-nova-1
      name: etcd-1
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-2
      name: etcd-2
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-3
      name: etcd-3
      volumeType: __DEFAULT__
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-nova-1
      name: etcd-1
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-2
      name: etcd-2
      volumeType: __DEFAULT__
    - instanceGroup: control-plane-nova-3
      name: etcd-3
      volumeType: __DEFAULT__
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.26.6
  metricsServer:
    enabled: true
  networkCIDR: 10.100.0.0/16
  networking:
    flannel:
      backend: vxlan
  nodePortAccess:
  - 0.0.0.0/0
  nonMasqueradeCIDR: 100.64.0.0/10
  snapshotController:
    enabled: true
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  sshKeyName: sshKeyName
  subnets:
  - cidr: 10.100.32.0/19
    name: nova
    type: Private
    zone: nova
  - cidr: 10.100.0.0/22
    name: utility-nova
    type: Utility
    zone: nova
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
```
8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?
The `kops-configuration` log:

```
-- Logs begin at Thu 2023-07-20 11:28:34 UTC, end at Fri 2023-07-21 02:49:50 UTC. --
Jul 20 11:30:08 nodes-nova-asqybc systemd[1]: Starting Run kOps bootstrap (nodeup)...
Jul 20 11:30:08 nodes-nova-asqybc nodeup[1222]: nodeup version 1.27.0 (git-v1.27.0)
Jul 20 11:30:08 nodes-nova-asqybc nodeup[1222]: I0720 11:30:08.655647 1222 s3context.go:338] product_uuid is "c49d83ff-ceb1-43f5-bc9b-ca7cf67b9896", assuming not running on EC2
Jul 20 11:30:08 nodes-nova-asqybc nodeup[1222]: I0720 11:30:08.655709 1222 s3context.go:175] defaulting region to "us-east-1"
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: 2023/07/20 11:30:09 WARN: failed to get session token, falling back to IMDSv1: 404 Not Found: Not Found
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: status code: 404, request id:
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: caused by: EC2MetadataError: failed to make EC2Metadata request
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: 404 Not Found
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: The resource could not be found.
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]:
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: status code: 404, request id:
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: I0720 11:30:09.151317 1222 s3context.go:192] unable to get bucket location from region "us-east-1"; scanning all regions: NoCredentialProviders: no valid providers in chain
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: SharedCredsLoad: failed to load profile, .
Jul 20 11:30:09 nodes-nova-asqybc nodeup[1222]: EC2RoleRequestError: no EC2 instance role found
```
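The errors above are the AWS credential chain giving up, which (as shown below) turned out to be because the S3_* variables never reached the node. For what it's worth, once the credentials are present, the Ceph RGW state store can be checked directly; this verification step is my own, assuming the aws CLI is available on the node:

```sh
# Hypothetical check: list the kops state prefix on the Ceph RGW endpoint
# with the same (redacted) credentials nodeup should have received.
AWS_ACCESS_KEY_ID=XXX AWS_SECRET_ACCESS_KEY=XXX \
  aws --endpoint-url http://ceph_rgw_url s3 ls s3://kops/name.k8s.local/
```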
`/etc/sysconfig/kops-configuration`

When I checked this file on the broken node, I found that these settings were missing compared with a good node:

```
S3_ACCESS_KEY_ID=XXX
S3_ENDPOINT=http://ceph_rgw_url
S3_REGION=
S3_SECRET_ACCESS_KEY=XXX
```
cloud-init

When I checked cloud-init on the node by running `curl http://169.254.169.254/latest/user-data/`, I found those env vars were missing there too.

I added the S3 env vars to `/etc/sysconfig/kops-configuration` manually, then restarted the `kops-configuration` service. After that, nodeup worked fine and the node finally joined the cluster. The fix is sketched below.
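A minimal sketch of that manual fix (the heredoc and restart commands are my reconstruction, using the same redacted values as above):

```sh
# Append the missing S3 settings to the nodeup environment file
# (endpoint and credentials redacted; substitute the real values).
cat >> /etc/sysconfig/kops-configuration <<'EOF'
S3_ACCESS_KEY_ID=XXX
S3_ENDPOINT=http://ceph_rgw_url
S3_REGION=
S3_SECRET_ACCESS_KEY=XXX
EOF

# Restart the unit so nodeup reruns with the credentials in place.
systemctl restart kops-configuration
```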
As a workaround, could you try using `--dns=none` when creating the cluster?
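For example, something along these lines (the name, cloud, and state flags just mirror this report; other required flags are omitted):

```sh
# Sketch: create a new cluster with DNS disabled via --dns=none.
# Zones, networking, image, and other required flags omitted for brevity.
kops create cluster \
  --name name.k8s.local \
  --cloud openstack \
  --state do://kops \
  --dns none
```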
Thanks for the reply. I successfully created a new cluster with `--dns=none`.

For an existing cluster, can I update the cluster configuration to set the dns block like this? One thing that confuses me is why this problem is related to DNS.
```yaml
topology:
  dns:
    type: None
  masters: private
```
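I assume applying it would look roughly like the following; whether a rolling update is also needed is exactly what I'm unsure about:

```sh
# Assumed sequence for switching an existing cluster to dns type None.
kops edit cluster --name name.k8s.local --state do://kops    # set spec.topology.dns.type: None
kops update cluster --name name.k8s.local --state do://kops --yes
kops rolling-update cluster --name name.k8s.local --state do://kops --yes
```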
This was somehow lost as part of the mitigation for https://github.com/kubernetes/kops/issues/15539. See this comment for guidance on how to switch: https://github.com/kubernetes/kops/pull/15643#issuecomment-1637151077.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.