
[cinder-csi-plugin] [Bug] Failed to GetOpenStackProvider i/o timeout

Open modzilla99 opened this issue 2 years ago • 7 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: When starting the csi plugin it is not able to communicate with keystone. It will get stuck in an io timeout.

What you expected to happen: The plugin should talk to the API and start.

How to reproduce it: I am running a Kubernetes 1.24.0 cluster with cinder-csi-plugin 1.23.0 and CoreDNS 1.9.2.

Anything else we need to know?: A tcpdump capture suggests that the pod tries to resolve the wrong name: it attempts to connect to ${URL}.kube-system.svc.cluster.local. The same version of the CSI driver works on Kubernetes 1.23 with CoreDNS 1.8.7.
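For context, this is consistent with the pod's resolver walking its DNS search list: with the default dnsPolicy, the kubelet writes an /etc/resolv.conf containing the namespace's search domains and options ndots:5, so any name with fewer than five dots is first tried with each search suffix appended. A quick way to confirm (a sketch; the pod name placeholder and the nameserver address are illustrative):

kubectl exec -n kube-system <csi-cinder-controllerplugin-pod> -c cinder-csi-plugin -- cat /etc/resolv.conf
# typical contents for a pod in kube-system with the default dnsPolicy:
#   search kube-system.svc.cluster.local svc.cluster.local cluster.local
#   nameserver 10.96.0.10
#   options ndots:5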

Environment:

  • openstack-cloud-controller-manager(or other related binary) version: 1.23
  • OpenStack version: Victoria
  • Others:

modzilla99 avatar May 19 '22 11:05 modzilla99

The current info seems too generic

A TCPDump suggests that the pod tries to resolve the wrong URL. It will try to connect to ${URL}.kube-system.svc.cluster.local.

It looks like this is not the CPO CSI function failing; rather, the pod is not able to connect to keystone. Possible causes:

  1. the pod can't reach the service because the network is unreachable
  2. as you described, the name isn't resolved correctly (no detailed info yet, though)

So I think more info, like the real error you saw and the logs of the CSI pods, would be helpful.

jichenjc avatar May 20 '22 09:05 jichenjc

I seem to be having a similar problem. So let me hopefully provide enough information to get somewhere with this.

Cluster info: Kubernetes 1.24.1, CoreDNS 1.8.6, csi-cinder-plugin 1.22.0 (and I tested with 1.24.2).

Cloud config for csi-cinder-plugin.

[Global]
auth-url="http://keystone.m6me.cheetahfox.com:80/v3"
username="k8s"
password="*********************"
region="RegionOne"
tenant-id="7d5e3725250c434cb935a43dc34865d9"
tenant-name="k8s"
domain-name="Default"
os-endpoint-type="internalURL"

[BlockStorage]
bs-version=v3
ignore-volume-az=False

Logs from the csi-cinder-controllerplugin pod / container : cinder-csi-plugin

I0626 04:40:07.793361       1 driver.go:74] Driver: cinder.csi.openstack.org
I0626 04:40:07.793489       1 driver.go:75] Driver version: [email protected]
I0626 04:40:07.793496       1 driver.go:76] CSI Spec version: 1.3.0
I0626 04:40:07.793530       1 driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0626 04:40:07.793538       1 driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0626 04:40:07.793544       1 driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0626 04:40:07.793549       1 driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0626 04:40:07.793554       1 driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0626 04:40:07.793564       1 driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0626 04:40:07.793569       1 driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0626 04:40:07.793574       1 driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0626 04:40:07.793578       1 driver.go:106] Enabling controller service capability: GET_VOLUME
I0626 04:40:07.793583       1 driver.go:118] Enabling volume access mode: SINGLE_NODE_WRITER
I0626 04:40:07.793589       1 driver.go:128] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0626 04:40:07.793595       1 driver.go:128] Enabling node service capability: EXPAND_VOLUME
I0626 04:40:07.793599       1 driver.go:128] Enabling node service capability: GET_VOLUME_STATS
I0626 04:40:07.794109       1 openstack.go:90] Block storage opts: {0 false false}
W0626 04:40:37.796236       1 main.go:108] Failed to GetOpenStackProvider: Post "http://keystone.m6me.cheetahfox.com:80/v3/auth/tokens": dial tcp: i/o timeout

When looking at the network traffic from the cinder-csi-plugin, I see only DNS requests for A and AAAA records for this DNS name.

keystone.m6me.cheetahfox.com.kube-system.svc.cluster.local

So I see the same strange thing that was reported above. The container seems to be trying to resolve this address with ".kube-system.svc.cluster.local" appended to the valid auth url.

The keystone API is at that URL. I don't think this is a networking issue, since I can access the API from other pods in the cluster (it's kind of hard to check from the container itself, since it doesn't really have any tools and it restarts after about 20 seconds).
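Since the container has no tools, one way to look at its resolver directly is an ephemeral debug container sharing the pod's namespaces (a sketch, assuming ephemeral containers are enabled on the cluster; they are beta in Kubernetes 1.23/1.24):

kubectl debug -it -n kube-system csi-cinder-controllerplugin-6549b5d56-tgsfx \
  --image=busybox:1.36 --target=cinder-csi-plugin -- sh
# inside the debug shell, check what the plugin's resolver sees:
cat /etc/resolv.conf
nslookup keystone.m6me.cheetahfox.com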

This configuration was working just fine on Kubernetes 1.22. I upgraded the cluster to 1.23.7 and then 1.24.1, and everything worked fine for about a week. Then, for unrelated reasons, I needed to restart the VMs in this cluster. After the restart is when I noticed this container wasn't ready, along with all of my pods with Cinder-provided PVCs not working.

The other containers in the pod just have logs like the following, with "Still connecting" repeating about every ten seconds.

josh@Cheetah:~/network-automation/services/openstack-deployment$ kubectl logs --namespace=kube-system csi-cinder-controllerplugin-6549b5d56-tgsfx csi-provisioner
I0626 20:34:49.018523       1 csi-provisioner.go:138] Version: v3.0.0
I0626 20:34:49.018642       1 csi-provisioner.go:161] Building kube configs for running in cluster...
W0626 20:34:59.021171       1 connection.go:173] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
W0626 20:35:09.021641       1 connection.go:173] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

I was looking at the driver code, and I don't see how it could be getting a different URL in the driver itself. Could it be something in gophercloud? Tracing things back up the stack, that seems to be where this might be happening, but I am not sure...

I also tried setting os-endpoint-type to "internalURL", since about the only thing I could figure was that gophercloud was changing something in the URL because of the endpoint type. That had no effect. I also tried removing the :80 from the URL, because why not... also no effect. I am going to try downgrading my cluster and hope this starts working with Kubernetes 1.23.7.

cheetahfox avatar Jun 26 '22 21:06 cheetahfox

From the context, it looks like the URL is being treated as a short service name, with the local cluster domain appended, so this is likely not a CSI issue but a Kubernetes or DNS-server setting. As https://en.wikipedia.org/wiki/Fully_qualified_domain_name explains, a name that is not an FQDN gets a domain name appended, and that is what happens here. So the workaround might be to use an IP address instead of keystone.m6me.cheetahfox.com in your configuration, or to set up DNS correctly so the service domain is not appended (how to do that, I don't know; still digging).
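If someone wants to experiment, two untested ideas (sketches only; the values are examples): a trailing dot makes the name fully qualified so the resolver skips the search list, and a pod-level dnsConfig can lower ndots so external names are queried as-is first.

# 1) in cloud.conf, a trailing dot makes the hostname absolute:
auth-url="http://keystone.m6me.cheetahfox.com.:80/v3"

# 2) or lower ndots in the controller plugin's pod template:
dnsConfig:
  options:
    - name: ndots
      value: "2"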

jichenjc avatar Jun 27 '22 06:06 jichenjc

Hello, I have exactly the same issue. Did you ever find a solution?

Thanks

Jeff

jfpucheu avatar Sep 06 '22 13:09 jfpucheu

Are you able to connect to the OpenStack endpoint from your local environment, e.g. to the DNS name in the cloud.conf you used?

jichenjc avatar Sep 06 '22 13:09 jichenjc

Yes, from the node it works; I don't understand why cinder-csi-plugin can't...

I0906 13:34:01.026232       1 openstack.go:89] Block storage opts: {0 false false}
W0906 13:34:31.026950       1 main.go:100] Failed to GetOpenStackProvider: Post "https://iam.eu-west-0.mycloudprovider.com/v3/auth/tokens": dial tcp: i/o timeout

The openstack-cloud-controller-manager-9kkt pod has no issue reaching the same endpoint...

jfpucheu avatar Sep 06 '22 13:09 jfpucheu

The original issue is: "So I see the same strange thing that was reported above. The container seems to be trying to resolve this address with '.kube-system.svc.cluster.local' appended to the valid auth url."

That appended name is incorrect. I suggest you try using an IP instead of the hostname of the OpenStack service and see whether you get the same pattern. Basically, I think it's related to the DNS setup, but I'm not sure why OCCM works while Cinder CSI does not.

jichenjc avatar Sep 06 '22 14:09 jichenjc

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 05 '22 14:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 04 '23 15:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Feb 03 '23 15:02 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 03 '23 15:02 k8s-ci-robot

[INFO] 192.168.145.12:44558 - 55791 "A IN <my openstack url>.kube-system.svc.cluster.local. udp 70 false 512" NXDOMAIN qr,aa,rd 163 0.000231634s
[INFO] 192.168.145.12:47886 - 48222 "AAAA IN <my openstack url>.kube-system.svc.cluster.local. udp 70 false 512" NXDOMAIN qr,aa,rd 163 0.000184359s
[INFO] 192.168.145.12:44347 - 13713 "AAAA IN <my openstack url>.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000144048s
[INFO] 192.168.145.12:43551 - 9405 "A IN openstack.im.pype.tech.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000242885s
[INFO] 192.168.145.12:45532 - 12467 "A IN <my openstack url>.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000225486s
[INFO] 192.168.145.12:46515 - 14766 "A IN <my openstack url>.openstacklocal. udp 55 false 512" NXDOMAIN qr,rd,ra 130 0.001658468s

These log entries are from the coredns pods after enabling query logging. But the cloud-config I've provided has the correct URL; I even tried using an IP rather than DNS for OpenStack.

I'm not sure why it appends these local svc domains, etc.
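For anyone debugging the same pattern: the NXDOMAIN answers for each search-suffixed name are normal resolver behavior; what matters is whether a final, absolute query for the bare name succeeds. A quick check from any pod with DNS tools (the trailing dot forces an absolute lookup, bypassing the search list):

# if this resolves but the plugin still times out, the problem is
# past DNS (e.g. routing/egress rather than name resolution)
nslookup <my openstack url>.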

sqaisar avatar Apr 28 '24 22:04 sqaisar