node-driver-registrar
csi-node-driver-registrar consumed ~44% CPU time reading registry TimeZone information on Windows
I was trying to narrow down a perf issue and noticed that csi-node-driver-registrar.exe consumed 44% of CPU time within 359ms. The majority of this time was spent reading time zone information from the container's Windows registry.
Worked with @andyzhangx and he recommended tracking this issue here.
I would like to know:
- whether csi-node-driver-registrar.exe delays the container startup time
- the purpose of reading the time zone information.
Best regards, Howard Hao
The csi node driver registrar primarily executes when the CSI node plugin is initializing and registering with Kubelet. Did you observe the above happening beyond the CSI node daemonset pod's initialization? Were there any interesting bits (especially errors/retries) in the logs of the csi node driver registrar container from the CSI node plugin pod?
It's within the bounds of Pod initialization. I am curious why this process needs to read time zone registry keys; this operation seems to take quite a bit of CPU time. Thank you for the quick response.
> It's within the bounds of Pod initialization
This is quite unexpected in the first place (assuming Pod initialization above is referring to a general stateful workload pod that mounts PVs backed by the CSI plugin) as the driver registrar does not have a role to play beyond CSI Node registration. The logs may reveal if you have a situation where the plugin is failing to register or restarting for some reason.
> I am curious why this process needs to read time zone registry keys
Just a guess (logs/stack traces needed to confirm) but it could be this sequence: https://github.com/kubernetes-csi/node-driver-registrar/blob/db46d1785a80c7f57ee74ed49fb9530be44708c2/cmd/csi-node-driver-registrar/main.go#L192 => https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L113 => https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L211
On profiling: you can use https://github.com/kubernetes-csi/node-driver-registrar/blob/6f7211c7884e434616aeb385863e32fe311fbde9/cmd/csi-node-driver-registrar/main.go#L73 to enable these endpoints: https://github.com/kubernetes-csi/node-driver-registrar/blob/6f7211c7884e434616aeb385863e32fe311fbde9/cmd/csi-node-driver-registrar/node_register.go#L116-L124
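If it helps, here is a self-contained sketch of the pprof wiring those linked lines do (not a verbatim copy of the registrar's code; the exact flag names and address are in the links above):

```go
// Sketch of the standard net/http/pprof endpoint pattern; the registrar
// gates this behind the flag linked above. The address is an assumption.
package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	log.Fatal(http.ListenAndServe("localhost:8080", mux))
}
```

With that serving, `go tool pprof http://localhost:8080/debug/pprof/profile?seconds=30` would attribute the CPU time to Go stacks rather than guessing from the trace.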
I think the demo process (9684) was still running; these instances were the kubelet-registration-probe operations happening every 10 seconds, with a command like: /csi-node-driver-registrar.exe --kubelet-registration-path=C:\var\lib\kubelet\plugins\disk.csi.azure.com\csi.sock --mode=kubelet-registration-probe
If you're running a recent cluster version (1.25+), I'd suggest removing --mode=kubelet-registration-probe. This mode was added as a workaround for https://github.com/kubernetes-csi/node-driver-registrar/issues/143, but that was fixed in https://github.com/kubernetes/kubernetes/issues/104584, i.e. we no longer need to probe to check that node-driver-registrar is up.
Let me ask the team to see if they can move to 1.25. Thanks.
I wonder if the demo and the probe use the same code base; if they do, then we still need to figure out the purpose of the Time Zone registry reads. Based on the call stack, RegOpenKeyExW is called directly by csi-node-driver-registrar.exe, but as you mentioned, I also can't find RegOpenKeyEx in this repo. I am guessing this may come from the Go runtime.
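That guess can be checked outside this repo: on Windows, Go's time package lazily loads the local time zone from the HKLM\SYSTEM\CurrentControlSet\Control\TimeZoneInformation registry key (via RegOpenKeyExW) the first time local time is needed, e.g. when formatting the first log timestamp. A minimal sketch that assumes nothing from the registrar's code:

```go
// Minimal sketch: the first use of local time triggers the time package's
// initLocal, which on Windows opens the TimeZoneInformation registry keys.
package main

import (
	"log"
	"time"
)

func main() {
	// log formats this timestamp in local time, forcing the time zone load.
	log.Printf("local time: %s", time.Now())
}
```

Running this under a registry monitor should show the same RegOpenKeyExW calls, which would confirm the Go runtime as the origin.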
@Howard-Haiyang-Hao we don't need to wait for 1.25; we could change the current azure disk daemonset config directly by removing the livenessProbe and check whether that solves the issue. Let's discuss offline, and I can share the steps with you.
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/09066645538325be70cf0f28915ef484186c2ba9/deploy/csi-azuredisk-node-windows.yaml#L63-L70
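For reference, the linked lines are the registrar container's livenessProbe; a paraphrased sketch of the block to delete (exact fields in the current file may differ):

```yaml
# Paraphrased from the linked csi-azuredisk-node-windows.yaml lines, not a
# verbatim copy. Removing this livenessProbe block stops the periodic
# --mode=kubelet-registration-probe executions.
livenessProbe:
  exec:
    command:
      - /csi-node-driver-registrar.exe
      - --kubelet-registration-path=C:\var\lib\kubelet\plugins\disk.csi.azure.com\csi.sock
      - --mode=kubelet-registration-probe
  initialDelaySeconds: 60
  timeoutSeconds: 30
```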
@andyzhangx, let me work with you offline to see if the workaround solves the issue. Thanks!
I guess the following call is what actually triggers the Time Zone registry key enumerations:
Any idea what the purpose of this call is?
Thanks, Howard.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
/remove-lifecycle stale