node-driver-registrar icon indicating copy to clipboard operation
node-driver-registrar copied to clipboard

csi-node-driver-registrar consumed ~44% cpu time to read container registry TimeZone information in Windows

Open Howard-Haiyang-Hao opened this issue 2 years ago • 14 comments

I was trying to narrow down a perf issue and noticed that csi.node-driver-registrar.exe consumed 44% cpu time within 359ms. The majority of this time was spent on reading Container registry timezoon information.

Worked with @andyzhangx and he recommeneded tracking this issue here .

csi-node-driver-registrar-cpu MicrosoftTeams-image (2)

I would like to know:

  1. if csi.node-driver-registrar.exe will delay the container startup time
  2. purpose of readying the timezone information.

Best regards, Howard Hao

Howard-Haiyang-Hao avatar Oct 14 '22 15:10 Howard-Haiyang-Hao

The csi node driver registrar primarily executes when the CSI node plugin is initializing and registering with Kubelet. Did you observe the above happening beyond the CSI node daemonset pod's initialization? Were there any interesting bits (especially errors/retries) in the logs of the csi node driver registrar container from the CSI node plugin pod?

ddebroy avatar Oct 14 '22 17:10 ddebroy

It's within the bounds of Pod initialization. I am curious why this process needs to read time zone registry keys. This operation seems take a quite bit CPU time. Thank you for the quick response.

Howard-Haiyang-Hao avatar Oct 14 '22 17:10 Howard-Haiyang-Hao

It's within the bounds of Pod initialization

This is quite unexpected in the first place (assuming Pod initialization above is referring to a general stateful workload pod that mounts PVs backed by the CSI plugin) as the driver registrar does not have a role to play beyond CSI Node registration. The logs may reveal if you have a situation where the plugin is failing to register or restarting for some reason.

I am curious why this process needs to read time zone registry keys

Just a guess (logs/stack traces needed to confirm) but it could be this sequence: https://github.com/kubernetes-csi/node-driver-registrar/blob/db46d1785a80c7f57ee74ed49fb9530be44708c2/cmd/csi-node-driver-registrar/main.go#L192 => https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L113 => https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L211

ddebroy avatar Oct 14 '22 18:10 ddebroy

Adding on profiling, you can use https://github.com/kubernetes-csi/node-driver-registrar/blob/6f7211c7884e434616aeb385863e32fe311fbde9/cmd/csi-node-driver-registrar/main.go#L73 to enable these endpoints https://github.com/kubernetes-csi/node-driver-registrar/blob/6f7211c7884e434616aeb385863e32fe311fbde9/cmd/csi-node-driver-registrar/node_register.go#L116-L124

mauriciopoppe avatar Oct 14 '22 19:10 mauriciopoppe

I think the demo process(9684) was still running, these instances were the kubelet-registion-probe operations that was happening every 10 seconds, with command like: /csi-node-driver-registrar.exe --kubelet-registration-path=C:\var\lib\kubelet\plugins\disk.csi.azure.com\csi.sock --mode=kubelet-registration-probe

csi-node-driver-registrar-probe

Howard-Haiyang-Hao avatar Oct 14 '22 20:10 Howard-Haiyang-Hao

If you're running a recent cluster version (1.25+) I'd suggest removing --mode=kubelet-registration-probe, this mode was added as a workaround due to https://github.com/kubernetes-csi/node-driver-registrar/issues/143 but it was fixed in https://github.com/kubernetes/kubernetes/issues/104584 i.e. we no longer need to probe to check that node-driver-registrar is up.

mauriciopoppe avatar Oct 14 '22 21:10 mauriciopoppe

Let me asked team to see if they can move to 1.25. Thanks.

I wonder if the demo and probe use the same code base, if we do, then we still need to figure out purpose of the Time Zone registry readings. Based on the call stack, it's called by csi-node-driver-registrar.exe directly by calling RegOpenKeyExW, but as you mentioned, I also can't find RegOpenKeyEx in this repo. I am guessing this may come from the Go Lang runtime.

Howard-Haiyang-Hao avatar Oct 15 '22 16:10 Howard-Haiyang-Hao

Let me asked team to see if they can move to 1.25. Thanks.

I wonder if the demo and probe use the same code base, if we do, then we still need to figure out purpose of the Time Zone registry readings. Based on the call stack, it's called by csi-node-driver-registrar.exe directly by calling RegOpenKeyExW, but as you mentioned, I also can't find RegOpenKeyEx in this repo. I am guessing this may come from the Go Lang runtime.

@Howard-Haiyang-Hao we don't need to wait for 1.25, we could change current azure disk daemonset config directly by removing livenessProbe and check whether it solves the issue, let's discuss offline, and I can share you the steps. https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/09066645538325be70cf0f28915ef484186c2ba9/deploy/csi-azuredisk-node-windows.yaml#L63-L70

andyzhangx avatar Oct 16 '22 12:10 andyzhangx

@andyzhangx, let me work with you offline to see if the workaround solves the issue. Thanks!

Howard-Haiyang-Hao avatar Oct 18 '22 04:10 Howard-Haiyang-Hao

I guess the following call actually trigger the Time Zone registry key emumations:

image

Any idea the purpose of this call?

thanks, Howard.

Howard-Haiyang-Hao avatar Oct 19 '22 02:10 Howard-Haiyang-Hao

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 17 '23 03:01 k8s-triage-robot

/remove-lifecycle stale

mauriciopoppe avatar Jan 19 '23 23:01 mauriciopoppe

/lifecycle frozen

mauriciopoppe avatar Feb 17 '23 20:02 mauriciopoppe

/remove-lifecycle stale

andyzhangx avatar Mar 03 '23 03:03 andyzhangx