talos-cloud-controller-manager

CCM using too many open files?

Open rsmitty opened this issue 1 year ago • 5 comments

Not sure yet whether this is a bug, but with a customer using the CCM, we're seeing the following in a cluster that frequently scales up and down by several hundred nodes:

E0410 21:54:21.057323       1 node_controller.go:277] Error getting instance metadata for node addresses: error getting metadata from the node ip-10-2-55-138.ec2.internal: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.2.55.138:50000: socket: too many open files"
E0410 21:54:21.057629       1 node_controller.go:277] Error getting instance metadata for node addresses: error getting metadata from the node ip-10-2-85-77.ec2.internal: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.2.85.77:50000: socket: too many open files"

The file-max value is very large (13 million+), so I doubt this is a sysctl setting problem. While searching around, we did see that /proc/sys/fs/inotify/max_user_instances was 8192, which could be related to an error like this.

But either way, it feels like there may be somewhere in the CCM where we're not closing connections, which could cause us to hit some limit?

rsmitty avatar Apr 12 '24 15:04 rsmitty

Looking a little further, this seems to come from this call: https://github.com/siderolabs/talos-cloud-controller-manager/blob/main/pkg/talos/instances.go#L64

This in turn calls https://github.com/siderolabs/talos-cloud-controller-manager/blob/main/pkg/talos/client.go#L67. So I'm wondering if this is actually something in COSI. Also note that the COSI version in go.mod is quite old.

rsmitty avatar Apr 12 '24 15:04 rsmitty

Thank you for the bug report.

I've checked all my clusters and did not find any file descriptor leaks, probably because my clusters do not scale up/down very often.

Let's update the dependencies first, and I will collect file descriptor statistics.
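One way such statistics could be collected on Linux is to group the symlink targets under `/proc/<pid>/fd` by type, which shows whether sockets, pipes, inotify instances, or regular files dominate. This is only a sketch: `fdKind` and `fdStats` are hypothetical names, and it assumes the caller is allowed to read the target process's `/proc` entries.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// fdKind maps a /proc/<pid>/fd symlink target to a coarse category.
func fdKind(target string) string {
	switch {
	case strings.HasPrefix(target, "socket:"):
		return "socket"
	case strings.HasPrefix(target, "pipe:"):
		return "pipe"
	case strings.HasPrefix(target, "anon_inode:"):
		// Kept verbatim, e.g. anon_inode:[eventpoll] or anon_inode:inotify.
		return target
	default:
		return "file"
	}
}

// fdStats counts the open descriptors of process pid by category.
// pid may be a numeric PID or "self" (Linux only).
func fdStats(pid string) (map[string]int, error) {
	dir := "/proc/" + pid + "/fd"
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	stats := make(map[string]int)
	for _, e := range entries {
		target, err := os.Readlink(dir + "/" + e.Name())
		if err != nil {
			continue // descriptor was closed between ReadDir and Readlink
		}
		stats[fdKind(target)]++
	}
	return stats, nil
}

func main() {
	stats, err := fdStats("self")
	if err != nil {
		fmt.Println("reading fd table failed:", err)
		return
	}
	kinds := make([]string, 0, len(stats))
	for k := range stats {
		kinds = append(kinds, k)
	}
	sort.Strings(kinds)
	for _, k := range kinds {
		fmt.Printf("%-28s %d\n", k, stats[k])
	}
}
```

Sampling this periodically while nodes join and leave would show whether the "socket" count keeps climbing after nodes are removed.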

sergelogvinov avatar Apr 16 '24 04:04 sergelogvinov

Can you add more details, please?

Which Talos version do you use, which Talos CCM commit hash, and which type of CCM deployment (DaemonSet/Deployment)?

Thanks

sergelogvinov avatar Apr 16 '24 04:04 sergelogvinov

I see you already bumped the dependencies, but just to make sure you've got the info: for this customer, CCM is a deployment, Talos version is 1.6.7, and CCM version is latest release (1.4.0).

rsmitty avatar Apr 16 '24 12:04 rsmitty

> I see you already bumped the dependencies, but just to make sure you've got the info: for this customer, CCM is a deployment, Talos version is 1.6.7, and CCM version is latest release (1.4.0).

Oh, release (1.4.0)... please try the edge version.

sergelogvinov avatar Apr 16 '24 13:04 sergelogvinov

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Oct 14 '24 08:10 github-actions[bot]

@rsmitty was this fixed?

DmitriyMV avatar Oct 14 '24 12:10 DmitriyMV

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Apr 14 '25 08:04 github-actions[bot]

This issue was closed because it has been stalled for 14 days with no activity.

github-actions[bot] avatar Apr 28 '25 08:04 github-actions[bot]