hcloud-cloud-controller-manager
Hcloud manager errors: Couldn't reconcile node routes error listing routes context deadline exceeded
We are using the hcloud manager in a cluster deployed on Hetzner VMs. The hcloud manager is deployed with network support. After a few days it started hitting the Hetzner Cloud API limits and logging the following errors:
E0830 16:20:57.819595 1 route_controller.go:118] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/hcloudRouteToRoute: hcops/AllServersCache.ByPrivateIP: 192.168.1.7 hcops/AllServersCache.getCache: Get "https://api.hetzner.cloud/v1/servers?page=1&per_page=50": context deadline exceeded
The hcloud manager is deployed via the Kubespray addons, without any specific configuration. It doesn't seem to affect the cluster for now, but judging from the logs it is an issue, and it affects our Terraform commands even though they use different API keys for the Hetzner Cloud API.
We have had the same issue recently and constantly hit the Hetzner Cloud rate limit, probably due to retries.
Also, the documentation says the rate limit is per project, not per API key, and the support team refused to increase the rate limit :-(
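Editor's sketch, related to the retries mentioned above: to my knowledge, Hetzner Cloud API responses carry `RateLimit-Limit`, `RateLimit-Remaining` and `RateLimit-Reset` headers, the last one being a UNIX timestamp. Under that assumption, a client could wait for the reset instead of retrying immediately. The helper name `seconds_until_reset` is hypothetical, and the headers are passed in as a plain dict so the snippet does not depend on any HTTP library.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long a client should wait before its next request.

    Assumes the response headers are given as a dict with the exact keys
    "RateLimit-Remaining" and "RateLimit-Reset" (UNIX timestamp).
    """
    now = time.time() if now is None else now
    # If the header is missing, assume there is still budget left.
    remaining = int(headers.get("RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # No budget left: wait until the advertised reset time.
    reset_at = float(headers.get("RateLimit-Reset", now))
    return max(0.0, reset_at - now)


if __name__ == "__main__":
    exhausted = {"RateLimit-Remaining": "0", "RateLimit-Reset": "1000"}
    print(seconds_until_reset(exhausted, now=990))
```

This only illustrates the idea of backing off until the reset time; it says nothing about how the CCM itself handles retries.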
I've often run into the same issue: the Hetzner API rate limits are too strict and too low. I see rate limit errors very often, too.
Sad that Hetzner is not willing to increase the rate limits as it cannot be used for serious setups, if it hits these API rate limits ...
On 04.09.2022 at 13:52, Aveline @.*** wrote:
We have had the same issues recently and constantly hit the Hetzner cloud's rate limit probably due to retries.
Also, the document says the rate limit is per project, not per API key and the support team refused to increase rate limit :-(
@LKaemmerling Thanks for linking the fix to the issue. Do you know when it will be released?
We'll release it this week, maybe tomorrow! I'll keep you up to date 👌🏼
After the release, the hcloud manager still hits the API limit:
E0918 04:11:36.921648 1 route_controller.go:119] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/hcloudRouteToRoute: hcops/AllServersCache.ByPrivateIP: hcops/AllServersCache.getCache: Get "https://api.hetzner.cloud/v1/servers?page=1&per_page=50": context deadline exceeded
Honestly, that's not serious. How are we supposed to deploy production workloads on Hetzner in that case?
Please reopen, we can't use Terraform because the hcloud manager always hits the API limit.
@mmpetarpeshev Sorry for the late reply. We will ofc take care of this! Two questions here:
- Do you use the newest version? (v1.13.0)
- If yes, is the error still the same?
I just want to make sure that it's only the API limits hitting you. The "context deadline exceeded" was a rather vague error message. The newest version should print a more specific error (besides the deadline exceeded).
Hi @4ND3R50N, thanks for taking care of that.
1. We are using the Docker image tag as I pulled the image a few days ago. I will check the version from the logs later today.
2. The error message was a little bit different, I think, something like:
route_controller.go:119] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)
I will check everything later today and provide an update.
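Editor's sketch: the log line above reports a limit of 3600 requests per hour, which averages out to one request per second for the whole project. The arithmetic below estimates what share of that budget a periodic reconcile loop would consume. The function names and the example numbers (calls per reconcile, interval) are illustrative assumptions, not measurements from the CCM.

```python
LIMIT_PER_HOUR = 3600  # limit quoted in the rate_limit_exceeded log line


def hourly_requests(calls_per_reconcile, interval_seconds):
    """Requests per hour for a loop issuing a fixed number of API calls per run."""
    return calls_per_reconcile * 3600 / interval_seconds


def budget_share(calls_per_reconcile, interval_seconds, limit=LIMIT_PER_HOUR):
    """Fraction of the hourly limit that such a loop would use."""
    return hourly_requests(calls_per_reconcile, interval_seconds) / limit


if __name__ == "__main__":
    # Hypothetical loop: 2 API calls every 10 seconds.
    print(hourly_requests(2, 10), budget_share(2, 10))
    # Hypothetical retry storm: 1 API call every second.
    print(hourly_requests(1, 1), budget_share(1, 1))
```

Under these assumptions, a single caller retrying once per second would already use the entire project budget, leaving nothing for other tools that share the project.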
@mmpetarpeshev
Ok, good news, so it's the API limit. Don't worry, this is an internal mechanism to prevent spam. I will talk to some colleagues about how we are going to proceed with those cases, since you're not the only one having trouble with it.
Waiting for your final update, I will also keep you up to date :-)
I checked the logs and there is the line: Hetzner Cloud k8s cloud controller v1.9.1 started
I tried with the latest Docker image tag and with v1.13.0, using both the Helm deployment and Ansible (i.e. as a DaemonSet or a Deployment).
I'm not sure whether that is the correct version, since you said v1.13.0; from what I saw, the Docker image's latest tag was updated a few days ago.
Hi, is there any ETA for this issue? We're still constantly hitting it even after upgrading to v1.13.1.
E0930 08:01:14.976446 1 route_controller.go:119] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)
E0930 08:01:15.832617 1 node_controller.go:364] Failed to update node addresses for node "us-east1-prd-worker-13": failed to get node address from cloud provider that matches ip: 10.241.0.28
E0930 08:01:15.932972 1 route_controller.go:119] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)
E0930 08:01:18.931466 1 route_controller.go:119] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)
E0930 08:01:19.040918 1 route_controller.go:119] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)
E0930 08:01:19.677038 1 route_controller.go:119] Couldn't reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)
Tried to ask support to increase the API limit temporarily but they said no.
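Editor's sketch: to see which controller produces most of the errors in logs like the ones above, the lines can be grouped by source file and error kind. This assumes the klog-style layout visible in the excerpt (severity and date, time, process ID, `file:line]`, message); the function name `summarize` and the three categories are my own choices.

```python
import re
from collections import Counter

# Matches lines such as:
# E0930 08:01:14.976446 1 route_controller.go:119] Couldn't reconcile ...
LINE_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<src>[\w.]+):\d+\] (?P<msg>.*)$"
)


def summarize(lines):
    """Count log lines per (source file, error kind)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a klog-style line
        msg = m.group("msg")
        if "rate_limit_exceeded" in msg:
            kind = "rate_limit_exceeded"
        elif "context deadline exceeded" in msg:
            kind = "context deadline exceeded"
        else:
            kind = "other"
        counts[(m.group("src"), kind)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'E0930 08:01:14.976446 1 route_controller.go:119] Couldn\'t reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)',
        'E0930 08:01:15.832617 1 node_controller.go:364] Failed to update node addresses for node "us-east1-prd-worker-13": failed to get node address from cloud provider that matches ip: 10.241.0.28',
        'E0930 08:01:15.932972 1 route_controller.go:119] Couldn\'t reconcile node routes: error listing routes: hcloud/ListRoutes: hcloud/reloadNetwork: limit of 3600 requests per hour reached (rate_limit_exceeded)',
    ]
    for key, n in summarize(sample).most_common():
        print(key, n)
```

In practice the input would be the CCM pod's log output; the sample lines here are copied from this thread.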
@ym can you please send the ticket ID to us, or reply to that ticket with an explicit mention of my name?
@LKaemmerling
Thanks, the ticket ID is #2022083103009613
@ym you will get an answer :)
We want to debug this even further. With one of the last releases, we got a contribution that added metrics to all API calls (https://github.com/hetznercloud/hcloud-cloud-controller-manager/pull/303). You should be able to see how often specific endpoints were called by looking at the metrics of the CCM. Can you maybe send us a screenshot from your Grafana dashboard or, if possible, give us access to this dashboard via mail to lukas.kaemmerling(at)hetzner-cloud.de?
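Editor's sketch: without a Grafana dashboard, the per-endpoint call counts can also be read directly from the CCM's metrics output in Prometheus text format. The metric name `hcloud_api_requests_total` and the label `api_endpoint` are assumptions on my part and should be checked against the actual output; both are parameters for that reason. The sample values are made up for illustration.

```python
import re
from collections import defaultdict

# name{label="value",...} value
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)\s*$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def requests_per_endpoint(metrics_text,
                          metric_name="hcloud_api_requests_total",  # assumed name
                          label="api_endpoint"):                    # assumed label
    """Sum a counter over all other labels, grouped by one label, highest first."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        totals[labels.get(label, "<unknown>")] += float(m.group("value"))
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Made-up sample, only to show the expected input shape.
    text = """# TYPE hcloud_api_requests_total counter
hcloud_api_requests_total{api_endpoint="/servers",code="200",method="get"} 120
hcloud_api_requests_total{api_endpoint="/servers",code="429",method="get"} 30
hcloud_api_requests_total{api_endpoint="/networks",code="200",method="get"} 50
"""
    print(requests_per_endpoint(text))
```

Comparing two snapshots taken an hour apart would give the per-endpoint request rate to set against the hourly limit.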
@ym okay, you won't get a mail :D I have the honor to say that your limit was just increased :)
@ym And we apologize for the trouble. Regarding
Sad that Hetzner is not willing to increase the rate limits
we do increase API limits for various use cases. In this case the request was unfortunately not forwarded to the responsible department. We have already contacted support to refresh their knowledge of the proper workflow for these requests.
I'm also currently running into rate limit issues. Are there any plans to maybe increase limits for endpoints used by this CCM?
Especially when doing maintenance on your cluster (adding nodes, testing nodes, removing nodes, ...) you'll be rate limited very quickly. It's quite annoying tbh.
I'm struggling with that every day. If I hadn't invested so much time deploying k8s and all our apps on Hetzner, the first thing I would do is move out. Sorry guys, but that's absolutely amateur work here, the worst API service I have ever seen.
@maaft @mmpetarpeshev could you please try to do what I requested here: https://github.com/hetznercloud/hcloud-cloud-controller-manager/issues/308#issuecomment-1263282005
We need to understand what your cluster is doing :)
Thanks @LKaemmerling, I will try to get these metrics in the next few days and provide them to you.
This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.