docker-machine-driver-hetzner
Improve robustness during outages
Yesterday, the Hetzner Cloud API had an outage, and it appears that the docker machine driver did not handle it well.
You can see that from 2024-11-13 17:00:00 to 2024-11-14 08:00:00, the number of requests to /server_types, /images and /locations was unexpectedly high. The number of requests for single actions was also very high.
This leads to rate limiting while waiting for servers to be created.
I see a few possible improvements:
- When waiting for an action, use an exponential backoff algorithm to spread the requests over time. You can cap the maximum waiting time at a sensible value. See https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#WithPollOpts and https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#ExponentialBackoffWithOpts
- Use a single API call to wait for multiple related actions, using https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#ActionClient.WaitFor or https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#ActionClient.WaitForFunc (note that the Watch* API is deprecated).
- Maybe cache the calls to /locations, /server_types and /images; those shouldn't change that often. Unless you are checking for server type availability?
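To illustrate the first point, here is a minimal, standard-library-only sketch of a capped exponential backoff. It is not the hcloud-go implementation; the linked `WithPollOpts` and `ExponentialBackoffWithOpts` are the real entry points. The names `BackoffFunc` and `cappedExponentialBackoff`, and the 1s/2x/30s values, are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// BackoffFunc returns how long to wait before the next poll,
// given how many polls have already happened (0-based).
type BackoffFunc func(retries int) time.Duration

// cappedExponentialBackoff is a hypothetical helper for illustration only.
// It returns base * multiplier^retries, limited to maxWait.
func cappedExponentialBackoff(base time.Duration, multiplier float64, maxWait time.Duration) BackoffFunc {
	return func(retries int) time.Duration {
		d := float64(base) * math.Pow(multiplier, float64(retries))
		if d > float64(maxWait) {
			// Also covers overflow to +Inf for very large retry counts.
			return maxWait
		}
		return time.Duration(d)
	}
}

func main() {
	// Example values (assumptions): start at 1s, double each time, cap at 30s.
	backoff := cappedExponentialBackoff(time.Second, 2, 30*time.Second)
	for retries := 0; retries < 7; retries++ {
		fmt.Printf("poll %d: wait %s\n", retries+1, backoff(retries))
	}
}
```

With these values the waits grow as 1s, 2s, 4s, 8s, 16s and then stay at 30s, so a long-running action costs a handful of requests instead of one per fixed interval.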
Free stress testing, I don't see the issue.
Bad jokes aside, sorry this caused you headaches. I'll look into getting the exponential back-off implemented soon. Regarding error handling in general, I am somewhat torn as to what the best approach is. We do have explicit retry with a set timeout, which was implemented in response to a feature request. The default behaviour is to fail fast, as it always was, but that could be changed in a major version bump. When using the CLI, fail-fast is what I would expect, but I do see the issue with applications that talk to docker-machine over RPC, such as Rancher, producing a request storm in fail-fast mode.
As for the caching, I do get the point of them being stable. However, I cannot really be sure in which environment the driver is running. Granted, vanilla docker-machine would be useless without a writeable home directory. But given its RPC nature, it could be run under any kind of restrictions, as long as one takes care that it can access the provided SSH key files.
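One possible way to reconcile caching with an unknown environment is a best-effort cache: use a file when one can be read or written, and silently fall back to a plain fetch otherwise. The following is only a sketch under that assumption, not code from the driver. `cachedNames`, the JSON layout, the directory name and the sample location names are all hypothetical, and the `fetch` callback stands in for a real API call.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// cacheEntry is the (hypothetical) on-disk format: the payload plus fetch time.
type cacheEntry struct {
	FetchedAt time.Time `json:"fetched_at"`
	Names     []string  `json:"names"`
}

// cachedNames returns the names stored at path if they are younger than ttl.
// Otherwise it calls fetch and tries to store the result. Cache read and
// write failures are ignored, so a read-only environment only loses the
// caching benefit and never fails because of the cache.
func cachedNames(path string, ttl time.Duration, fetch func() ([]string, error)) ([]string, error) {
	if raw, err := os.ReadFile(path); err == nil {
		var e cacheEntry
		if json.Unmarshal(raw, &e) == nil && time.Since(e.FetchedAt) < ttl {
			return e.Names, nil
		}
	}
	names, err := fetch()
	if err != nil {
		return nil, err
	}
	if raw, err := json.Marshal(cacheEntry{FetchedAt: time.Now(), Names: names}); err == nil {
		if os.MkdirAll(filepath.Dir(path), 0o700) == nil {
			_ = os.WriteFile(path, raw, 0o600) // best effort only
		}
	}
	return names, nil
}

func main() {
	// Hypothetical stand-in for a GET /locations request; sample data only.
	fetchLocations := func() ([]string, error) {
		fmt.Println("fetching from API")
		return []string{"fsn1", "nbg1", "hel1"}, nil
	}
	dir, err := os.UserCacheDir()
	if err != nil {
		dir = os.TempDir() // no usable cache dir: try the temp dir instead
	}
	path := filepath.Join(dir, "hetzner-driver-example", "locations.json")
	names, err := cachedNames(path, time.Hour, fetchLocations)
	fmt.Println(names, err)
}
```

Because every cache failure is swallowed, the worst case in a restricted environment is the current behaviour: one request per invocation.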