
Failure communicating with Statuscake API resulting in duplicate tests

osilva opened this issue 4 years ago • 4 comments

We are running IMC v2.1.10 on a Kubernetes v1.19.12 cluster, with tests/monitors created in Statuscake. We find that duplicate monitors are being created, and when I check the IMC pod logs I see:

{ "level": "error", "ts": 1631202713.767603, "logger": "statuscake-monitor", "msg": "Unable to retrieve monitor", "error": "Get \"https://app.statuscake.com/API/Tests/\": read tcp 10.51.4.130:50776->104.20.73.215:443: read: connection reset by peer", "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error \t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132 sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error ...

So it makes sense that if there is no communication with Statuscake, the monitor will be created anyway, and that is exactly what happens.

I am not sure whether the problem is with IMC or with Statuscake, and I am trying to gather more information. So far I have not found errors on the underlying node, in the cluster, or with the network, but at the same time I don't know exactly what is happening inside the IMC container.

I'm new-ish to this, so it's very likely I am missing something obvious, but I am unable to exec into the container itself:

OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory"

I see that others have reported duplicate test creation as well, but I did not find a resolution.

The full error, immediately followed by a monitor being created anyway:

{"level":"error","ts":1631257648.135457,"logger":"statuscake-monitor","msg":"Unable to retrieve monitor","error":"Get \"https://app.statuscake.com/API/Tests/\": read tcp 10.51.4.130:45626->104.20.73.215:443: read: connection reset by peer","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:144\ngithub.com/stakater/IngressMonitorController/pkg/monitors/statuscake.(*StatusCakeMonitorService).GetAll\n\t/workspace/pkg/monitors/statuscake/statuscake-monitor.go:231\ngithub.com/stakater/IngressMonitorController/pkg/monitors/statuscake.(*StatusCakeMonitorService).GetByName\n\t/workspace/pkg/monitors/statuscake/statuscake-monitor.go:203\ngithub.com/stakater/IngressMonitorController/pkg/monitors.(*MonitorServiceProxy).GetByName\n\t/workspace/pkg/monitors/monitor-proxy.go:84\ngithub.com/stakater/IngressMonitorController/pkg/controllers.findMonitorByName\n\t/workspace/pkg/controllers/endpointmonitor_util.go:10\ngithub.com/stakater/IngressMonitorController/pkg/controllers.(*EndpointMonitorReconciler).Reconcile\n\t/workspace/pkg/controllers/endpointmonitor_controller.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99"}

{"level":"info","ts":1631257648.136304,"logger":"controllers.EndpointMonitor","msg":"Creating Monitor: test-service.mlsdevcloudfake.com-k8s-cluster-platform","endpointMonitor":"k8s-cluster-platform"}

{"level":"info","ts":1631257648.589671,"logger":"statuscake-monitor","msg":"Monitor Added: 6114024"}
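Reading the stack trace, GetAll() hits the connection reset (statuscake-monitor.go:231), GetByName() then reports the monitor as absent, and the reconciler immediately creates a new one. A minimal Go sketch of that pattern, with simplified names that are not IMC's actual code:

```go
// Simplified illustration only, not IMC's actual code: a failed API call
// ends up looking exactly like "no monitor with this name exists".
package main

import "fmt"

type Monitor struct{ Name string }

// getAll stands in for GetAll(): on "connection reset by peer" it logs the
// error and returns an empty list instead of reporting the failure.
func getAll() []Monitor {
	fmt.Println(`error: Get "https://app.statuscake.com/API/Tests/": read: connection reset by peer`)
	return nil
}

// getByName stands in for GetByName(): with the behavior above it cannot
// tell "not found" apart from "the lookup failed".
func getByName(name string) *Monitor {
	monitors := getAll()
	for i := range monitors {
		if monitors[i].Name == name {
			return &monitors[i]
		}
	}
	return nil
}

func main() {
	if getByName("test-service.mlsdevcloudfake.com-k8s-cluster-platform") == nil {
		// The reconciler takes this branch and creates another test in Statuscake.
		fmt.Println("monitor not found, creating it")
	}
}
```

If that reading is correct, a transport failure is indistinguishable from a genuinely missing monitor, which would explain the duplicates.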

osilva avatar Sep 10 '21 09:09 osilva

I realize we are running IMC on multiple clusters using the same credentials, so it's also possible we are hitting rate limits on the Statuscake API. If so, it would be good to have a random back-off, or a retry, when verifying whether a monitor exists.
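For illustration, a rough sketch of the kind of jittered retry that could wrap that lookup; fetchTests and the retry parameters are invented for the example and are not part of IMC:

```go
// Illustration only: retry the Statuscake lookup with exponential backoff
// plus random jitter, so controllers on several clusters do not retry in
// lockstep after a rate-limit error. fetchTests is a hypothetical stand-in
// for the real GetAll() call.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func fetchTests() ([]string, error) {
	// Stand-in for GET https://app.statuscake.com/API/Tests/.
	return nil, errors.New("read: connection reset by peer")
}

// fetchTestsWithRetry tries a few times, sleeping base*2^attempt plus up to
// 50% random jitter between attempts, and reports the last error if all
// attempts fail.
func fetchTestsWithRetry(attempts int, base time.Duration) ([]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		tests, err := fetchTests()
		if err == nil {
			return tests, nil
		}
		lastErr = err
		sleep := base << i
		sleep += time.Duration(rand.Int63n(int64(sleep / 2)))
		time.Sleep(sleep)
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	if _, err := fetchTestsWithRetry(4, 500*time.Millisecond); err != nil {
		// Instead of creating a monitor here, the controller would log the
		// error and let the next reconciliation try again.
		fmt.Println(err)
	}
}
```

The same idea could also be expressed with the wait.Backoff helpers from k8s.io/apimachinery, which already appear in the stack trace above, so it would not need a new dependency.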

osilva avatar Sep 10 '21 12:09 osilva

We are also experiencing this. We run IMC on multiple clusters, yet only two of them seem to have this issue.

Dadavan avatar Sep 12 '21 09:09 Dadavan

We've been told by Statuscake support that it's likely an API rate-limiting issue. Is it possible to have IMC recognize the failure and retry, or apply a back-off?

osilva avatar Sep 15 '21 09:09 osilva

I think the problem is in the GetAll() function. Maybe it should also return an error; then, if GetAll() fails, the reconciliation loop could continue without creating a new monitor. A new monitor should be created only when the lookup succeeds and returns nil, not when GetAll() itself fails; in that case the controller should just log the error and retry on the next iteration. The catch is that this changes the MonitorService interface, so all of the service types would need to be updated, not only StatusCake. What do you think?

EDIT: This is the exact same issue as #293
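To make that concrete, a rough sketch of the shape such a change could take; the real IMC types and signatures differ, so the names below are illustrative only:

```go
// Rough sketch with illustrative names, not IMC's actual code: the lookup
// path returns an error so the caller can tell "monitor does not exist"
// apart from "could not talk to Statuscake".
package statuscake

import (
	"errors"
	"fmt"
)

type Monitor struct {
	Name string
	URL  string
}

type StatusCakeMonitorService struct{}

// GetAll stands in for the call to https://app.statuscake.com/API/Tests/;
// instead of only logging a transport error, it propagates it to the caller.
func (s *StatusCakeMonitorService) GetAll() ([]Monitor, error) {
	return nil, errors.New("read: connection reset by peer") // simulated failure
}

// GetByName distinguishes "not found" (nil, nil) from "could not check"
// (nil, err).
func (s *StatusCakeMonitorService) GetByName(name string) (*Monitor, error) {
	monitors, err := s.GetAll()
	if err != nil {
		return nil, fmt.Errorf("fetching monitors: %w", err)
	}
	for i := range monitors {
		if monitors[i].Name == name {
			return &monitors[i], nil
		}
	}
	return nil, nil // genuinely absent: safe to create
}

// In the reconciler, an error would then trigger a requeue instead of a create:
//
//	monitor, err := monitorService.GetByName(name)
//	if err != nil {
//		return ctrl.Result{}, err // requeued with backoff by controller-runtime
//	}
//	if monitor == nil {
//		monitorService.Add(endpointMonitor) // create only when the lookup succeeded
//	}
```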

Dadavan avatar Oct 04 '21 14:10 Dadavan

Closing in favor of https://github.com/stakater/IngressMonitorController/issues/293

karl-johan-grahn avatar Feb 01 '23 10:02 karl-johan-grahn