
Single call to System.exit(1) in Java app kills all healthy instances in scaled set

Open maeddes opened this issue 4 years ago • 4 comments

Hi, I've recently been seeing behaviour in cf-for-k8s that I don't think is intended.

What I normally demo is a simple application that I can kill via a REST call. I usually scale it up beforehand to show how CF isolates the broken instance and routes traffic to the healthy ones (preserving HA for the app and normal behaviour for the end user). However, if I kill one instance, it seems to take down the healthy instances as well (thus exposing an outage to the end user).
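
For reference, the scale-up beforehand is just the standard cf CLI call (app name and instance count here match the cf apps output below):

$ cf scale simplecode -i 3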

Example:

$ cf apps
Getting apps in org system / space dev as admin...
OK

name         requested state   instances   memory   disk   urls
simplecode   started           3/3         1G       1G     simplecode.apps.20.54.149.63.xip.io

and I am watching:

Every 1.0s: kubectl get all -n cf-workloads               macbookmhs: Mon Jul 13 21:39:36 2020

NAME                              READY   STATUS    RESTARTS   AGE
pod/simplecode-dev-f9adad988d-0   2/2     Running   3          168m
pod/simplecode-dev-f9adad988d-1   2/2     Running   2          166m
pod/simplecode-dev-f9adad988d-2   2/2     Running   2          166m

I trigger the call:

$ curl simplecode.apps.20.54.149.63.xip.io/fail
upstream connect error or disconnect/reset before headers. reset reason: connection termination

which should only affect the one instance that got killed. Instead I get:

Every 1.0s: kubectl get all -n cf-workloads               macbookmhs: Mon Jul 13 21:38:19 2020
NAME                              READY   STATUS             RESTARTS   AGE
pod/simplecode-dev-f9adad988d-0   1/2     Error              3          169m
pod/simplecode-dev-f9adad988d-1   1/2     CrashLoopBackOff   2          167m
pod/simplecode-dev-f9adad988d-2   1/2     Error              2          167m

It eventually recovers.

I added a logging statement to the method that causes the crash, right before the JVM is killed:

@GetMapping("/fail")
String fail() {
    logger.info("Call to failing method on instance: " + getInstanceId());
    System.exit(1);
    return "something";
}
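
The getInstanceId() helper comes from the linked repo and isn't shown here. As a rough sketch only, one way it could be implemented, assuming the instance ID in the log lines is simply the pod hostname (with the CF_INSTANCE_INDEX environment variable as a fallback):

private String getInstanceId() {
    try {
        // In cf-for-k8s each app instance runs in its own pod, so the container
        // hostname matches the pod name seen in the logs (assumption).
        return java.net.InetAddress.getLocalHost().getHostName();
    } catch (java.net.UnknownHostException e) {
        // Fallback: Cloud Foundry traditionally exposes the instance index as an env var.
        String index = System.getenv("CF_INSTANCE_INDEX");
        return index != null ? index : "unknown";
    }
}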

If I now invoke this method and check the logs:

$ stern simplecode -n cf-workloads | grep Call
simplecode-dev-dbcb3c95b1-2 opi 2020-07-16 16:42:01.012  INFO 1 --- [io-8080-exec-10] d.m.simplecode.SimplecodeApplication     : Call to failing method on instance: simplecode-dev-dbcb3c95b1-2
simplecode-dev-dbcb3c95b1-0 opi 2020-07-16 16:42:04.166  INFO 1 --- [nio-8080-exec-2] d.m.simplecode.SimplecodeApplication     : Call to failing method on instance: simplecode-dev-dbcb3c95b1-0
simplecode-dev-dbcb3c95b1-1 opi 2020-07-16 16:42:07.355  INFO 1 --- [nio-8080-exec-1] d.m.simplecode.SimplecodeApplication     : Call to failing method on instance: simplecode-dev-dbcb3c95b1-1

So even though it is only a single call, it gets sent to each of the healthy instances in turn after the first instance crashes, until there are no instances left.

The last test was done with this commit:

$ git show
Thu Jul 16 08:49:27 2020 -0700 a228c56 (HEAD -> master, origin/master, origin/develop, origin/HEAD) Merge pull request #285 from akhinos/fix-pr-template  [Paul Warren]

On Microsoft AKS / Kubernetes v1.16.10:

$ kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-20647587-vmss000000   Ready    agent   108m   v1.16.10   10.240.0.4    <none>        Ubuntu 16.04.6 LTS   4.15.0-1089-azure   docker://3.0.10+azure

The test was done with this code: https://github.com/maeddes/simplecode

I pushed it directly from source, so there is no pre-built jar. If you need any other info, please ping me.

maeddes avatar Jul 16 '20 16:07 maeddes

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/173836799

The labels on this github issue will be updated when the story is started.

cf-gitbot avatar Jul 16 '20 16:07 cf-gitbot

@maeddes I'm having difficulty pushing the app. I've checked out the repo and run cf push simplecode from the root directory, but I consistently get staging timeouts while it is downloading the Maven packages. Is there anything I'm doing obviously wrong here?

The interesting thing is that it seems to keep getting hung up on this dependency:

Downloading from central: https://repo.maven.apache.org/maven2/org/apache/logging/log4j/log4j-api/2.13.3/log4j-api-2.13.3.jar

Edit: I'm able to deploy it now. Not sure what was up there.

tcdowney avatar Jul 17 '20 18:07 tcdowney

There's been some discussion of this issue on Slack. As people suspected, it is due to automatic retries. It looks like, by default, Envoy retries the request 2 times on the following retry_on criteria:

"retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",

You can see the 2 retries in action by scaling the app up to five instances: the original request plus its 2 retries each land on a different instance (because of the previous_hosts retry host predicate), so only three of them end up crashing.

@rosenhouse pointed out that Gorouter purposefully does not retry on another backend in cases like this, which explains the discrepancy.

We should be able to configure this behavior through Istio on the VirtualService, so I'll make a story in our backlog to track this (a rough sketch of what that could look like follows the config dump below).

Every 0.5s: kubectl get pods -n cf-workloads                                                                                                                                                             canary: Fri Jul 17 20:50:24 2020

NAME                        READY   STATUS             RESTARTS   AGE
simplecode-s-4fdcf1edfd-0   1/2     CrashLoopBackOff   3          22m
simplecode-s-4fdcf1edfd-1   1/2     CrashLoopBackOff   4          16m
simplecode-s-4fdcf1edfd-2   1/2     Error              4          16m
simplecode-s-4fdcf1edfd-3   2/2     Running            1          8m55s
simplecode-s-4fdcf1edfd-4   2/2     Running            0          8m55s

Relevant config from an Envoy config_dump:

"name": "simplecode.apps.tim2.routing.lol:80",
        "domains": [
         "simplecode.apps.tim2.routing.lol",
         "simplecode.apps.tim2.routing.lol:80"
        ],
        "routes": [
         {
          "match": {
           "prefix": "/"
          },
          "route": {
           "cluster": "outbound|8080||s-a7ed36b5-b90a-4423-87cc-af268dedae7a.cf-workloads.svc.cluster.local",
           "timeout": "0s",
           "retry_policy": {
            "retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
            "num_retries": 2,
            "retry_host_predicate": [
             {
              "name": "envoy.retry_host_predicates.previous_hosts"
             }
            ],
            "host_selection_retry_max_attempts": "5",
            "retriable_status_codes": [
             503
            ]
           },
           "max_grpc_timeout": "0s"
          },
          "metadata": {
           "filter_metadata": {
            "istio": {
             "config": "/apis/networking/v1alpha3/namespaces/cf-workloads/virtual-service/vs-24d9e2f6d5cd45e3a7ad83b34f72353375af68b25292f0f8ac8dd19d73ec3bc2"
            }
           }
          },
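
As a rough sketch of the fix, the retry behaviour could be overridden per route via Istio's retries (HTTPRetry) field on the VirtualService. Field names below follow the networking.istio.io/v1alpha3 API, but the metadata, hostname, and destination values are made up for illustration, and cf-k8s-networking may overwrite manual edits when it reconciles the generated resource:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vs-simplecode-sketch      # real generated names are content hashes, as seen above
  namespace: cf-workloads
spec:
  hosts:
  - simplecode.apps.example.com   # illustrative route hostname
  http:
  - route:
    - destination:
        host: simplecode-svc.cf-workloads.svc.cluster.local  # illustrative Service name
        port:
          number: 8080
    retries:
      attempts: 0                 # 0 disables Envoy's automatic retries for this route

Alternatively, the retryOn conditions could be narrowed (for example, dropping connect-failure) instead of disabling retries entirely.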

tcdowney avatar Jul 17 '20 20:07 tcdowney

Here's the cf-k8s-networking story: #173861292

tcdowney avatar Jul 17 '20 21:07 tcdowney