cf-for-k8s
Single call to System.exit(1) in Java app kills all healthy instances in scaled set
Hi, I've recently been seeing a behavior in cf-for-k8s that I don't think is intended.
What I normally demo is a simple application that I can kill via a REST call. I scale it up beforehand to show how CF isolates the broken instance and routes traffic to the healthy ones (preserving HA for the app and normal behaviour for the end user). However, if I kill one instance, it seems to take down the healthy instances as well (thus exposing an outage to the end user).
Example:
$ cf apps
Getting apps in org system / space dev as admin...
OK
name         requested state   instances   memory   disk   urls
simplecode   started           3/3         1G       1G     simplecode.apps.20.54.149.63.xip.io
and I am watching:
Every 1.0s: kubectl get all -n cf-workloads macbookmhs: Mon Jul 13 21:39:36 2020
NAME                              READY   STATUS    RESTARTS   AGE
pod/simplecode-dev-f9adad988d-0   2/2     Running   3          168m
pod/simplecode-dev-f9adad988d-1   2/2     Running   2          166m
pod/simplecode-dev-f9adad988d-2   2/2     Running   2          166m
I trigger the call:
$ curl simplecode.apps.20.54.149.63.xip.io/fail
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This should only affect the one instance that got killed; instead I get:
Every 1.0s: kubectl get all -n cf-workloads macbookmhs: Mon Jul 13 21:38:19 2020
NAME                              READY   STATUS             RESTARTS   AGE
pod/simplecode-dev-f9adad988d-0   1/2     Error              3          169m
pod/simplecode-dev-f9adad988d-1   1/2     CrashLoopBackOff   2          167m
pod/simplecode-dev-f9adad988d-2   1/2     Error              2          167m
It eventually recovers.
I added a logging statement to the method that causes the crash, right before the JVM is killed:
@GetMapping("/fail")
String fail() {
    // logger and getInstanceId() are defined elsewhere in the application
    logger.info("Call to failing method on instance: " + getInstanceId());
    System.exit(1);     // kills the JVM of this instance
    return "something"; // never reached
}
If I now invoke this method and check the logs:
$ stern simplecode -n cf-workloads | grep Call
simplecode-dev-dbcb3c95b1-2 opi 2020-07-16 16:42:01.012 INFO 1 --- [io-8080-exec-10] d.m.simplecode.SimplecodeApplication : Call to failing method on instance: simplecode-dev-dbcb3c95b1-2
simplecode-dev-dbcb3c95b1-0 opi 2020-07-16 16:42:04.166 INFO 1 --- [nio-8080-exec-2] d.m.simplecode.SimplecodeApplication : Call to failing method on instance: simplecode-dev-dbcb3c95b1-0
simplecode-dev-dbcb3c95b1-1 opi 2020-07-16 16:42:07.355 INFO 1 --- [nio-8080-exec-1] d.m.simplecode.SimplecodeApplication : Call to failing method on instance: simplecode-dev-dbcb3c95b1-1
So even though it is only a single call, it gets re-sent to each remaining healthy instance after the first one crashes, until there are no instances left.
The last test was done with:
$ git show
Thu Jul 16 08:49:27 2020 -0700 a228c56 (HEAD -> master, origin/master, origin/develop, origin/HEAD) Merge pull request #285 from akhinos/fix-pr-template [Paul Warren]
On Microsoft AKS / Kubernetes v1.16.10:
$ kubectl get nodes -o wide
NAME                                STATUS   ROLES   AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-20647587-vmss000000   Ready    agent   108m   v1.16.10   10.240.0.4    <none>        Ubuntu 16.04.6 LTS   4.15.0-1089-azure   docker://3.0.10+azure
The test was done with this code: https://github.com/maeddes/simplecode
I pushed it directly from source, so there is no pre-built jar. If you need any other info, please ping me.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/173836799
The labels on this github issue will be updated when the story is started.
@maeddes I'm having difficulty pushing the app. I've checked out the repo and run cf push simplecode from the root directory, but I consistently get staging timeouts while it is downloading the Maven packages. Anything I'm doing obviously wrong here?
Interestingly, it seems to keep getting hung up on this dependency:
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/logging/log4j/log4j-api/2.13.3/log4j-api-2.13.3.jar
Edit: I'm able to deploy it now. Not sure what was up there.
There's been some discussion of this issue on Slack. As people suspected, it is due to automatic retries. It looks like, by default, Envoy retries the request 2 times on the following retry_on criteria:
"retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
You can see the 2 retries in action by scaling the app up to five instances: only three of them will end up crashing, since the original request plus two retries hits three instances in total.
@rosenhouse pointed out that Gorouter purposefully does not retry against another backend in cases like this, which explains the discrepancy.
We should be able to configure this behavior through Istio on the VirtualService, so I'll make a story in our backlog to track it (a rough sketch of what such an override could look like follows the config dump below).
Every 0.5s: kubectl get pods -n cf-workloads canary: Fri Jul 17 20:50:24 2020
NAME                        READY   STATUS             RESTARTS   AGE
simplecode-s-4fdcf1edfd-0   1/2     CrashLoopBackOff   3          22m
simplecode-s-4fdcf1edfd-1   1/2     CrashLoopBackOff   4          16m
simplecode-s-4fdcf1edfd-2   1/2     Error              4          16m
simplecode-s-4fdcf1edfd-3   2/2     Running            1          8m55s
simplecode-s-4fdcf1edfd-4   2/2     Running            0          8m55s
Relevant config from an Envoy config_dump:
"name": "simplecode.apps.tim2.routing.lol:80",
"domains": [
"simplecode.apps.tim2.routing.lol",
"simplecode.apps.tim2.routing.lol:80"
],
"routes": [
{
"match": {
"prefix": "/"
},
"route": {
"cluster": "outbound|8080||s-a7ed36b5-b90a-4423-87cc-af268dedae7a.cf-workloads.svc.cluster.local",
"timeout": "0s",
"retry_policy": {
"retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
"num_retries": 2,
"retry_host_predicate": [
{
"name": "envoy.retry_host_predicates.previous_hosts"
}
],
"host_selection_retry_max_attempts": "5",
"retriable_status_codes": [
503
]
},
"max_grpc_timeout": "0s"
},
"metadata": {
"filter_metadata": {
"istio": {
"config": "/apis/networking/v1alpha3/namespaces/cf-workloads/virtual-service/vs-24d9e2f6d5cd45e3a7ad83b34f72353375af68b25292f0f8ac8dd19d73ec3bc2"
}
}
},
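For reference, here is a minimal sketch of what such an override could look like, assuming we end up tuning the retry policy on the app's generated VirtualService in cf-workloads (the resource name, hostname, and backend service below are illustrative placeholders, not the actual generated values):

# Hypothetical sketch: turn off HTTP retries on the app route so a request
# that crashes its backend is not replayed against the remaining instances.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: vs-simplecode                 # placeholder; the real resource has a generated hash name
  namespace: cf-workloads
spec:
  hosts:
  - simplecode.apps.example.com       # placeholder for the app's route
  http:
  - route:
    - destination:
        host: s-example.cf-workloads.svc.cluster.local   # placeholder backend service
        port:
          number: 8080
    retries:
      attempts: 0                     # overrides Envoy's default of 2; 0 disables retries

Whether we disable retries entirely or only narrow the retry_on conditions is a decision for the story below.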
Here's the cf-k8s-networking story: #173861292