eirini Enhancement: app readiness should depend on all containers

we're playing with automatic sidecar injection from Istio.

looks like Eirini considers the app "Running" if only 1 of the containers is running.

we probably want to change it so that all (non-init) containers must be Ready before CF sees the App as running.
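
A rough sketch of the check being proposed, not Eirini's actual code: `containerStatus` and `allContainersReady` are hypothetical names, and the struct is a trimmed-down stand-in for the per-container status that Kubernetes reports on a pod. In the Kubernetes API, init containers are reported in a separate list (`initContainerStatuses`), so looking only at the regular container statuses already skips them.

```go
package main

// containerStatus is a hypothetical, trimmed-down stand-in for the
// per-container status Kubernetes reports for a pod.
type containerStatus struct {
	Name  string
	Ready bool
}

// allContainersReady reports whether every (non-init) container is Ready.
// An empty list counts as not ready, since nothing has been reported yet.
func allContainersReady(statuses []containerStatus) bool {
	if len(statuses) == 0 {
		return false
	}
	for _, s := range statuses {
		if !s.Ready {
			return false
		}
	}
	return true
}
```

With this, an app whose main container is Ready but whose injected sidecar is not would still be reported as not running.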

would y'all be open to a PR?

cc @tcdowney

#72
https://github.com/cloudfoundry/cf-k8s-networking

Nov 08 '19 23:11 rosenhouse

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/169660862

The labels on this github issue will be updated when the story is started.

Nov 08 '19 23:11 cf-gitbot

Hey - sounds good to us & a PR would be very welcome!

Nov 11 '19 10:11 julz

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/169674530

The labels on this github issue will be updated when the story is started.

Nov 11 '19 10:11 cf-gitbot

We're working on this and related things under the heading of "get CATS passing when Eirini has Istio sidecars"

Today's issue: In Diego, with multiple containers (system-provided Envoy, or user-provided sidecar), if any of the containers (garden peas) crashes, Diego tears down and reschedules the whole pod.

In Kubernetes, this doesn't appear to be the behavior. It's not the default, and it's not even something directly supported in a PodSpec. We'd have to do a bunch of work (extra wiring somehow) to get K8s to mimic the Diego behavior.

It seems the preferred K8s behavior is to restart the crashed container in place, keeping the pod intact.

Does this sound right to folks?

@julz @alex-slynko @JulzDiverse

cc @emalm @zrob

Nov 19 '19 01:11 rosenhouse

Off the top of my head, the first question is whether you actually need the Diego behaviour. If the sidecar crashes, it'll get restarted, at which point either (a) the main container starts working again or (b) the main container fails its health check, is restarted, and works again. Either way the system is back up and running. Is there a case where we need the whole pod to be torn down if a container fails?

Nov 19 '19 09:11 julz

Using liveness checks to determine when to reschedule is actually much better than the Diego codependent behavior for user-provided sidecars.

The user story that comes to mind is when you have a memory-hungry APM agent (in Java or Ruby, for example) running next to a lighter-weight app. If the APM sidecar exceeds its memory limits, ideally we OOM kill the APM and restart it in place without taking down the app.

The interesting situation here is what to do in the absence of a user-provided health check, i.e. what the CC API calls a "process" type health check. Should Eirini provide a liveness probe that confirms that the main process is running?

Nov 19 '19 18:11 cwlbraa
