mesh icon indicating copy to clipboard operation
mesh copied to clipboard

Controller should check for orphan shadow services

Open kevinpollet opened this issue 4 years ago • 8 comments

Feature Request

Proposal

To make sure that shadow services exist for services created before Maesh installation or when the controller is down, at startup, the controller list the existing services and create the missing shadow services.

A nice addition could be to check that all shadow services have a corresponding service to check that there are no orphans. e.g. when a service is deleted during controller downtime.

kevinpollet avatar May 11 '20 15:05 kevinpollet

In the past there was a poor mans "audit" that ran on a timer to verify all services were matched up, and that if there was a missing shadow service, trigger it as a new create event.

Did we want to go this route?

dtomcej avatar May 11 '20 15:05 dtomcej

I think doing this check at startup is enough. Maybe that's not necessary if the shadow service manager is idempotent (#562). This will ensure that orphan services will not disrupt the system.

One problem with the orphan shadow services is that they hold entry points for nothing.

kevinpollet avatar May 11 '20 15:05 kevinpollet

Just a thought, maybe we could use ownerReferences to make sure that shadow service are properly deleted. wdyt?

kevinpollet avatar May 11 '20 16:05 kevinpollet

oooo That is a neat idea. Letting kubernetes handle it automatically? I like it.

dtomcej avatar May 11 '20 16:05 dtomcej

Yep, that's the idea. If a shadow service has a user service as ownerReferences, it should be garbage collected when the user service is removed but, we still have to keep the TCP and UDP tables in sync to avoid port leaks. Maybe it will help us when the TCP and UDP tables will be computed at startup.

Btw, I think that we should try to handle orphan shadow services gracefully to avoid port leaks which will disrupt the system.

kevinpollet avatar May 12 '20 07:05 kevinpollet

@kevinpollet I did a test with the ownerReferences property. It works great but there's a side effect. We are relying on service deletion events to update state tables but when we receive this event the shadow service can already be deleted. And so, we don't know which mapping should be removed. There are solutions though:

  • Update the PortMapper and expose a RemoveAll method which takes a name and a namespace as parameter. It will then remove all port mapping for the given service. And so, we no longer need to inspect the shadow service to know which port should be unmapped.
  • Listen for ShadowService events (create, update, delete) to maintain the state tables.
  • Stop maintaining two distinct states, build the state table from the shadow services when we need them.

The first solution is the easiest to implement and the less disruptive. The second adds one more moving piece in the system and will introduce some delay. The third solution simplify the port mapping management but has a cost. We would have to build this table before each build of the topology.

I tend to prefer the 3rd option or eventually the 1st. @kevinpollet, @dtomcej Any opinion on this?

jspdown avatar Jul 17 '20 09:07 jspdown

@jspdown I think that if we use references, we lose a bit of control over the deletion events. We have to just "trust" that things will be handled properly by k8s. We would have to run an "audit" loop to verify that k8s has in fact, removed the objects it was supposed to.

However, isn't the goal to remove the persistent state table entirely, and use the shadow services as the source of truth?

dtomcej avatar Jul 20 '20 18:07 dtomcej

Using ownerReferences is not an option for us. Shadow services and user services are in different namespaces. The kubernetes documentation on garbage collection says:

Cross-namespace owner references are disallowed by design. This means:

  1. Namespace-scoped dependents can only specify owners in the same namespace, and owners that are cluster-scoped.
  2. Cluster-scoped dependents can only specify cluster-scoped owners, but not namespace-scoped owners.

https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/

jspdown avatar Sep 17 '20 10:09 jspdown