Run multiple connectors in a single Deployment
What is missing?
I'd like to run multiple Twingate connectors for high availability. This is easily accomplished now by creating multiple TwingateConnector resources and using some podAntiAffinity and topologySpreadConstraints rules.
However, this seems less than ideal because the operator will gladly tear down all running instances to apply changes made to the spec, for example when managing multiple connectors with a Helm chart.
Instead, it probably makes more sense to allow the user to specify replicas: 3 on the TwingateConnector and run these under a Deployment with multiple replicas.
I realize there may be some complexity here because each Twingate connector requires its own API key. Maybe a StatefulSet could help?
Why do we need it?
Better high availability. Using a Deployment or StatefulSet (or really anything backed by a ReplicaSet) helps ensure all connectors are online.
Hi @ctso ,
Thanks for the feature request.
Just to clarify, the TwingateConnector resource is currently backed by a Deployment object. We initially used a standalone Pod, but as you noted, Kubernetes wouldn’t restart it on failure. Switching to a Deployment solved that by ensuring availability via a ReplicaSet.
Regarding the ability to set replicas directly on the TwingateConnector, there’s a deeper discussion on a similar request in the helm chart repo.
In short, there are two main reasons we haven’t supported this yet:
- Provisioning complexity: Each connector instance needs to be provisioned individually in Twingate and requires its own access and refresh key pair. Mapping multiple replicas to their respective credentials in a scalable, error-resilient way is non-trivial.
- Scaling semantics: Allowing `replicas` might suggest that connectors are horizontally scalable and invite the use of HPAs, which isn't something we want to encourage right now. Each connector is a fully stateful service with connections bound to it, so taking one down is a disruptive event that requires extra logic to handle gracefully (like being able to mark a connector for deletion so it stops accepting new connections, and then waiting until its existing connections have drained), which makes this non-trivial.
That said, we completely understand the desire for a smoother HA experience, and we’re continuing to think about how to improve this in the future...
Thanks Eran
We are noticing this issue as well; lots of "connector disconnected" emails come in with our flow. We could just disable the notifications, but it would be nice to be able to catch connectors that have been offline for a while. A simple fix might be a way to configure a grace period of being offline before a notification is sent.
Provisioning complexity: Each connector instance needs to be provisioned individually in Twingate and requires its own access and refresh key pair. Mapping multiple replicas to their respective credentials in a scalable, error-resilient way is non-trivial.
One potential solution could be to change the Deployment to a StatefulSet with 2 replicas hard-coded and then just generate 2 access/refresh tokens loaded in based on hostname. Alternatively you could switch to the operator handling the pod lifecycle itself by deleting and recreating pods as needed if they go down and injecting the tokens there
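A rough sketch of the first idea, purely to illustrate the shape of it (the operator does not support this today, and the secret layout, key names, and the connector entrypoint path `/connectord` are all assumptions, not documented behavior):

```yaml
# Hypothetical: a hard-coded 2-replica StatefulSet where each pod picks its
# pre-provisioned token pair based on its stable ordinal hostname.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: twingate-connector
spec:
  serviceName: twingate-connector
  replicas: 2  # hard-coded, as suggested above
  selector:
    matchLabels:
      app.kubernetes.io/name: twingate-connector
  template:
    metadata:
      labels:
        app.kubernetes.io/name: twingate-connector
    spec:
      containers:
        - name: connector
          image: twingate/connector:1
          command: ["/bin/sh", "-c"]
          args:
            - |
              # StatefulSet hostnames end in the ordinal:
              # twingate-connector-0, twingate-connector-1, ...
              ORDINAL="${HOSTNAME##*-}"
              export TWINGATE_ACCESS_TOKEN="$(cat /etc/twingate/access-token-"$ORDINAL")"
              export TWINGATE_REFRESH_TOKEN="$(cat /etc/twingate/refresh-token-"$ORDINAL")"
              exec /connectord  # entrypoint path assumed
          volumeMounts:
            - name: tokens
              mountPath: /etc/twingate
              readOnly: true
      volumes:
        - name: tokens
          secret:
            # One Secret holding both pre-provisioned pairs, e.g. keys
            # access-token-0, refresh-token-0, access-token-1, refresh-token-1
            secretName: twingate-connector-tokens
```

The operator (or the user) would still need to provision both connectors in Twingate up front and populate the Secret, which is the provisioning-complexity problem described above.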
@devonwarren I'm not sure how the "connector disconnected" notification is related. Even as a StatefulSet, those are 2 connectors, and if either of them goes offline you'll get a notification...
Can you explain your flow? Connectors are not something you want to be taking up and down frequently...
@ekampf sorry, it looks like I hadn't paid close attention to the disconnected messages, but it seems there is a grace period according to them. Looking at it further, I'm wondering if there are networking issues with the cluster, as it's failing the healthcheck somewhat frequently, but the other clusters don't seem to be having that issue. I'll have to investigate that further.
Even as a StatefulSet - those are 2 connectors and if any of them goes offline you'll get. a notification...
Yeah, I suppose that's true if those are counted as different connectors within Twingate. I think the notification grace period I wasn't originally aware of somewhat solves that issue as they should be back up within that time normally, thanks!
We need to ensure at least one connector (ideally 2) is available. There should at least be an option for a PDB with minAvailable: 1 when deploying a TwingateConnector resource (e.g. for Karpenter node replacement). It is not good practice to have a single connector pod as the only way to access Kubernetes resources (including the k8s API) from outside the cluster, which is a use case many folks use the wonderful TG operator for. You can understand why operators are nervous when the various apps and services folks reach through the TG connector have multiple replicas, PDBs, and node anti-affinity settings; we are just looking for the same level of resilience.
minAvailable: 1 on a deployment limited to 1 pod will basically block nodepool updates etc. and doesn't seem like the right solution here. I would define 2 TwingateConnector instances and use affinity to make sure they get scheduled on different nodes in different zones…
Agreed, just wanting to get guidance on how we should run multiple connectors. For example, we currently use something like:
```yaml
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
```
As a follow-up for others, here's what I did. Note: it should be tweaked as appropriate for zone/region.
```yaml
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-a
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: twingate-connector
              topologyKey: kubernetes.io/hostname
            weight: 100
---
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-b
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: twingate-connector
              topologyKey: kubernetes.io/hostname
            weight: 100
```
We need to ensure at least one connector (ideally 2) is available. There should at least have an option for PDB with minAvailable: 1 when deploying a TwingateConnector resource (like for karpenter node replacement)
@hoopty In case it helps, you should be able to define PDBs independently of the TwingateConnector CRD as long as the corresponding labels match (since PDBs are independent objects). This isn't as neat as having it be part of the Helm chart or Twingate-provided CRD, but it should do the job. There are some documented caveats around PDBs for arbitrary pods.
Here's an example also including topologySpreadConstraints:
```yaml
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-a
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
---
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-b
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: twingate-connector
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: twingate-connector
```
Unfortunately this doesn't help for redeployments: each pod is treated independently for rollout (Recreate) purposes, so both pods are likely to be killed and replaced at the same time, with downtime while both are being replaced. So the original issue isn't resolved by this.
Change their update schedule property to not overlap?
Change their update schedule property to not overlap?
This might help if your updates are purely driven by image updates. But if you're e.g. modifying some configuration, you'll still end up with independent rollouts possibly happening concurrently, resulting in downtime. I'm not sure there is a way around this without coordinating the rollout through a first-class Kubernetes object like a Deployment, or rolling your own ersatz version of one.
I suppose you could modify the pods one at a time, but it does feel a bit clunky to DIY a rollout by hand.
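For what it's worth, the one-at-a-time approach can be scripted; this is a rough, cluster-dependent sketch rather than a supported workflow, and it assumes the pods carry the `app.kubernetes.io/name=twingate-connector` label used in the examples above:

```shell
#!/bin/sh
# Restart connector pods one at a time, waiting for the replacement to become
# Ready before touching the next one. Assumes the operator's Deployment will
# recreate each deleted pod.
set -eu
SELECTOR="app.kubernetes.io/name=twingate-connector"
for pod in $(kubectl get pods -l "$SELECTOR" -o name); do
  kubectl delete "$pod" --wait=true
  # Wait until every connector pod (including the replacement) is Ready again
  kubectl wait --for=condition=Ready pods -l "$SELECTOR" --timeout=180s
done
```

This still isn't a real coordinated rollout (no surge capacity, no rollback), which is why a Deployment-level mechanism would be preferable.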