
Run multiple connectors in a single Deployment

Open ctso opened this issue 6 months ago • 13 comments

What is missing? I'd like to run multiple Twingate connectors for high availability. This is easily accomplished today by creating multiple TwingateConnector resources and using podAntiAffinity and topologySpreadConstraints.

However, this seems less than ideal because the operator will gladly tear down all running instances to apply changes made to the spec, for example when managing multiple connectors with a Helm chart.

Instead, it probably makes more sense to allow the user to specify replicas: 3 on the TwingateConnector and run these under a Deployment with multiple replicas.
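
To make the ask concrete, a hypothetical spec might look like the following (note that replicas is not an existing field on the TwingateConnector CRD; it is shown purely to illustrate the request):

apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: example-connector
spec:
  replicas: 3   # hypothetical field, not currently supported by the operator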

I realize there may be some complexity here because each Twingate connector requires its own API key. Maybe a StatefulSet could help?

Why do we need it? Better high availability. Using a Deployment or StatefulSet (or really anything backed by a ReplicaSet) helps ensure all connectors are online.

ctso avatar Jul 11 '25 00:07 ctso

Hi @ctso ,

Thanks for the feature request. Just to clarify, the TwingateConnector resource is currently backed by a Deployment object. We initially used a standalone Pod, but as you noted, Kubernetes wouldn’t restart it on failure. Switching to a Deployment solved that by ensuring availability via a ReplicaSet.

Regarding the ability to set replicas directly on the TwingateConnector, there’s a deeper discussion on a similar request in the helm chart repo. In short, there are two main reasons we haven’t supported this yet:

  • Provisioning complexity: Each connector instance needs to be provisioned individually in Twingate and requires its own access and refresh key pair. Mapping multiple replicas to their respective credentials in a scalable, error-resilient way is non-trivial.
  • Scaling semantics: Allowing replicas might suggest that connectors are horizontally scalable and invite interest in using HPAs, which isn't something we want to encourage right now. Each connector is a fully stateful service with connections bound to it, so taking one down is a disruptive event. Supporting that safely would require extra logic, such as marking a connector for deletion so it stops accepting new connections and then waiting until it has no remaining connections, which makes this non-trivial. (A rough sketch of what that shutdown handling might involve is shown below.)
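
To illustrate that last point, here is a purely illustrative sketch of the kind of shutdown handling a drainable connector pod might need. The drain endpoint, port, image tag, and timings below are assumptions and do not exist in the connector today:

apiVersion: v1
kind: Pod
metadata:
  name: twingate-connector-drain-example
spec:
  # Give existing tunnels time to wind down before the pod is force-killed.
  terminationGracePeriodSeconds: 600
  containers:
    - name: connector
      image: twingate/connector:1
      lifecycle:
        preStop:
          exec:
            # Hypothetical drain hook: tell the connector to stop accepting
            # new connections, then wait for existing ones to finish.
            command: ["/bin/sh", "-c", "curl -sf http://localhost:9999/drain || true; sleep 300"]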

That said, we completely understand the desire for a smoother HA experience, and we’re continuing to think about how to improve this in the future...

Thanks Eran

ekampf avatar Jul 30 '25 20:07 ekampf

We are noticing this issue as well; lots of "connector disconnected" emails come in with our flow. We could just disable the notifications, but it would be nice to be able to catch connectors that are offline for a while. A simple fix for that might be a way to configure a grace period of being offline before a notification is sent.

Provisioning complexity: Each connector instance needs to be provisioned individually in Twingate and requires its own access and refresh key pair. Mapping multiple replicas to their respective credentials in a scalable, error-resilient way is non-trivial.

One potential solution could be to change the Deployment to a StatefulSet with 2 replicas hard-coded, and then generate 2 access/refresh token pairs that are loaded based on hostname. Alternatively, the operator could handle the pod lifecycle itself, deleting and recreating pods as needed when they go down and injecting the tokens there.
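
To illustrate the first idea, here is a rough sketch (not an existing operator feature) of a StatefulSet where each pod selects its own credential pair from a shared Secret based on its stable hostname. The Secret layout, environment variable names, image tag, and entrypoint are all assumptions:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: twingate-connector
spec:
  serviceName: twingate-connector
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: twingate-connector
  template:
    metadata:
      labels:
        app.kubernetes.io/name: twingate-connector
    spec:
      containers:
        - name: connector
          image: twingate/connector:1
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Each pod gets a stable hostname (twingate-connector-0, -1, ...)
              # and uses it to pick its own pre-provisioned token pair.
              export TWINGATE_ACCESS_TOKEN="$(cat /tokens/${HOSTNAME}-access-token)"
              export TWINGATE_REFRESH_TOKEN="$(cat /tokens/${HOSTNAME}-refresh-token)"
              exec /connectord   # assumed entrypoint of the connector image
          volumeMounts:
            - name: tokens
              mountPath: /tokens
              readOnly: true
      volumes:
        - name: tokens
          secret:
            # Assumed Secret with keys like twingate-connector-0-access-token,
            # twingate-connector-0-refresh-token, and so on per ordinal.
            secretName: twingate-connector-tokens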

devonwarren avatar Aug 22 '25 17:08 devonwarren

@devonwarren not sure how the "connector disconnect" notification is related. Even as a StatefulSet, those are 2 connectors, and if either of them goes offline you'll get a notification...

Can you explain your flow? Connectors are not something you want to be taking up and down frequently...

ekampf avatar Aug 22 '25 19:08 ekampf

@ekampf sorry, it looks like I hadn't paid close attention to the disconnected messages, but it seems like there is a grace period according to them. Looking at it further, I'm wondering if there are networking issues with the cluster, as it's failing the healthcheck somewhat frequently while the other clusters don't seem to be having that issue. I'll have to investigate that further.

Even as a StatefulSet, those are 2 connectors, and if either of them goes offline you'll get a notification...

Yeah, I suppose that's true if those are counted as different connectors within Twingate. The notification grace period I wasn't originally aware of somewhat solves that issue, since they should normally be back up within that time. Thanks!

devonwarren avatar Aug 26 '25 13:08 devonwarren

We need to ensure at least one connector (ideally 2) is available. There should at least be an option for a PDB with minAvailable: 1 when deploying a TwingateConnector resource (like for Karpenter node replacement). It is not good practice to have a single connector pod as the only way to access Kubernetes resources (including the k8s API) from outside the cluster, which is a use case many folks use the wonderful TG operator for. You can understand why operators are nervous when the various apps & services that folks reach through the TG connector have multiple replicas along with PDBs and node anti-affinity settings, so we are just looking for the same level of resilience.

hoopty avatar Sep 16 '25 16:09 hoopty

minAvailable: 1 on a Deployment limited to 1 pod will basically block nodepool updates etc. and doesn't seem like the right solution here. I would define 2 TwingateConnector instances and use affinity to make sure they get scheduled on different nodes in different zones…

ekampf avatar Sep 16 '25 16:09 ekampf

Agreed, just wanting to get guidance on how we should run multiple connectors. For example, we currently use something like:

apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"

hoopty avatar Sep 16 '25 17:09 hoopty

As a follow up for others, here's what I did. Note: It should be tweaked as appropriate for zone/region.

apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-a
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: twingate-connector
              topologyKey: kubernetes.io/hostname
            weight: 100
---
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-b
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: twingate-connector
              topologyKey: kubernetes.io/hostname
            weight: 100

hoopty avatar Sep 16 '25 18:09 hoopty

We need to ensure at least one connector (ideally 2) is available. There should at least be an option for a PDB with minAvailable: 1 when deploying a TwingateConnector resource (like for Karpenter node replacement)

@hoopty In case it helps, you should be able to define PDBs independent of the TwingateConnector CRD as long as the corresponding labels match (as PDBs are independent objects). This isn't as neat as having it be part of the helm chart or twingate-provided CRD, but it should do the job. There are some documented caveats around PDBs for arbitrary pods.

Here's an example also including topologySpreadConstraints:

apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-a
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
---
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-b
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"
  podLabels:
    app.kubernetes.io/name: twingate-connector
  podExtra:
    topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: twingate-connector
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: twingate-connector
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: twingate-connector

Unfortunately this doesn't help for redeployments - the pods are each treated independently for rollout (Recreate) purposes, so you end up with the likelihood of both pods getting killed and replaced at the same time, with downtime while they are both being replaced. So the original issue isn't resolved by this.

MIJOTHY-V2 avatar Sep 17 '25 08:09 MIJOTHY-V2

Change their update schedule property to not overlap?
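
For example, something like the following stagger (illustrative only) would keep the two connectors' daily image checks from coinciding:

apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-a
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 0 * * *"    # connector-a checks for a new image at 00:00
---
apiVersion: twingate.com/v1beta
kind: TwingateConnector
metadata:
  name: clustername-connector-b
spec:
  imagePolicy:
    provider: dockerhub
    schedule: "0 1 * * *"    # connector-b checks an hour later, so at most one restarts at a time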

ekampf avatar Nov 17 '25 15:11 ekampf

Change their update schedule property to not overlap?

This might help if your updates are purely driven by image updates. But if you're e.g. modifying some configuration, you'll still end up with independent rollouts happening, possibly concurrently, resulting in downtime. I'm not sure there is a way around this without coordinating the rollout through some first-class k8s entity like a Deployment, or rolling your own ersatz version of one.

I suppose you could modify the pods one at a time, but it does feel a bit clunky to DIY a rollout by hand.

MIJOTHY-V2 avatar Nov 19 '25 14:11 MIJOTHY-V2