Sync from Git fails after feature flag edited on other pod

Open darrenclark opened this issue 2 months ago • 7 comments

We've deployed Flipt 2.3.1 (though this issue happens on 2.1.0 as well) on Kubernetes with two replicas.

It seems when we update a feature flag, one of the pods stops syncing from Git. I think the specific scenario is:

(using fictional pod names flipt-pod-a and flipt-pod-b)

  1. User edits feature flag via dashboard
  2. The "update flag" request gets load balanced to flipt-pod-a
  3. flipt-pod-a pushes to Git (and also has the correct local state)
  4. On the next Git sync, flipt-pod-b starts failing with a "closing writer: EOF" error

To fix this issue, flipt-pod-b must be restarted.

Some logs:

[Image: screenshot of logs]

darrenclark, Oct 30 '25 16:10

This issue is caused by each Flipt pod maintaining its own local copy of the Git repository. When one pod pushes changes, the other pod's local repo can become out of sync or corrupted, leading to errors like "closing writer: EOF" on git fetch. Restarting the affected pod resets its local repo, but the problem will recur unless the underlying storage is addressed.

Best practice for multi-replica Flipt deployments with GitOps is to use a shared persistent volume for the Git repository, so all pods operate on the same repo state. Alternatively, configure Flipt to use a stateless approach where each sync clones/fetches from the remote without relying on local repo state. Using non-shared volumes (like emptyDir or pod-local PVCs) will almost always result in the kind of corruption and sync errors you're seeing after concurrent writes.

Recent versions of Flipt (v2.3.0+) also support a fetch_policy option, which can make the service more resilient to transient Git failures, but this does not resolve the underlying issue of local repo state divergence between pods.

To fix this, update your deployment so both pods share the same persistent volume for the Git repo path, or consider a stateless Git backend configuration if your workload allows. This will prevent local repo corruption and ensure reliable syncing across all pods.


dosubot[bot], Oct 30 '25 16:10

@dosu I'm not sure if I should be debating with an AI on a GitHub PR or not 😅 but I'll bite..

> Using non-shared volumes (like emptyDir or pod-local PVCs) will almost always result in the kind of corruption and sync errors you're seeing after concurrent writes

I would argue that non-shared volumes should be more resilient to corruption, since only a single process is reading/writing to them.

> Best practice for multi-replica Flipt deployments with GitOps is to use a shared persistent volume for the Git repository, so all pods operate on the same repo state

I would argue the opposite - keeping things separate to reduce blast radius should (in theory) lead to a more reliable system.

darrenclark, Oct 30 '25 16:10

Your reasoning about non-shared volumes makes sense in general: single-process access does reduce the risk of corruption from concurrent writes. However, with Flipt's GitOps model, the main issue is not classic filesystem corruption, but local Git repo state divergence between pods. When each pod has its own local clone, a push from one pod can leave the other pod's repo in a state that fails to fetch or merge cleanly, resulting in errors like "closing writer: EOF" until the pod is restarted and its repo is reset. This isn't about concurrent access to the same volume, but about Git's expectations for repo synchronization and atomicity, which aren't guaranteed when multiple independent clones are being updated asynchronously.

A shared persistent volume ensures all pods operate on the same repo state, preventing these sync errors. The trade-off is that a corrupting event affects all pods, but in practice, Flipt's design doesn't coordinate repo state between pods, so non-shared volumes can actually increase the risk of divergence and operational issues. If blast radius is a concern, another pattern is to run separate Flipt instances per team or environment, each with its own repo and volume. For multi-replica setups sharing the same repo, shared storage or a stateless clone-per-sync approach is the most reliable way to avoid the failure mode you're seeing.


dosubot[bot], Oct 30 '25 16:10

thanks for reporting @darrenclark

does the EOF error happen consistently in this case for flipt-pod-b? or does it only happen some of the time?

markphelps, Oct 30 '25 17:10

It happens every time I've tried.

It also seems to happen across environments. We share the same git repo/branch & use subfolders for dev vs prod - editing a flag in one env causes the other env to start getting the error.

I can confirm it happens with both local & memory backends. Relevant section from my config file:

      storage:
        github-flipt-state:
          backend:
-           path: /var/opt/flipt
-           type: local
+           type: memory
          branch: main
          credentials: github-flipt-state-deploy-key
          poll_interval: 60s
          remote: <redacted>

darrenclark, Oct 30 '25 18:10

@markphelps It seems cloning the Git repo via HTTPS (instead of SSH) fixes the issue 🎉

Some more debugging:

  • Seeing this comment: https://github.com/go-git/go-git/issues/1685#issuecomment-3459040125, I tried the latest go-git version but it didn't seem to fix it
    • Though, will say - I'm not too familiar with Go tooling - so maybe I actually didn't update the dependency properly 😆
  • Added some log statements and discovered the error is coming from here: https://github.com/flipt-io/flipt/blob/1c942c24cda982260b15ee48f3adf2700a2ca76b/internal/storage/git/repository.go#L374-L383

By the way - awesome work with the DEVELOPMENT.md and mage setup - made it super easy to build & run locally

darrenclark, Oct 30 '25 19:10

thank you @darrenclark for the sleuthing! great find. I'll follow up on that issue to see if we can get the EOF resolved. glad there's a workaround for the moment of using HTTPS over SSH when cloning

markphelps, Oct 31 '25 14:10
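
For reference, the HTTPS workaround discussed above might look roughly like this when applied to the storage config shared earlier in the thread. This is only a sketch under assumptions, not a confirmed configuration: the credential name `github-flipt-state-https` and the remote URL are hypothetical placeholders, and the credential it refers to would need to be one that works over HTTPS rather than the SSH deploy key used originally.

```yaml
storage:
  github-flipt-state:
    backend:
      type: memory
    branch: main
    # Hypothetical credential name: must point at a credential usable over HTTPS,
    # since the original SSH deploy key would not apply to an HTTPS remote.
    credentials: github-flipt-state-https
    poll_interval: 60s
    # Placeholder HTTPS remote, replacing the SSH remote that triggered
    # the "closing writer: EOF" error on sync.
    remote: https://github.com/example-org/example-repo.git
```

All other keys are copied unchanged from the config in the thread; only `remote` and `credentials` differ.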