atlantis Highly available cluster with multiple nodes

We are trying to set up a highly available Atlantis cluster with multiple nodes for our production environment, and we are currently testing with two nodes behind a load balancer. To keep the same data and status on both nodes, we deployed the Atlantis data folder on a common file share (Azure Files) and mounted this share on both nodes. Unfortunately, both nodes start to fail and throw the application exceptions that I attached.

Questions: Can the same set of data files be shared among multiple Atlantis server instances as we envisioned? Is this issue due to a specific file locking mechanism in Atlantis? Can it be fixed with a small code change, or would it require a larger effort? We intend to put development effort into it if it is easily achievable. Generally, what is the advice/best practice for a highly available Atlantis environment with multiple nodes?

AtlantisException

May 10 '21 11:05 tapaszto

Atlantis was not designed to be set up with multiple nodes. I think this will require a significant amount of code changes to achieve. Basically, you will have to replace boltdb with some distributed DB and make tons of code changes to be able to sync properly, keep status, etc.

Usually, Atlantis users do not have highly available Atlantis servers; they have one big instance or multiple instances running different webhook integrations (maybe behind the same LB).

Now, about the reason for having such a setup: why does it need to be HA? Infra as code is not a service, so it does not have service dependencies, meaning it does not need to be "up".

May 10 '21 17:05 jamengual

As Pepe mentioned, our reliance on BoltDB is really the limiting factor here. Bolt is intended to be used as an embedded database for applications and cannot be safely shared between processes. There have been a few PR discussions around creating a unified abstraction over database access to allow for pluggable database providers, but to my knowledge no work has been done yet.

I believe it would be possible to run multiple atlantis instances with project configuration to limit each instance to only handling a subset of files but it is not possible to run multiple instances that function as one server.

May 10 '21 23:05 acastle

Hi @jamengual & @acastle,

As I can see there is a Locker interface and it's implemented by boltdb.go, this persists the state into atlantis.db file. Can we achieve our goal by creating a new implementation of the Locker interface which connects to a distributed DB like Azure Cosmos DB? Are there any other server state files besides atlantis.db? What other tasks are required besides the new implementation of the Locker interface? E.g. new configuration settings, initiating the new implementation instead of the current one according to specific server setting, etc. Can you provide us a more granular work item list please? We are trying to have a better understanding to be able to estimate the required effort of this development.
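
To make the question more concrete, here is a rough sketch of the kind of pluggable backend we have in mind. It is written in Python purely for illustration; `LockBackend`, `InMemoryBackend` and `new_backend` are hypothetical names and do not reflect the actual Atlantis Locker interface or its method signatures. A Cosmos DB or other distributed implementation would be registered next to the in-memory one and selected by a server setting.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class LockBackend(ABC):
    """Hypothetical storage interface; NOT the real Atlantis Locker API."""

    @abstractmethod
    def try_lock(self, project: str, pull: int) -> bool:
        """Return True if `pull` holds the lock on `project` after the call."""

    @abstractmethod
    def unlock(self, project: str) -> Optional[int]:
        """Release the lock and return the pull number that held it, if any."""


class InMemoryBackend(LockBackend):
    """Stand-in for an embedded store; state lives in one process only."""

    def __init__(self) -> None:
        self._locks: Dict[str, int] = {}

    def try_lock(self, project: str, pull: int) -> bool:
        # setdefault keeps the existing holder if the project is already locked.
        holder = self._locks.setdefault(project, pull)
        return holder == pull

    def unlock(self, project: str) -> Optional[int]:
        return self._locks.pop(project, None)


# A distributed implementation would be added to this registry.
BACKENDS = {"memory": InMemoryBackend}


def new_backend(kind: str) -> LockBackend:
    """Instantiate the backend named by a (hypothetical) server setting."""
    try:
        return BACKENDS[kind]()
    except KeyError:
        raise ValueError(f"unknown lock backend: {kind!r}") from None
```

The point of the registry is that the rest of the server would only depend on `LockBackend`, so switching implementations becomes a configuration change.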

May 11 '21 12:05 tapaszto

This is a lot of work just to itemize the needed changes and right now this is out of scope for us.

May 12 '21 16:05 jamengual

https://github.com/runatlantis/atlantis/issues/265#issuecomment-481730130 talks about some of the work. The locker isn't the hard part. It's the reliance on the filesystem for storing plans and for knowing which PRs are in progress.

May 13 '21 00:05 lkysow

The docs say:

Atlantis has no external database. Atlantis stores Terraform plan files on disk. If Atlantis loses that data in between a plan and apply cycle, then users will have to re-run plan. Because of this, you may want to provision a persistent disk for Atlantis.

I set up EFS and specified the mount as ATLANTIS_DATA_DIR. My first instance started fine, but when I made some other changes, Fargate started the second instance before the first instance was killed, and the second one failed with

Error: initializing server: starting BoltDB: timeout (a possible cause is another Atlantis instance already running)

So my question is: can a BoltDB created by instance "A" on EFS be picked up and used by instance "B"?

I think we can make Fargate completely kill the old container before starting the new one. If so, all the locks will be available to the new instance, so devs don't have to "re-plan". But will Bolt have issues with that design?
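
For context on the timeout above: as far as I know, BoltDB takes an exclusive advisory lock (flock) on its database file when it opens it and gives up after a timeout if another process still holds that lock. The sketch below reproduces that pattern in Python on a local file. The function names and timeout values are made up for illustration, it is not Atlantis or BoltDB code, and lock behaviour on network filesystems such as EFS may differ from a local disk.

```python
import fcntl
import os
import tempfile
import time


def open_exclusive(path, timeout=1.0, poll=0.05):
    """Open `path` and take an exclusive advisory lock, or raise TimeoutError."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt; fails while another open file holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"timeout waiting for lock on {path}")
            time.sleep(poll)


def demo():
    """Instance A holds the lock; instance B times out until A has exited."""
    path = os.path.join(tempfile.mkdtemp(), "atlantis.db")
    a = open_exclusive(path)                # "instance A" starts
    try:
        b = open_exclusive(path, timeout=0.2)   # "instance B" starts too early
        os.close(b)
        b_started_early = True
    except TimeoutError:
        b_started_early = False
    os.close(a)                             # A is stopped, lock is released
    b = open_exclusive(path, timeout=0.2)   # B can now open the same file
    os.close(b)
    return b_started_early
```

In this sketch the second opener only succeeds once the first has closed the file, which matches the idea of stopping the old container before starting the new one.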

Jul 30 '21 15:07 jasonrberk

We currently run single atlantis instance, but I landed here, because I was considering Provider Plugin Cache for Atlantis and it explicitly mentions that

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

so I got interested whether it is possible to run multiple replicas of atlantis and whether the cache should be per replica or not.

The answer for me is that I don't need to consider the multi-instance scenario just yet. I think I might not be the only one, and it would be worth mentioning in the Atlantis docs that it is expected to run just a single Atlantis replica.

EDIT: I am also wondering how one can run Atlantis as Kubernetes Deployment, where it is not guaranteed, that there will always be just a single replica.

Apr 28 '22 14:04 dohnto

Hi @jamengual & @acastle,

I would like to follow up on this topic, as the Redis locking DB is available now. Referring to my original question, is it feasible to use Atlantis with multiple nodes as of now? I can envision a two tenant "cluster" environment with an active and a passive node, where the locking DB is hosted in Azure Redis and the working directory is on a shared drive. Only one node is active at any time in order to avoid interference; a load balancer (e.g. Azure Traffic Manager) would monitor the active node, and the nodes could be swapped if the active one has any issue. Is this design feasible?

Nov 03 '22 14:11 tapaszto

So to have HA with Atlantis using Redis, you still need a way to share the Atlantis data dir between containers. If you do that, you can have active-active containers, and some people are already running like that.

Nov 03 '22 22:11 jamengual

Sharing the data dir is easily achievable in our Azure environment. But won't we have any issues when multiple active nodes are writing the same data dir files? Does the current Atlantis design exclude this?

Nov 03 '22 23:11 tapaszto

No, because the lock is now on Redis (if you enable it).
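
A minimal sketch of why a shared lock store helps here: every instance asks the same store for a project lock, and an atomic set-if-absent (the semantics of Redis `SET key value NX`) lets only one pull request win. `FakeRedis` is an in-memory stand-in so the example runs without a server, and the key format and class names are assumptions for illustration, not how Atlantis actually lays out its data in Redis. It only covers locks, not the files in the data dir.

```python
import threading


class FakeRedis:
    """In-memory stand-in for a shared Redis server (illustration only)."""

    def __init__(self):
        self._data = {}
        self._mu = threading.Lock()

    def set_nx(self, key, value):
        """Set `key` only if it is absent, atomically; True if it was set."""
        with self._mu:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._mu:
            return self._data.get(key)

    def delete_if_equal(self, key, value):
        """Delete `key` only if it still holds `value`; True if deleted."""
        with self._mu:
            if self._data.get(key) == value:
                del self._data[key]
                return True
            return False


class Instance:
    """One Atlantis-like process; all instances share the same store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def lock_project(self, project, pull):
        return self.store.set_nx(f"lock:{project}", f"pr-{pull}")

    def unlock_project(self, project, pull):
        return self.store.delete_if_equal(f"lock:{project}", f"pr-{pull}")
```

With two instances sharing one store, a lock taken through instance A is visible to instance B, which is what an embedded per-process database cannot offer.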

Nov 03 '22 23:11 jamengual

To close this ticket, I think some official docs are needed on how redis locking can be used to spin up more than one instance/pod of atlantis

Nov 07 '22 20:11 nitrocode

I agree.

Nov 07 '22 20:11 jamengual

Any news when this will be officially documented regarding implementation? This also blocks: https://github.com/terraform-aws-modules/terraform-aws-atlantis/issues/322

Jan 16 '23 11:01 gartemiev

@gartemiev none. This is an open source project and we depend 100% on user contributions. Please feel free to try out this feature, experiment, and see what works. If you can get it working and document it, everyone would appreciate it.

Jan 16 '23 16:01 nitrocode

Did you enable Redis locking? Are you running parallel plans and applies?

Jan 16 '23 17:01 jamengual

So in order to have HA in Atlantis:

Load Balancer in front of Atlantis "Cluster"
The actual Atlantis "Cluster" (ECS in our scenario)
Share disk space between all Nodes/Containers in the Cluster
Switch locking-db-type to Redis

I'll test this
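
As a starting point for that test, this is roughly the environment I expect every replica to need, sketched as a small Python helper. `ATLANTIS_DATA_DIR` and the `locking-db-type` setting come from this thread; `ATLANTIS_LOCKING_DB_TYPE`, `ATLANTIS_REDIS_HOST` and `ATLANTIS_REDIS_PORT` are my assumption of the matching environment variable names and should be checked against the server configuration docs. The host name and mount path in the usage line are placeholders.

```python
def replica_env(redis_host, data_dir, redis_port=6379):
    """Environment for one replica; every replica gets identical values."""
    return {
        # Move locks out of the embedded BoltDB file.
        "ATLANTIS_LOCKING_DB_TYPE": "redis",
        # Assumed variable names for the Redis connection settings.
        "ATLANTIS_REDIS_HOST": redis_host,
        "ATLANTIS_REDIS_PORT": str(redis_port),
        # Must point at storage that all replicas can see.
        "ATLANTIS_DATA_DIR": data_dir,
    }


def same_config(envs):
    """True if every replica was given exactly the same environment."""
    return all(env == envs[0] for env in envs)
```

For example, `replica_env("redis.internal.example", "/mnt/atlantis")` would be passed unchanged to each container definition, and `same_config` is a cheap guard against replicas drifting apart.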

Relates to:

https://github.com/runatlantis/atlantis/pull/2491#issuecomment-1446736952

Mar 02 '23 11:03 albertorm95

You may not need to share disk space. I'm unsure of this since I haven't tested it, but it's possible that Redis is housing not only the locks but the plans as well.

Please test with shared disk space and without. This will be handy in documentation on the website

Mar 02 '23 13:03 nitrocode

NFS or some other shared file system is required, since the plans are NOT stored in Redis.

There was some weird behaviour, like the error below being shown in multiple Atlantis instances, and multiple times in some of them. This might not be related to the setup:

{"level":"error","ts":"2023-03-03T16:04:32.624Z","caller":"logging/simple_logger.go:163","msg":"invalid key: b5bacfe9-e187-4e6b-af0a-d169b785e0a2","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/logging.(*StructuredLogger).Log\n\tgithub.com/runatlantis/atlantis/server/logging/simple_logger.go:163\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).respond\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:92\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).getProjectJobsWS\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:70\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).GetProjectJobsWS\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:83\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/[email protected]/mux.go:210\ngithub.com/urfave/negroni/v3.Wrap.func1\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/runatlantis/atlantis/server.(*RequestLogger).ServeHTTP\n\tgithub.com/runatlantis/atlantis/server/middleware.go:70\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"}

Also, with multiple Atlantis instances behind the LB, when you try to look at the logs of a plan you may or may not get them, since the request is load balanced 😅. Maybe with some flags the Atlantis logs could also be "centralized" so that any instance can show you the logs.
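
A toy simulation of what I think is happening, assuming the job output is only known to the instance that ran the plan (the thread does not confirm this, it is my reading of the "invalid key" error). The classes are hypothetical and only model a round-robin load balancer in front of two instances.

```python
import itertools
import uuid


class Instance:
    """One Atlantis-like process that keeps job output in its own memory."""

    def __init__(self, name):
        self.name = name
        self.jobs = {}  # job ID -> log lines, visible to this process only

    def run_plan(self):
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = ["terraform plan output ..."]
        return job_id

    def get_logs(self, job_id):
        if job_id not in self.jobs:
            raise KeyError(f"invalid key: {job_id}")
        return self.jobs[job_id]


class RoundRobinLB:
    """Sends each request to the next instance in turn."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def route(self):
        return next(self._cycle)
```

With two instances, a plan routed to the first one followed by a log request routed to the second one fails, and the next log request succeeds again. Sticky sessions or a shared job store would be two ways to avoid that.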

Also, at least with NFS it felt slow, so maybe looking into storing the plans in Redis could improve this 👀

cc: @nitrocode

Mar 03 '23 16:03 albertorm95

Ooofa... Thanks for the update

Mar 03 '23 17:03 nitrocode

How about a failover mechanism with a shared PV/PVC? I don't think an HA multi-node setup is a good fit for Terraform and Atlantis, because normally only one worker at a time can execute plan/apply against any given tfstate.

So how about supporting a failover mechanism like:

/webhook -> instance 1
instance 1 went down
/webhook -> instance 2 with same setup.

I think a standby instance could just solve this easily.
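
The routing rule I have in mind is tiny. A sketch, with hypothetical instance names and a health check passed in as a function (in practice this decision would live in the load balancer, not in Atlantis):

```python
def pick_target(instances, is_healthy):
    """Return the first healthy instance in priority order (active first)."""
    for inst in instances:
        if is_healthy(inst):
            return inst
    raise RuntimeError("no healthy Atlantis instance")
```

As long as instance 1 is healthy, it receives every webhook. Instance 2 only receives traffic once instance 1 fails its health check, so there is never more than one writer on the shared volume.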

Mar 22 '23 13:03 anhdle14

If you run in Kubernetes or an Autoscaling group of 1, you'd already get that experience though, @anhdle14. Having multiple nodes would mean zero downtime and give the ability to distribute work if you have multiple projects/repos managed by Atlantis.

Mar 22 '23 13:03 jukie

Yeah, that is true. I was thinking more of a failover scenario where the cluster goes down for a particular zone/region. I think for my case I actually need to have the deployment on multiple clusters, but this should work for a single-cluster deployment. Also, my solution for multiple projects/repos is to have multiple deployments, because of isolation / blast radius and multi-tenancy.

atlantis.example.com/team point to different deployment etc...
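
A sketch of that path-based split, assuming the first path segment names the team. The team names and backend URLs are placeholders, and in practice an ingress or load balancer rule would do this rather than application code.

```python
def route(path, deployments, default=None):
    """Pick the deployment whose key matches the first path segment."""
    segment = path.strip("/").split("/", 1)[0]
    return deployments.get(segment, default)


# Hypothetical mapping of team prefix to a dedicated Atlantis deployment.
DEPLOYMENTS = {
    "team-a": "http://atlantis-team-a:4141",
    "team-b": "http://atlantis-team-b:4141",
}
```

Each deployment keeps its own data dir and locks, so a failure in one only affects that team's repos.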

Mar 22 '23 15:03 anhdle14

I am working with @tapaszto who originally opened this thread. Since I can see there are some recent comments here let me share my thoughts.

We have been using Atlantis for almost 3 years. First, we were hosting it in an ACI then migrated to App Service and right now we are discussing moving to AKS. Since the Redis option became available for the lock DB we were planning to make our environment more resilient. The ultimate goal would be to have a multi-zone and multi-region active-active-active deployment.

Our preference would be to stay on App Service, that said there are certain storage limitations there. Since the repo content is still stored on disk, the disk needs to be shared across the nodes. For that either we use Azure Files (SMB) or Blobfuse (with AKS), but both of these are at least 5x slower than writing the content to a local disk. These are not options sadly because of the performance.
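
For anyone who wants to compare their own storage options, a small write test along these lines is enough to see the difference. It is a rough sketch, not the benchmark we used: the file count and size are arbitrary, and the paths in the usage note are placeholders.

```python
import os
import tempfile
import time


def time_small_writes(directory, files=200, size=4096):
    """Write and fsync `files` small files under `directory`; return seconds."""
    payload = os.urandom(size)
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        start = time.perf_counter()
        for i in range(files):
            with open(os.path.join(tmp, f"f{i}"), "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())  # force the data out to the backing store
        elapsed = time.perf_counter() - start
    return elapsed
```

Dividing `time_small_writes("/mnt/share")` by `time_small_writes("/tmp")` gives a rough slowdown factor for the shared mount compared to local disk.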

AKS is now offering shared ZRS Managed Disk support which we are actively exploring. This might solve the zone redundancy requirement if we move to AKS but still will not solve the geo-redundancy requirement. For now, we are considering a primary-secondary (active-passive) deployment, potentially sharing the locking database across regions but not the files as there is no technical solution for that.

I think that the next step for this project when it comes to resiliency is to have a solution for the git content/plan files.

Apr 12 '23 08:04 Dilergore

I was pointed to the https://github.com/lyft/atlantis fork which makes use of temporal workflows. That would be a heavy lift to pull in but something fully distributed like that is what I'd prefer vs NFS shares.

Apr 12 '23 16:04 jukie

@jukie - Thanks for sharing this, I spent some time going over this and I definitely have some questions and thoughts.

I went through the README of project Neptune. I can see how potentially Temporal and its engine would solve failures and would enable HA even across regions.

That said, it would be great to understand whether project Neptune is just a fork that is planned to be used within Lyft, or whether there is a plan to merge it back upstream in some shape or form as a new major version in the future.

It seems the Neptune workflow is targeting Terraform actions that happen after a PR merge. This is a big behavioral change which (at least for us) would not be the preferred way of handling deployments. There might be some edge cases, but the majority of our deployments must happen the way they happen today for consistency purposes: the code cannot be merged before a terraform apply succeeds. For the type of workflow that Neptune tries to cover, we already have options like CI/CD pipelines.

Do not get me wrong, this is useful and I see the value, but this fork is raising a lot of questions in my mind and it would be really great to see what is the future of the upstream version of Atlantis.

Apr 13 '23 06:04 Dilergore

@nishkrishnan from Lyft may have some comments about this too.

I think Atlantis is great, but it is lacking in a few areas when it comes to enterprise deployments and very busy deployments. The current workarounds work, but at its core Atlantis was not built to be highly available, and that is a requirement for some companies.

I'm not opposed (but I'm not the only maintainer) to expanding the Redis usage, or maybe even bringing some of the Lyft work upstream, with the modifications needed to keep the current flow, into an Atlantis 2.0 for example.

This kind of effort will need coordination (which I'm willing to provide) and multiple people working actively/committed to this effort.

The possibility of multiple companies contributing to this is possible too.

@nitrocode @GenPage

Apr 13 '23 16:04 jamengual

Hey, i can speak a little bit about Lyft.

We completely rearchitected Atlantis to the point that a lot of the original stuff in there is pretty much unused/deleted. Atlantis in its current state is great in terms of flexibility, but that's a double-edged sword and especially impacts the testability and iterative development of the product. So in order to ease the rearchitecture and simplify things a bit, we've made an opinionated version, in a way, with fewer features but enough to POC the new backend in a reasonable amount of time.

That said, I don't believe it's worth it to try and re-integrate with upstream given the divergence. I see it as a new product entirely with a heavier dependency tree (ie. Temporal). Lyft initially wanted to have another repo in the Atlantis org owned by us where we could own, build and iterate on this version, but there were some political differences that stopped us, so we kept our work in our fork.

As for what we plan to do with it, i think that depends on general interest. I'd love to hear from the community about their usecases/setups etc. I'm usually out and about in the Atlantis slack channel so feel free to hmu and we can chat.

Apr 13 '23 19:04 nishkrishnan

I like the idea building the platform on Temporal, as I mentioned above, I totally see the benefits.

When it comes to use cases and setups, I will try to collect some of ours:

Zone and Geo redundancy
Support for Azure DevOps
Custom workflow capabilities
Kind of base workflow as what we have in the upstream Atlantis (PR comments and apply before merge)

I think the above are the most important ones. In my opinion there should be a tactical and a strategic solution. The tactical could be supporting Redis for the file store, while I could easily imagine a strategic end goal for a new more sophisticated major version maybe based on Temporal and pulling some of the Lyft code in.

Let me know your thoughts.

Apr 14 '23 05:04 Dilergore

I agree, but whatever is built still needs to support the current VCS types we have and a streamlined configuration method (so we don't have 150 flags) with the major and most popular options used, which the Lyft fork does not support.

As for the geo settings that I think can be achieved at the infrastructure level so if just a HA version is built we can improve from that.

Apr 14 '23 05:04 jamengual
