High availability tracking issue

Open AndrewGuenther opened this issue 3 years ago • 18 comments

If you're interested in high availability support for fck-nat, follow this issue. All PRs and subtasks will be linked here. This issue will not be closed until the 1.2 release of fck-nat launches.

AndrewGuenther avatar Dec 01 '21 00:12 AndrewGuenther

Love the idea of this. Just curious if you have any kind of timeline in mind?

rcoundon avatar Dec 04 '21 10:12 rcoundon

I'm targeting HA being usable by mid-January. Documentation will be the long tail on this one.

AndrewGuenther avatar Dec 06 '21 16:12 AndrewGuenther

Great - thanks for the reply

rcoundon avatar Dec 06 '21 19:12 rcoundon

Mid-January target looking a bit optimistic. Whole family got COVID at the end of December and that's when I planned to do a good chunk of this. Still going to aim for end of January at the latest.

AndrewGuenther avatar Jan 04 '22 23:01 AndrewGuenther

Thanks for the update @AndrewGuenther, it's really appreciated. Hope you're all safe and well.

rcoundon avatar Jan 05 '22 09:01 rcoundon

A bit later than I'd hoped, but progress is being made on high availability. You can follow along on the ha-mode branch: https://github.com/AndrewGuenther/fck-nat/tree/ha-mode

The initial approach is to create a single instance autoscaling group with a static ENI that the currently active instance will attach at start. Initial implementation will be in CDK and both the deb and rpm packages will include support as well via a configuration option.
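
As a rough illustration of that layout (not the actual construct on the ha-mode branch), the two key resources can be written as a CloudFormation-style fragment built in Python. The logical IDs `FckNatInterface` and `FckNatAsg`, the function name, and all parameters are hypothetical; the point is only that the ENI lives outside the group and the group is pinned to exactly one instance in the ENI's subnet, since an ENI can only attach to instances in its own Availability Zone.

```python
def ha_nat_resources(subnet_id, security_group_id, launch_template_id,
                     launch_template_version="1"):
    """Build a CloudFormation-style resource map for a single-instance ASG
    plus a standalone ("static") ENI. All names here are illustrative."""
    return {
        # The ENI is created independently of any instance, so it survives
        # when the instance behind it is replaced.
        "FckNatInterface": {
            "Type": "AWS::EC2::NetworkInterface",
            "Properties": {
                "SubnetId": subnet_id,
                "GroupSet": [security_group_id],
                # A NAT forwards traffic not addressed to itself, so the
                # source/destination check has to be off.
                "SourceDestCheck": False,
            },
        },
        # Min = max = 1: the group exists only to replace a failed instance.
        "FckNatAsg": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "MinSize": "1",
                "MaxSize": "1",
                # Same subnet as the ENI so the replacement can attach it.
                "VPCZoneIdentifier": [subnet_id],
                "LaunchTemplate": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": launch_template_version,
                },
            },
        },
    }
```

The instance role would additionally need permission to attach the interface; that part is omitted here.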

The CDK construct has been made and can be seen on the ha-mode branch. I'm currently working on adding attachment support to the fck-nat service. Once that is done, I'll be publishing the CDK construct to Construct Hub. After that, I'm open to feedback on where to prioritize publishing next.

AndrewGuenther avatar Feb 17 '22 16:02 AndrewGuenther

Hi, this is great stuff and I appreciate your work on it! Any idea when you think ha-mode will be good enough? What's left to do?

matthewpflueger avatar Jun 16 '22 18:06 matthewpflueger

So the ha-mode branch has a CDK construct which will properly configure an autoscaling group for a fck-nat instance. What needs to happen now is that the init script needs to be able to attach the ENI to itself when starting up.

In the main branch, the init script assumes that the ENI has already been attached and will simply configure routing. This works for a single EC2 host provisioned with a default ENI. In ha-mode however, we provision a "floating" ENI which will get attached to the active host. So if an instance fails, the replacement instance will re-attach that same ENI back to it with minimal traffic disruption. I actually have this script written, but I need to do more thorough testing on it. Once that lands, ha-mode will effectively be "done" and then it is just a matter of documenting how to properly configure the feature for various deployment methods.
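
For readers who want a feel for the attach step, here is a minimal sketch in Python. It is not the fck-nat init script: the function name, the polling loop, and the default device index are assumptions made for illustration. `ec2` is expected to behave like a boto3 EC2 client, whose `attach_network_interface` and `describe_network_interfaces` calls are used as-is.

```python
import time


def attach_floating_eni(ec2, instance_id, eni_id, device_index=1,
                        timeout_s=60.0, poll_s=2.0, sleep=time.sleep):
    """Attach eni_id to instance_id, then poll until the attachment is live.

    Returns True once the ENI reports "attached" to this instance and
    False if that does not happen within timeout_s seconds.
    """
    ec2.attach_network_interface(
        NetworkInterfaceId=eni_id,
        InstanceId=instance_id,
        DeviceIndex=device_index,  # index 0 is the instance's default ENI
    )
    waited = 0.0
    while waited <= timeout_s:
        resp = ec2.describe_network_interfaces(NetworkInterfaceIds=[eni_id])
        attachment = resp["NetworkInterfaces"][0].get("Attachment", {})
        if (attachment.get("InstanceId") == instance_id
                and attachment.get("Status") == "attached"):
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```

On a real instance this might be called as `attach_floating_eni(boto3.client("ec2"), instance_id, eni_id)`, with both IDs supplied by configuration. The sketch assumes the ENI is already free; if the failed instance still holds it, the attach call would raise and need a retry.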

AndrewGuenther avatar Jun 16 '22 19:06 AndrewGuenther

Some additional progress info: I've made good progress on the script changes, but debugging iptables is a massive pain. I'm also weighing two different approaches:

  1. In the first approach, I attach the ENI and then forward traffic to the default interface to handle the NAT. The benefit of this approach is that you don't need to muck with any of the default networking attachment settings, you're just adding a new interface and then forwarding to it.
  2. The second approach is to allocate an EIP for the ENI and then NAT that interface directly. This requires blowing away some of the default networking config which can be difficult to get right across distros and (theoretically) takes longer if the instance needs to be replaced.

I strongly prefer the first approach, but I'm having a hell of a time getting the iptables to do what I want and I only have so much patience for messing with them in a single sitting :sweat_smile:. Approach 2 however is working end-to-end, but isn't hardy enough for me to qualify it as HA. So with both approaches I have successfully modified the script and associated CDK to grant permissions for ENI attachment and can attach the ENI, it's just down to some iptables debugging...
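
For anyone unfamiliar with the iptables side, the NAT part of approach 2 boils down to enabling IP forwarding and masquerading traffic leaving the attached interface. The sketch below only builds those two commands; it is not the script on the ha-mode branch, the interface name `eth1` is an assumption, and it leaves out the routing changes that make this approach fiddly across distros.

```python
import subprocess


def nat_commands(nat_interface="eth1"):
    """Return the commands that turn a host into a NAT for traffic
    leaving via nat_interface (interface name is an assumption)."""
    return [
        # Let the kernel forward packets between interfaces.
        ["sysctl", "-w", "net.ipv4.ip_forward=1"],
        # Rewrite the source address of forwarded packets to the
        # address of the outgoing interface.
        ["iptables", "-t", "nat", "-A", "POSTROUTING",
         "-o", nat_interface, "-j", "MASQUERADE"],
    ]


def apply_commands(commands, run=subprocess.run):
    """Run each command in order, stopping on the first failure."""
    for cmd in commands:
        run(cmd, check=True)
```

Running `apply_commands(nat_commands())` requires root; printing the list first is a cheap way to check what would be executed.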

AndrewGuenther avatar Jun 16 '22 22:06 AndrewGuenther

Alright. Got inspired this weekend to make some progress here. I'm going to stick with approach 2 as I've found a way to solve most of my concerns with it. I need to clean up the code before pushing, but the hard work here is done. I'm out at a conference this week and I should have enough downtime there to button this up.

Thanks everyone for the patience, this has been a longer time coming than I had hoped.

AndrewGuenther avatar Jun 19 '22 20:06 AndrewGuenther

Alright, the ha-mode branch has been updated with a version of the fck-nat service with ha-mode working on Amazon Linux. I'm currently using the ec2-net-utils to simplify editing the interfaces, but that doesn't work on Ubuntu. The scripts aren't terribly difficult, it's really some nitty gritty in ec2ifup I need to replicate which I'm working on now.

Once that's working on Ubuntu, I'll add some documentation and publish version 1.2. One of the goals for 1.2 was to publish CDK and Terraform modules. Given that it's a PITA to do all of that in a single repo, I'll be spinning off a fck-nat-cdk and fck-nat-terraform repository for those.

AndrewGuenther avatar Jun 21 '22 16:06 AndrewGuenther

This is amazing! Thanks for all the hard work here. Looking forward to trying this out soon!

matthewpflueger avatar Jun 24 '22 23:06 matthewpflueger

I've been having trouble getting this working for Ubuntu. As a result, I'm going to drop the Ubuntu AMI right now. Supporting both Ubuntu and AL2 is stretching me a bit thin and overcomplicating things. It may make a return in the future, but supporting just AL2 will get this and other features landed more quickly which is what I want the focus to be for now.

AndrewGuenther avatar Jul 18 '22 17:07 AndrewGuenther

I am happy to report that the ha-mode branch has now been merged into main! I've confirmed that the AMIs are working as expected in both single-node and ha-mode and will be publishing a release shortly.

I've also been working on migrating the CDK code out into a dedicated construct which will also be published soon.

AndrewGuenther avatar Jul 20 '22 19:07 AndrewGuenther

CDK construct code in progress here: https://github.com/AndrewGuenther/cdk-fck-nat

AndrewGuenther avatar Jul 20 '22 20:07 AndrewGuenther

Thank you for all your work on this @AndrewGuenther - really appreciated

LeoLapworthKT avatar Jul 21 '22 07:07 LeoLapworthKT

The v1.2.0 AMI for fck-nat has been published in AWS and is available for use. The official cdk-fck-nat construct is being released as I write this, and it is currently the easiest way to get your hands on HA mode. As much as I want to resolve this issue, there's still some additional documentation I want to write. I'll close this as soon as that's done.

AndrewGuenther avatar Aug 12 '22 08:08 AndrewGuenther

NPM CDK module: https://www.npmjs.com/package/cdk-fck-nat

Python CDK module: https://pypi.org/project/cdk-fck-nat/

AndrewGuenther avatar Aug 12 '22 08:08 AndrewGuenther

Documentation for this feature is up, it is supported out-of-the-box by the CDK module, and the warning has been removed from the README. Resolving!

AndrewGuenther avatar Aug 15 '22 00:08 AndrewGuenther