
NAT64 Support (Jool)

Open RaJiska opened this issue 1 year ago • 12 comments

Adds NAT64 support through Jool, as a better alternative to the previous NAT64 PR I made with TAYGA (https://github.com/AndrewGuenther/fck-nat/pull/50). As rightfully pointed out in the associated issue (https://github.com/AndrewGuenther/fck-nat/issues/41#issuecomment-1819000799), Jool is a much better alternative than TAYGA: it runs as a kernel module rather than in userland, it is a stateful (vs stateless) NAT64 implementation, and it has fewer constraints.

With Jool, NAT64 is added as a transparent builtin to fck-nat.

A side effect is that the AMI now takes significantly longer to build, about 12 to 14 minutes, whereas it previously took about 5 minutes without Jool.
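For reference, a stateful NAT64 instance with Jool boils down to a couple of commands. This is a hedged sketch of the general Jool setup, not the exact commands from this PR; the instance name is illustrative.

```shell
# Load the Jool kernel module (stateful NAT64).
modprobe jool

# Create a NAT64 instance hooked into Netfilter, translating via the
# well-known prefix 64:ff9b::/96 (RFC 6052).
jool instance add "default" --netfilter --pool6 64:ff9b::/96
```

Because translation happens in the kernel's Netfilter hooks, no userland daemon needs to stay running, which is the main advantage over TAYGA.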

RaJiska avatar Nov 22 '23 10:11 RaJiska

Hey @AndrewGuenther, gentle bump regarding this PR and the couple other ones pending review.

RaJiska avatar Dec 13 '23 02:12 RaJiska

Apologies, I've been travelling the last few weeks for AWS re:Invent and another conference and playing catch up. I'll get to this one soon.

AndrewGuenther avatar Dec 13 '23 04:12 AndrewGuenther

No worries, thanks for looking into it!

RaJiska avatar Dec 13 '23 05:12 RaJiska

@RaJiska alright. It's finally happening. I'm 90% good with this change as is, but what I'd really like to do is build an RPM for Jool and install that separately so all the build dependencies and outputs don't get baked into the AMI. I'll work on the best way to do that and will update and merge this PR once I'm happy with it.

If you'd like to help, I'll gladly take it, but you've done a huge amount of the heavy lifting here already and I appreciate it!

AndrewGuenther avatar Jan 20 '24 00:01 AndrewGuenther

Amazing news, glad you were able to go through it. You mentioned you'd like dependencies managed independently of the AMI; please correct me if I'm wrong, but does that mean you'd rather have those dependencies installed via user data during instance setup? If so, I can see how a light AMI would be a benefit. However, it would also mean slower instance startup, which might hurt the recovery metric in HA mode: for users enabling those extra features, the network would take longer to recover after an instance crash or similar. Perhaps two different AMIs could solve this, one light AMI with core features, and one with the extra features.

I am currently in the middle of a move and won't have time to work on open source projects in the coming few weeks, but I'll be happy to work on this if it hasn't been done by the time I've moved, or even on additional features :)

RaJiska avatar Jan 21 '24 05:01 RaJiska

I'd still like it installed by default, I just don't particularly like building it in the AMI and having all the build dependencies installed.

So my ideal would be to use another host to create an rpm installer and use that to install here. That way stuff like gcc and such aren't hanging around.

I looked into the best way to do this today, and given the complexity of Jool's install, I'll likely just augment this PR for now to uninstall the build tools after Jool is installed, so we can get this in for 1.3.
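The interim approach described above might look something like this in the AMI build script. A rough sketch; the package list is an assumption for illustration, not taken from the PR.

```shell
# Install build dependencies only for the duration of the Jool build.
yum install -y gcc make kernel-devel elfutils-libelf-devel

# ... configure, build, and install Jool here ...

# Remove the build-only packages so they don't get baked into the AMI.
yum remove -y gcc make kernel-devel elfutils-libelf-devel
```

The downside, as noted below in the thread, is that it's easy to miss a transitively installed dependency during the uninstall phase.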

AndrewGuenther avatar Jan 21 '24 05:01 AndrewGuenther

Understood! Indeed, I've been looking for a way to do something like Docker multi-stage builds, but couldn't find a way to do it in an AMI context: there seems to be no built-in way to use an instance as an intermediate build stage that passes its output to the final AMI.

Perhaps we could leverage Docker for this: run the build in a container based on an exact replica of the initial AMI (ensuring compatibility), then copy the built binaries from the container onto the host. That way only Docker has to be installed and uninstalled, and the build dependencies are removed along with the containers.

This is more complex than just uninstalling dependencies, but it ensures that none are missed during the uninstall phase.
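Concretely, the Docker idea above could be sketched like this. Image and path names are assumptions; the Dockerfile would be `FROM` the same base image as the AMI and produce an RPM.

```shell
# Build Jool inside a container that mirrors the AMI's base image.
docker build -t jool-build .

# Create a stopped container and copy only the built artifact out of it.
docker create --name jool-build-tmp jool-build
docker cp jool-build-tmp:/root/rpmbuild/RPMS/ ./artifacts/

# Removing the container discards all build dependencies with it.
docker rm jool-build-tmp
```

Only Docker itself would then need to be installed and removed on the AMI builder.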

RaJiska avatar Jan 21 '24 06:01 RaJiska

I don't think that method is going to work due to kernel module dependencies. Docker will share the underlying kernel so unless we build the RPM on a host with a matching kernel version it won't necessarily work. That said, there's likely still something we can do here.

I'm currently focused on wrapping up #18, but once I'm done with that I'll get this merged and add some tests for it. How did you go about testing this during dev @RaJiska?

AndrewGuenther avatar Jan 22 '24 18:01 AndrewGuenther

Got main merged in and uninstalled the build deps, just need to get a good test setup

AndrewGuenther avatar Jan 23 '24 04:01 AndrewGuenther

> I don't think that method is going to work due to kernel module dependencies. Docker will share the underlying kernel so unless we build the RPM on a host with a matching kernel version it won't necessarily work. That said, there's likely still something we can do here.

Indeed, if we ever pursue this, we'd still need to install whatever dependencies the host requires to run the kernel module.

> How did you go about testing this during dev @RaJiska?

To test, I would build the AMI in my account. Then I set up a VPC with a dual-stack public subnet and an IPv6-only private subnet. I start fck-nat (with Jool) in the dual-stack public subnet, start an instance that needs NAT64 support in the private IPv6-only subnet, and try to access an IPv4-only website from it (you can try twitter.com, which has no AAAA records). Don't forget to add a route for 64:ff9b::/96 pointing to fck-nat's ENI (the equivalent of the 0.0.0.0/0 -> fck-nat ENI route in the IPv4 case).
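The NAT64 route described above can be added with the AWS CLI; the resource IDs below are placeholders.

```shell
# Route all NAT64 traffic (well-known prefix 64:ff9b::/96) from the
# private IPv6-only subnet's route table to the fck-nat instance's ENI.
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-ipv6-cidr-block 64:ff9b::/96 \
  --network-interface-id eni-0123456789abcdef0
```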

You can make this hassle-free by using the Terraform module's full example on the nat64 branch, which features a complete setup of the networking and fck-nat. After applying the example, you should just need to manually start an instance in the private IPv6 subnet and SSM/SSH/Serial into it to test.

Do note that SSM is clunky and unstable when using NAT64. This is because DNS resolution takes a crazy amount of time when using AWS' AMIs, as documented in https://github.com/AndrewGuenther/fck-nat/issues/41#issuecomment-1822467555.

RaJiska avatar Jan 23 '24 05:01 RaJiska

> Do note that SSM is clunky and unstable when using NAT64. This is because of DNS resolution which takes a crazy amount of time when using AWS' AMIs, as documented https://github.com/AndrewGuenther/fck-nat/issues/41#issuecomment-1822467555.

This worries me a bit. Given that I want to get a 1.3 release out by the end of this week, I think bumping this to a 1.4 release (even if that's only a few weeks from now) is probably the best option. I don't want to introduce instability to IPv4 use cases and we've got a good amount of unreleased goodies stacked up in 1.3 I want to get out the door.

AndrewGuenther avatar Jan 23 '24 19:01 AndrewGuenther

Yup makes sense!

Though from what I remember, this issue does not impact fck-nat itself or instances using IPv4, only instances configured in IPv6-only mode that rely on fck-nat for 6-to-4 address translation, which is required for AWS SSM to work. I'd have to get back into it to test and confirm when I have more time.

RaJiska avatar Jan 26 '24 16:01 RaJiska

Going to merge this into a release branch and publish a pre-release AMI for it so we can get folks who are interested from the original issue testing this out. @RaJiska did you see this comment? Would you be able to test?

https://github.com/AndrewGuenther/fck-nat/issues/41#issuecomment-2011033331

AndrewGuenther avatar Mar 30 '24 22:03 AndrewGuenther