Configurable MTU
What I'd like: I'd like to be able to configure the MTU of EC2 instances running bottlerocket.
I'm using Cilium on AWS EKS with cluster mesh across multiple AWS VPCs through a Transit Gateway. Unfortunately, Transit Gateways support a maximum MTU of 8500, while the EKS-optimized AMIs use jumbo frames.
Any alternatives you've considered: None
Related discussion: https://github.com/bottlerocket-os/bottlerocket/discussions/3338
I'd be happy to take a stab at implementing this. I looked around a little bit and gathered this much:
- The aws EC2 variants use the `wicked` systemd service to configure network interfaces
- The systemd pre-network hook calls `netdog generate-net-config`
- `netdog` looks for a file or kernel parameters as input to generate the wicked config
So how would adding this feature fit into this setup? Should the API server interact with netdog in some way, or is netdog updated to support MTU config, which is then added to kernel parameters or the net.toml file? Maybe a bit of everything...
Thanks for looking into this @tskinner-oppfi!
Things are complicated a bit by timing here, in that there is active work being done to migrate away from wicked to systemd-networkd. So while it might be possible to get something in to support this with wicked, that code is very close to going away.
It looks like Zach and Matt have added a lot of good detail to that issue. If there's anything you are able to add or contribute there to get MTU support, that would be awesome!
Now that the networkd work looks to be mostly done (awesome work!), I started looking at this again, and here are a couple of questions:
- Would we add another subcommand to `netdog` to configure just the MTU and expose it as a setting similar to the hostname network setting?
- Do we restart the networkd service after the config change, or run `networkctl reload` or something similar? I believe `networkctl reload` won't bring down the network but just reconfigure it with the new settings. It gets kind of dicey messing with network settings.
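For reference, if the change does land in a systemd-networkd config file, the reload path being discussed would look roughly like this (a sketch; the interface name is illustrative):

```sh
# Ask systemd-networkd to re-read its .network/.netdev files without
# restarting the service or bouncing links
networkctl reload
# Re-apply configuration to a specific link if needed (name is illustrative)
networkctl reconfigure eth0
```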
In the meantime I'll experiment a little and try to get some changes working so I can have something concrete to talk about.
@tskinner-oppfi Thanks! Contributions are always welcome!
As @stmcginnis mentioned above, we've done the work to integrate systemd-networkd, but currently it is only used in aws-k8s-1.28, the *-dev variants, and aws-ecs-2. The other variants continue to use wicked as their networking backend. Because of this, the process of adding new network config settings is admittedly a bit more work since we need to support config file generation for multiple backends. We want to make sure that the network backend is all but invisible and all of our variants support the same settings.
At a high level, this is how network config is generated:
- A systemd service `generate-network-config` is required by the `network-pre` target; it runs `netdog generate-net-config`
- `netdog` reads network config from a file (`net.toml`) or kernel parameters
- This network config is validated and deserialized into a set of Rust structs
- Depending on the network backend, `netdog` converts these structs into different structs that represent the actual config files `wicked`/`systemd-networkd` use.
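To make the input side concrete, here is a rough sketch of a net.toml with a possible MTU key. The overall shape (a version plus per-interface tables) follows the existing net.toml format; the `mtu` key itself is hypothetical and does not exist today:

```toml
version = 2

[eth0]
dhcp4 = true
primary = true
# Hypothetical new key -- not an existing netdog setting
mtu = 8500
```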
In order to add new settings, there are a few considerations:
- The creation of a new version of net config.
- Ensure that `wicked` supports MTU via config file (I have not looked into this yet)
- `systemd-networkd` has a few settings related to MTU; currently Bottlerocket defaults to using the MTU from DHCP. We'd want to make sure we handle this correctly with a custom setting.
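For context on the systemd-networkd side, these are the kinds of settings involved (a sketch of a .network file; the interface match and the value 8500 are illustrative):

```ini
[Match]
Name=eth0

[Link]
# Explicit MTU for the link
MTUBytes=8500

[DHCPv4]
# Don't let the DHCP-provided MTU override the explicit value
UseMTU=false
```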
From the code perspective, these would be the additions off the top of my head:
- Add the new "v4" net config structs and their associated validation and unit tests
- Add the `wicked` struct members and the associated logic to convert net config structs -> `wicked` structs
- Add the `systemd-networkd` config struct members, as well as the additional builder methods to add values to the config structs. Add the small amount of logic to drive the builders, calling these new methods.
- Unit test all the things.
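As a rough illustration of the first bullet, a new optional MTU field on the net config structs might look something like this; the struct and field names here are a hypothetical sketch, not netdog's actual types:

```rust
use serde::Deserialize;

/// Hypothetical per-interface config in a new "v4" net config version.
/// Names and shapes are illustrative, not netdog's real definitions.
#[derive(Debug, Deserialize)]
pub struct InterfaceConfigV4 {
    #[serde(default)]
    pub dhcp4: bool,
    #[serde(default)]
    pub primary: bool,
    /// New optional MTU setting; `None` means "leave it to DHCP/driver defaults".
    pub mtu: Option<u32>,
}

impl InterfaceConfigV4 {
    /// Basic validation: reject obviously invalid MTU values.
    pub fn validate(&self) -> Result<(), String> {
        if let Some(mtu) = self.mtu {
            // 68 is the IPv4 minimum; 9001 is the EC2 jumbo-frame maximum.
            if !(68..=9001).contains(&mtu) {
                return Err(format!("invalid MTU {}, expected 68..=9001", mtu));
            }
        }
        Ok(())
    }
}
```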
Happy to provide additional direction and answer questions!
Also interested in this feature. It's common for enterprises to use AWS Transit Gateway to connect their enterprise networks to AWS VPCs, so we really need some consistent way to configure a lower MTU.
Recently we've also seen some cases where some Bottlerocket nodes randomly get MTU 1480 on eth0 and other cases where they get the full 9001. Having an explicit way to configure it is needed.
Hey @jcmcken, thanks for the renewed interest in this old issue! We will discuss within the team how we want to move forward with this feature and then get back to you.
Getting back: In a normal scenario, we expect this value to be configurable at the network level (DHCP settings) rather than OS, so that is our current recommendation.
However, we can potentially add this setting, so leaving this issue open for contribution.
Do you have an example of configuring it in the DHCP settings? In AWS, the DHCP settings are configured through DHCP option sets, which have no parameter for MTU. I don't think there's any other location where you can configure DHCP network settings within the environment.
Or are you suggesting configuring the systemd-networkd DHCP settings? (Although you said "...rather than OS..." so I don't think this is what you mean). If this is what you mean, do you have a working example / workflow?
For clarification: In our case, we're using EKS with AWS VPC CNI. Our corporate network setup has a "shared services" VPC with VPC endpoints (VPCEs), one of those endpoints being ECR. This VPC endpoint is accessed over an AWS TGW from connected VPCs. So we need to be able to pull AWS VPC CNI images from ECR VPCE over the TGW. Thus we need the MTU to fall below 8500 (in our case, we prefer 1480 because of some other details), otherwise the connection just hangs.
AWS VPC CNI itself has some logic to set the host's MTU to the same value as the pod interface MTU you configure in the CNI settings. But to do this, we need to be able to pull the image in the first place. There's kind of a chicken-and-egg problem here.
From the ENA driver docs:
> The driver supports an arbitrarily large MTU with a maximum that is negotiated with the device. The driver configures MTU using the SetFeature command (ENA_ADMIN_MTU property). The user can change MTU via ip(8) and similar legacy tools.
A workaround in the absence of a dedicated MTU setting could be to use bootstrap containers to configure the MTU; however, I think you will likely hit the same chicken/egg problem of needing your AWS TGW to access the container.
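For anyone trying that route anyway, the command such a bootstrap container would end up running is just the plain ip(8) invocation the ENA docs mention (interface name and value are illustrative):

```sh
# Lower the MTU on the primary interface early in boot; eth0 and 1480 are
# examples, adjust for your interface naming and desired MTU
ip link set dev eth0 mtu 1480
```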
Not immediately helpful to solving the issue, but another point to note is that our Bottlerocket variants that use wicked are approaching their end-of-life (aws-ecs-1, aws-k8s-1.27). Only needing to support systemd-networkd here would greatly simplify the implementation.
> A workaround in the absence of a dedicated MTU setting could be to use bootstrap containers to configure the MTU; however, I think you will likely hit the same chicken/egg problem of needing your AWS TGW to access the container.
Right, this was brought up in the original discussion (on which this issue is based -- https://github.com/bottlerocket-os/bottlerocket/discussions/3338). The issue is that there's some kind of race condition, so it's not as trivial as just running a bootstrap script that sets the device MTU.
We are also facing this issue. As a workaround, I was able to successfully change the MTU on the Bottlerocket host using a k8s DaemonSet that reconfigures systemd-networkd.
Obviously this is not an ideal solution, so it would be great to have a proper config option for this exposed in Bottlerocket.
In case someone else needs it, the DaemonSet workaround approach is described here, on the discussion page: https://github.com/bottlerocket-os/bottlerocket/discussions/3338
I would intuitively expect this to be a net.toml setting - would this be an acceptable approach for everyone involved (to just add it to netdog directly and not make it settings configurable)?
> I would intuitively expect this to be a net.toml setting - would this be an acceptable approach for everyone involved (to just add it to netdog directly and not make it settings configurable)?
It's most correct to add to net.toml but for EC2 we'll also need to expose some way to set it via user-data, otherwise the AMI has to be re-built / re-registered to drop in that file (not fun).
What that might look like is:
- Ensuring there's a way to override settings like `netdog.default-interface` on the kernel command line via `settings.boot.init-parameters`. (Right now, trying to do this will brick the instance.)
- Coming up with a way to specify MTU, e.g. `netdog.default-interface=eth0:dhcp4,mtu@8500`
- Parsing the new input into a `NetConfigV4` struct instead of `NetConfigV1`.
- Using bootstrap commands to change the parameters and then reboot.
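Sketching what the user-data side of that could look like, assuming the proposed mtu@8500 syntax and the existing settings.boot.init-parameters mechanism (none of this exists today; it's just the proposal above written out):

```toml
# Hypothetical user-data snippet: pass the proposed netdog parameter through
# boot settings. The mtu@8500 syntax is the proposal from the list above,
# not an existing feature.
[settings.boot]
reboot-to-reconcile = true

[settings.boot.init-parameters]
"netdog.default-interface" = ["eth0:dhcp4,mtu@8500"]
```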