Introduce prefix allocation
## What is this change about?
As described in RFC 0038 (https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0038-ipv6-dual-stack-for-cf.md), BOSH shall be enabled to allocate prefix IP addresses (IPv4 and IPv6). Currently BOSH only supports attaching IP addresses with a /32 (IPv4) or /128 (IPv6) prefix, i.e. single IP addresses. To support this change, a new property called `prefix` is introduced in the networks section of the cloud config. Please refer to the example below for a manual network:
```yaml
- name: diego-cells-ipv6
  subnets:
  - az: z1
    cloud_properties:
      security_groups:
      - sg-0005f94257313417d
      - sg-06acfe8fb0a6247f0
      - sg-05a8b2b2e26ac1d5d
      - sg-064a667fb375e2dac
      - sg-01bb6e1e1f821fe4c
      subnet: subnet-0beae7541e0ebf5f1
    dns:
    - 2600:1f18:7415:8f00:0000:0000:0000:0253
    prefix: 80
    gateway: 2600:1f18:7415:8f00:0000:0000:0000:0001
    range: 2600:1f18:7415:8f00:0000:0000:0000:0000/56
    reserved:
    - 2600:1f18:7415:8f00:0000:0000:0000:0002
    - 2600:1f18:7415:8f00:0000:0000:0000:0003
    - 2600:1f18:7415:8f00:ffff:ffff:ffff:ffff
  type: manual
```
This example network tells BOSH to assign a prefix instead of a single IP address, i.e. to slice the /56 range into multiple /80 blocks.
The IP address allocation in the BOSH Director has been adapted to take these prefixes into account (previously the Director simply counted up by 1).
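As a minimal illustration of that stepping (plain Ruby `IPAddr`, not the Director's actual allocation code):

```ruby
require 'ipaddr'

# Illustrative only: step through a /56 range in /80 blocks instead of
# counting up single addresses.
range  = IPAddr.new('2600:1f18:7415:8f00::/56')
prefix = 80
step   = 2**(128 - prefix)  # number of addresses covered by each /80 block

first  = IPAddr.new(range.to_i, Socket::AF_INET6)          # base of block 1
second = IPAddr.new(range.to_i + step, Socket::AF_INET6)   # base of block 2
puts "#{first}/#{prefix}"   # 2600:1f18:7415:8f00::/80
puts "#{second}/#{prefix}"  # 2600:1f18:7415:8f00:1::/80
```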
One major change is how the IP addresses are stored in the database. The address_str field in the ip_addresses table changes from storing the integer representation of the IP to storing the IP address in CIDR notation. This is necessary so that the prefix information is not "lost" when storing the address, and it has the advantage that an IpAddrOrCidr object can be created directly from the string coming from the database.
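A rough sketch of the difference, using Ruby's `IPAddr` as a stand-in for `IpAddrOrCidr` (values are illustrative):

```ruby
require 'ipaddr'

# Before: address_str held an integer representation, so the prefix was lost.
legacy_value = IPAddr.new('10.0.72.1').to_i   # => 167790593

# After: address_str holds CIDR notation, which round-trips directly
# into an object (shown here with plain IPAddr).
stored = '2600:1f18:7415:8f00::/80'
ip = IPAddr.new(stored)
ip.prefix                                     # => 80
```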
This PR also changes the RPC interface of the create_vm method: it now includes a separate field called "prefix". The prefix information is sent in a separate field so that existing CPIs are not broken; older CPIs that do not support prefix allocation will simply ignore it. Below is an example network section of a create_vm call:
{"ha_proxy":{"type":"manual","ip":"10.0.72.1","prefix":"32","netmask":"255.255.224.0","cloud_properties":{"security_groups":"<redacted>","subnet":"<redacted>"},"default":["dns","gateway"],"dns":["10.0.0.2"],"gateway":"10.0.64.1"}
The prefix here is 32, indicating a single IPv4 address.
Once the VM creation is done, the BOSH Director API returns prefix IPs together with the normal IPs, so the BOSH CLI displays them just as it displayed single IP addresses before. Here is an example with one IPv4 address, one IPv4 prefix, one IPv6 address, and one IPv6 prefix across two instances:
## What tests have you run against this PR?
- Unit tests
- Acceptance tests in progress
## How should this change be described in bosh release notes?
Enable prefix allocation support for manual networks.
## Does this PR introduce a breaking change?
No
## Tag your pair, your PM, and/or team!
@anshrupani @DennisAhausSAP
Hi @aramprice, I just watched the recording of the foundational infra meeting from April 17th.
To clarify a bit: Diego is currently using Silk as overlay network that is independent from the IP address(es) of the Diego VM itself.
The idea for IPv6 is to use as much native network routing as possible, i.e. not use an overlay network for IPv6. This means that the IP addresses that would be assigned to the containers will be delegated from an IPv6 prefix that is assigned to the Diego VM.
Your question about whether those IP addresses would move in case of an evacuation: no. These IP addresses are not "sticky" to the app or app instance and would not move.
The goal is to make the networking setup for Diego simpler by using "native" IPv6 addressing. Traffic aimed at a particular container will reach the Diego VM via its CIDR range, and then the Diego VM's kernel can forward the traffic to the container's virtual NIC.
Please also note that the networks are supposed to be dual stack, not pure IPv6. So you would want to assign multiple (at least one IPv4 and one IPv6) networks to the same VM, as @fmoehler mentioned in the call already.
The "prefix" parameter is also the size of the prefix to delegate to each VM from a larger range. Have a look at the discussion while creating the RFC for a more extensive example.
The VM is supposed to assign an address to itself. Usually this is the x:y:z::1 address (i.e. the "first" address in the provided CIDR range). Addressing the VM is done via that IP address. The "remaining" addresses are then for the VM to do with as it wishes.
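Under that convention, deriving the VM's self-assigned address from a delegated prefix could look roughly like this (illustrative values, plain Ruby `IPAddr`):

```ruby
require 'ipaddr'

# Prefix delegated to a VM (illustrative value).
delegated = IPAddr.new('2600:1f18:7415:8f00:1::/80')

# The VM assigns itself the "first" address in that range, i.e. x:y:z::1.
vm_address = IPAddr.new(delegated.to_i + 1, Socket::AF_INET6)
puts vm_address  # 2600:1f18:7415:8f00:1::1
```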
Hi @peanball,
Thanks for that additional context - the info and the RFC were helpful.
In thinking about the overall changes to Bosh a few goals or principles (maybe too strong a word) came to mind:
- Make conditional behavior / IP-version specific code and settings as limited as possible
  - I really like that `prefix:` is used and not `ipv6_prefix:`
  - Should an IPv4 network allow/require a `prefix:` (one that selects a single IP) for consistency?
- Reduce the back and forth conversion / formatting where possible (`format_ip()`, `ip.to_i` and similar)
  - Ideally a single `IpAddrOrCidr` (I apologize for the name) or `IpAddress` representation would be used, and conversions would be handled only where needed by the class itself, e.g. `#to_db_value` for persistence, or `#to_s` for logs (see the sketch below)
- Avoid assumptions in the code that BOSH will always be dual-stack
  - I realize the system will start out dual-stack but this may not be forever

This isn't to imply disagreement about the current state of things, only to capture my thoughts at the moment.
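A minimal sketch of the kind of single representation described above (method names like `#to_db_value` come from the suggestion; everything else is illustrative and not the class implemented in this PR):

```ruby
require 'ipaddr'

# Illustrative single IP-or-CIDR representation that owns its conversions.
class IpAddrOrCidr
  def initialize(cidr_or_ip)
    @ip = IPAddr.new(cidr_or_ip)
  end

  # CIDR notation for persistence, so the prefix is never lost.
  def to_db_value
    "#{@ip}/#{@ip.prefix}"
  end

  # Human-readable form for logs.
  def to_s
    to_db_value
  end

  # Integer form for the few places that still need it.
  def to_i
    @ip.to_i
  end
end

IpAddrOrCidr.new('2600:1f18:7415:8f00::/80').to_db_value
# => "2600:1f18:7415:8f00::/80"
```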
@aramprice thanks for your feedback!
I just want to elaborate a little on our current idea, but of course this is open for ideas. Regarding your points:
- We will introduce a "prefix" property in the subnet section of the network (Maybe we can also put it to the network section itself, so that it applies automatically to all subnets?). This property will be considered for ipv4 and ipv6 networks. However if the property is not defined (e.g. for existing networks), the director will consider the "prefix" as /32 (ipv4) or /128 (ipv6) and maintain this accordingly in the database.
- Yes we will try to get rid of as many conversion as possible. Sometimes it can be handy to have the ips in integer representation, but most of the time we will pass them as IpAddrOrCidr Objects
- Agree with that. I did not test this (yet), but it might not even be specific for a dual stack setup. Probably it would also need changes in the bosh agent.
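A minimal sketch of that defaulting behavior (the helper name is illustrative, not the director's actual code):

```ruby
require 'ipaddr'

# Illustrative: fall back to a single-address prefix when none is configured.
def effective_prefix(configured_prefix, ip)
  return configured_prefix if configured_prefix
  IPAddr.new(ip).ipv6? ? 128 : 32
end

effective_prefix(nil, '10.0.72.1')               # => 32
effective_prefix(nil, '2600:1f18:7415:8f00::1')  # => 128
effective_prefix(80,  '2600:1f18:7415:8f00::')   # => 80
```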
Thanks @aramprice, I fully agree with you there. The logical unit of "IP address", with or without netmask (i.e. prefix) should be supported in either scenario and contain the logic of representation.
As @fmoehler mentioned, omitting the prefix will default to a single address (previous behavior).
And there should not be assumptions about dual stack, but dual stack should be possible. So far, BOSH supported either v4 or v6, not both at the same time. The mechanism to support more than one network is a classic n+1 problem and can be solved as such where possible.
We just happen to choose n=2 with one v4 and one v6 network. This should be the mindset.
@fmoehler I have a general comment on the term prefix in the config. It's a bit terse and does not convey the meaning very clearly.
In the end this is the prefix (size) delegated to each instance attached to this subnet, right? Can we somehow express this better? My suggestions would be:
- `delegate_prefix`
- `delegate_prefix_size`
- `delegate_cidr_size`
alternatively s/delegate/assign/g.
One thing regarding the display of the IP addresses: the bosh-cli shows the IPs as they come from the Director's API. As @fmoehler mentioned, the IPs are stored in the DB together with their prefixes. This also means that the bosh-cli would show the IP addresses with their prefix. Because all the docs show examples with IPv4 only, everybody has to be prepared that with this change the CLI could possibly show the IPs (e.g. with the command `bosh ... vms`) in the format "ip<v4|v6>/prefix".
So far we ensured not to change the API and to only return the IP (without the prefix), please refer to https://github.com/cloudfoundry/bosh/pull/2611/files#diff-c9cf4222f4ba33b6b3f35d7f32bfbb2c03a91470e18672bd3dda5468c38a0808R22, so it should not change the way IP addresses are displayed.
@fmoehler is this ready for review? Also, there are some failing checks. Could you please have a look?
@beyhan yes, this is ready for review. I have checked the failing tests, but they are not failing locally for me. I will check them again, but maybe someone has a hint on why they are failing.
@fmoehler try running the unit tests with the same Ruby version (3.3.8) that is used by the check. If I do that, the test fails locally for me as well.
BTW, during our tests we found a bug related to the vast number of IP addresses in the IPv6 address space. The issue originates from the "reserved" section in the cloud config: because the code in some places iterates over all reserved IP addresses, the director can become unresponsive if the ranges get too big (in our example we reserved a /80 range, which is 281,474,976,710,656 IP addresses). However, this issue is not related to this change and has been present before. We will try to tackle it as well in a separate commit in this PR, but the PR can already be reviewed in its current state.
Edit: Fixed with 3ab3b96
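The general idea behind avoiding that enumeration can be sketched as follows (illustrative only; the actual fix is in the commit referenced above):

```ruby
require 'ipaddr'

reserved = IPAddr.new('2600:1f18:7415:8f00::/80')

# Problematic: materializing every address in the range (2^48 entries here).
# reserved.to_range.each { |ip| ... }

# Preferable: keep the reserved entry as a range and test membership instead.
candidate = IPAddr.new('2600:1f18:7415:8f00::42')
reserved.include?(candidate)   # => true, computed without enumerating
```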
This is ready for review
FYI: I am rebasing this PR from time to time, which is why you will see some force-pushes. The code does not change. Still ready for review.
After feedback in https://cloudfoundry.slack.com/archives/C094QRLEQ/p1753985883042609?thread_ts=1753975247.413479&cid=C094QRLEQ we introduced a separate property called nic_group. It shall be defined in the networks section of the manifest and specifies which networks should be assigned to the same network interface card. This would look something like this:
```yaml
networks:
- name: ipv4-network
  nic_group: 1
- name: ipv6-network
  nic_group: 1
```
The director will basically forward this to the CPIs, giving them the possibility to assign IPs from different networks to the same network interface card. This change was implemented with aa7ffff.
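Roughly, the networks hash handed to the CPI's create_vm call would then carry the field per network, for example (values are placeholders):

```ruby
# Illustrative shape of the network settings passed to cpi.create_vm.
network_settings = {
  'ipv4-network' => {
    'type' => 'manual',
    'ip' => '10.0.72.1',
    'prefix' => '32',
    'nic_group' => '1',
  },
  'ipv6-network' => {
    'type' => 'manual',
    'ip' => '2600:1f18:7415:8f00::',
    'prefix' => '80',
    'nic_group' => '1',
  },
}
```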
Rebased on 282.0.7
Rebased to 282.0.8
Any reviews are appreciated. We have this running on an internal test landscape and it looks fine there so far, but we heavily rely on the manual network type, so reviews specifically covering the dynamic and vip network types are especially welcome.
We discussed in the meeting that we should make sure that using nic_group behaves in a compatible way if the CPI does not support it. Note: we might need to bump the CPI version.
For people who are wondering about the nic_group, this is what the robots have found so far:
## Flow Path:
1. Configuration Parse: nic_group is parsed from the deployment manifest in InstanceGroupNetworksParser.parse() at src/bosh-director/lib/bosh/director/deployment_plan/instance_group_networks_parser.rb:36
2. JobNetwork Creation: The parsed value is stored in a JobNetwork object at src/bosh-director/lib/bosh/director/deployment_plan/job_network.rb:11
3. Network Reservation: When creating network reservations, the nic_group flows from JobNetwork to DesiredNetworkReservation via the network planner at src/bosh-director/lib/bosh/director/deployment_plan/network_planner/planner.rb:9,15
4. Network Settings Generation: In NetworkSettings.to_hash() at src/bosh-director/lib/bosh/director/deployment_plan/network_settings.rb:36, it calls reservation.network.network_settings(reservation, ...)
5. Manual Network Processing: The key step is in ManualNetwork.network_settings() at src/bosh-director/lib/bosh/director/deployment_plan/manual_network.rb:80-83 where:

   ```ruby
   nic_group = reservation.nic_group
   if nic_group
     config["nic_group"] = nic_group.to_s
   end
   ```

6. Instance Plan: The network settings hash is generated via InstancePlan.network_settings_hash() at src/bosh-director/lib/bosh/director/deployment_plan/instance_plan.rb:289-290
7. CPI Call: Finally, in CreateVmStep.perform() at src/bosh-director/lib/bosh/director/deployment_plan/steps/create_vm_step.rb:31, the network settings hash is passed to the create() method, which then calls cloud.create_vm() at line 150 with the network_settings parameter containing the nic_group field.

The nic_group field flows from the deployment manifest through the network planning system, gets added to the network configuration hash in the ManualNetwork.network_settings method, and is ultimately passed to the CPI's create_vm method as part of the network_settings parameter.
Based on the analysis of https://github.com/cloudfoundry/bosh-aws-cpi-release/pull/181, the nic_group field is used to group multiple networks together onto the same network interface in AWS EC2 instances. Here's how it works:
Purpose:
- Groups multiple BOSH networks to share a single AWS network interface
- Allows multiple IP addresses (IPv4/IPv6) to be assigned to one network interface
- Enables network configurations where multiple logical networks use the same physical interface
Key behaviors:
1. Network grouping: Networks with the same nic_group value are combined into a single NicGroup object (src/bosh_aws_cpi/lib/cloud/aws/network_interface_manager.rb:18-20)
2. Default behavior: If no nic_group is specified, it defaults to the network name (src/bosh_aws_cpi/lib/cloud/aws/cloud_props.rb:306)
3. Subnet validation: All networks in the same nic_group must use the same subnet ID (src/bosh_aws_cpi/lib/cloud/aws/nic_group.rb:71-74)
4. IP assignment: The first IPv4 and IPv6 addresses from networks in the group are assigned to the interface (src/bosh_aws_cpi/lib/cloud/aws/nic_group.rb:82-100)
5. Interface creation: Each unique nic_group results in one AWS network interface being created (src/bosh_aws_cpi/lib/cloud/aws/network_interface_manager.rb:42-44)
This allows BOSH deployments to have multiple logical networks (e.g., for different services or VLANs) while using fewer AWS network interfaces, which can be useful for instance types with limited network interface capacity.
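A condensed sketch of that grouping logic (illustrative, not the CPI's actual code):

```ruby
# Group networks by nic_group, defaulting to the network name, so that each
# group maps to exactly one network interface.
networks = {
  'ipv4-network' => { 'nic_group' => '1', 'ip' => '10.0.72.1' },
  'ipv6-network' => { 'nic_group' => '1', 'ip' => '2600:1f18:7415:8f00::' },
  'management'   => { 'ip' => '10.0.80.5' },  # no nic_group => own interface
}

nic_groups = networks.group_by { |name, cfg| cfg['nic_group'] || name }
nic_groups.size  # => 2 network interfaces would be created
```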
Is grouping of logical networks onto a single network interface a feature that is supported by all cloud providers? Or is it something that is CPI specific? Because if it is cloud specific it should be propagated through cloud_properties.
Hi @rkoster, sorry for only coming back now. The feature is supposed to be available for all CPIs. Of course, for now CPIs besides AWS will ignore this property, but we are planning to implement the feature in at least some other CPIs in the future as well. We also have this limitation documented here: https://github.com/cloudfoundry/docs-bosh/pull/877. Does this answer your question?
BTW, we have been running a BOSH Director including these changes on dev, staging, and since a few days also on integrate landscapes, and so far we have not come across any issues. But as mentioned, we heavily rely on manual networks.
Rebased to 282.0.9