Proposal: Allow hinting to IPAM a preferred subnet size from the default pool
- What I did
Added a new `preferred_default_address_pool_size` IPAM option to hint a preferred subnet size for networks allocated from the default pool. The option is treated as optional, and invalid values (or values that would result in a larger subnet than a default pool is configured for) are ignored.
Implements #50107
- How I did it
Added support for a new IPAM option in the `defaultipam` implementation that, if present, passes a preferred subnet size to `addrSpace.allocatePredefinedPool` and may result in a smaller subnet than the pool's default being allocated.
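As a rough illustration of the option handling (the `parsePreferredSize` helper and its placement are hypothetical, not the actual diff):

```go
package defaultipam

import "strconv"

// parsePreferredSize extracts the preferred_default_address_pool_size option,
// if present. Invalid values are reported as ok == false and ignored by the
// caller rather than failing the request, matching the behavior described
// above. (Hypothetical sketch for illustration only.)
func parsePreferredSize(opts map[string]string) (bits int, ok bool) {
	v, present := opts["preferred_default_address_pool_size"]
	if !present {
		return 0, false
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 || n > 128 {
		return 0, false
	}
	return n, true
}
```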
- How to verify it
I've added unit tests for the new behavior, but the logic can also be verified manually by passing the `preferred_default_address_pool_size` IPAM option to `docker network create` with a smaller subnet size than the default pool configuration would provide (e.g. `--ipam-opt preferred_default_address_pool_size=24`).
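For example, on a daemon built with this change (the network name here is arbitrary, and the exact range allocated depends on the daemon's default address pools):

```console
$ docker network create --ipam-opt preferred_default_address_pool_size=24 testnet
$ docker network inspect testnet --format '{{(index .IPAM.Config 0).Subnet}}'
```

The inspect output should show a /24 subnet rather than the larger size the default pool would otherwise produce.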
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)
One thing I haven't done in this PR is account for the case where only a subset of the available network pools can fulfill the given preferred network size. For example, imagine pool A has a default size of /24 and pool B has a default size of /20, and the user requests a preferred size of /20. It likely makes sense to attempt to allocate an address from pool B first (even if pool A isn't fully allocated) before falling back to a /24 network from pool A if there's no available range in pool B.
If a preferred network size is requested, it's likely worth ordering the available pools by default pool size before attempting to allocate a subnet, so that pools that can fulfill the preferred size (or are closest to it) get precedence.
Thank you! It's perfectly fine for things to be incomplete, or even "throw away code". Having something to look at, or to give it a spin, and to have the discussion never hurts!
I ended up adding a step to sort the pools if a preferred size is specified (the idea being to prefer pools that are able to accommodate the preferred size when allocating a network).
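As a rough sketch of that ordering step (the `poolCfg` type, its fields, and `orderPools` are illustrative stand-ins, not the PR's actual code):

```go
package defaultipam

import "sort"

// poolCfg is a stand-in for a configured default address pool: the prefix
// length of its base network and the default size of subnets carved from it.
type poolCfg struct {
	baseBits    int // prefix length of the pool's base network
	defaultBits int // default prefix length of allocated subnets
}

// orderPools moves pools whose default size can accommodate the preferred
// prefix length (defaultBits <= preferredBits) to the front, so they are
// tried first during allocation. The sort is stable, so the original pool
// order is preserved within each group.
func orderPools(pools []poolCfg, preferredBits int) {
	sort.SliceStable(pools, func(i, j int) bool {
		fitI := pools[i].defaultBits <= preferredBits
		fitJ := pools[j].defaultBits <= preferredBits
		return fitI && !fitJ
	})
}
```

With the pools from the example above (A: /24, B: /20) and a preferred size of /20, pool B sorts first.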
My current design treats the default (or configured) pool prefix sizes as hard limits; the resulting network can have a smaller address range than the default, but never a larger one. For example, if a pool was configured with a default size of /24 and a request was made to create a network with a preferred prefix size of /16, any network from the pool would still be limited to a /24 prefix.
One other possible approach would be to honor the preferred prefix size as long as it's valid for the underlying base network of a given pool. For example, take a default pool configured with a base network of 172.17.0.0/12 and a default allocation size of /24. If a preferred size of /20 were given, it could be honored, since it still fits within the overall /12 base subnet. In that case, only the base prefix size would need to be considered when evaluating networks, not the default allocation size.
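Under that alternative, the check reduces to comparing prefix lengths against the base network; something along these lines (a hedged sketch, not the PR's code):

```go
package main

import (
	"fmt"
	"net/netip"
)

// canHonor reports whether a preferred prefix length fits within a pool's
// base network, considering only the base prefix size and ignoring the
// pool's default allocation size.
func canHonor(base netip.Prefix, preferredBits int) bool {
	return preferredBits >= base.Bits() && preferredBits <= base.Addr().BitLen()
}

func main() {
	base := netip.MustParsePrefix("172.17.0.0/12")
	fmt.Println(canHonor(base, 20)) // true: a /20 still fits inside the /12
	fmt.Println(canHonor(base, 8))  // false: a /8 is larger than the base
}
```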
Just found a previous PR/discussion related to address pool sizes. So, linking it here, but it looks like that work stalled in April last year ...
- https://github.com/moby/moby/pull/47737#issuecomment-2078055376
I need to do some work to get this ready to move out of draft; I've been mostly heads down on an upcoming release at work, but should have time again to devote to this proposal now.
I suspect there will still be a few changes warranted to this PR, but I think it's at a place where it's ready for actual review.
> but it looks like that work stalled in April last year ...
Author of #47737 here, I was not given clear instructions or guidance on the next steps. There are unresolved arguments on whether "non network-savvy users can easily read and identify /22 subnets" and that's pretty much why it stalled. On my side all required code changes have been done and all I was doing was awaiting input.
I'd argue the two changes are complementary; I don't think anything in this PR invalidates the benefit of more reasonable defaults that increase the number of simultaneous networks available to the average Docker user. I see this PR as being more beneficial for developer tooling that interacts with Docker networking (coincidentally exactly what I work on, but I also recognize it's a more niche scenario). My main motivation with this PR is that, if I know that even 1024 unique addresses is more than a given network will require, I'd rather be a good steward of a developer's system resources and request a smaller range such as /24 (or even /25).
We discussed this PR during today's libnet maintainers call.
We all agreed that the `--subnet`-based UX is better than forcing users to specify an `--ipam-opt`.
Regarding API representation, we think that it's fine to put unspecified prefixes in `IPAM.Config.Subnet`. Older daemons will accept these values, but will fail to attach any container to such networks (because they'll try to allocate / assign the 0.0.0.1 address). There's not much we can do about that — clients will need to discover whether the daemon supports this feature (by checking the API version).
We considered putting the subnet size in an IPAM option at the API level — i.e. while still allowing users to specify unspecified prefixes through `--subnet` — but this would require the CLI / Compose to translate subnet values into IPAM options. Since this would be a new CLI / Compose behavior, older versions would send these as `IPAM.Config.Subnet`, so that wouldn't solve the initial compatibility issue (i.e. consumers would have to check the CLI / Compose version to discover whether they support that translation, plus they'd have to check the API version to know whether the daemon will honor it).
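For context, an "unspecified prefix" here is a value like 0.0.0.0/24 or ::/64, where only the length is meaningful. A daemon-side check could look roughly like this (`sizeOnly` is a hypothetical helper for illustration, not the merged code):

```go
package main

import (
	"fmt"
	"net/netip"
)

// sizeOnly reports whether an IPAM.Config.Subnet value carries only a
// requested prefix length (an unspecified address such as 0.0.0.0/24 or
// ::/64) rather than a concrete subnet.
func sizeOnly(subnet string) (bits int, ok bool, err error) {
	p, err := netip.ParsePrefix(subnet)
	if err != nil {
		return 0, false, err
	}
	return p.Bits(), p.Addr().IsUnspecified(), nil
}

func main() {
	fmt.Println(sizeOnly("0.0.0.0/24"))   // 24 true <nil>
	fmt.Println(sizeOnly("10.10.0.0/16")) // 16 false <nil>
}
```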
Regarding this comment:
> My current design treats the default (or configured) pool prefix sizes as hard limits; the resulting network can have a smaller address range than the default, but never a larger one. For example, if a pool was configured with a default size of /24 and a request was made to create a network with a preferred prefix size of /16, any network from the pool would still be limited to a /24 prefix.
We're rather erring on the side of making the subnet size a mandatory requirement the Engine has to honor, instead of a hint that the Engine may decide to ignore. The reason is that we're considering https://github.com/moby/moby/pull/47737 as a next step, and we'd like to take that PR even further by making the default subnet size a global config parameter instead of a per-address-pool parameter as is currently the case.
Sounds good; the current implementation is fairly close to this already. I'm treating the subnet size as mandatory (when specified) and honoring it via `--subnet` (this works for multiple subnets, including a mix of IPv4 and IPv6). The only real change would be removing the `--ipam-opt`, which is trivial and only requires removing a single if block and updating a few tests. My main motivation for including the `--ipam-opt` was primarily to make my life easier integrating this feature with the developer tooling that I work on, but it's nothing I can't work around by checking for supported versions.
@akerouanton with the latest commits, I believe the PR is at a place where it matches the desired behavior.
I missed a unit test that was validating `RequestPool` with "all addresses" subnet requests (0.0.0.0/x, ::/x) and expecting the previous behavior (trying to process them as explicit subnet requests). Updated the test to account for the new default pool allocation behavior.
Could you squash your commits please? It'd make it easier to review from an IDE. At a quick glance, I think there's no upfront refactoring, so it should probably go into a single commit.
Done!
@danegsta FYI we looked at your PR today with @robmry but we found that when the daemon restarts it picks a different subnet but keeps the previous gateway. This is wrong, but it doesn't come from your PR — it comes from this code path:
https://github.com/moby/moby/blob/8d4cb6e07168790c8eb4d976218ca70a3c31649d/daemon/libnetwork/controller.go#L774-L780
We should probably delete that code, and update the following method (or its caller) to pick either `[]*IpamConf` (i.e. the user's desired state) when no subnet was allocated beforehand, or reuse `[]*IpamInfo` when there is one:
https://github.com/moby/moby/blob/8d4cb6e07168790c8eb4d976218ca70a3c31649d/daemon/libnetwork/network.go#L1532
(And we should probably remove all that pointer nonsense along the way.)
@robmry will submit a separate PR to fix that.
Looks like there are some merge conflicts due to `daemon/network.go` and `integration/daemon/daemon_linux_test.go` updates, but nothing that looks particularly awkward to merge my changes against.
Actually, I take that back; one implication of the upstream changes is that the daemon won't be able to treat `/24`-style subnet requests as implicit `0.0.0.0/N` or `::/N` prefixes anymore, as `IPAMConfig.Subnet` is unmarshaled directly to a `netip.Prefix` in `networkRouter.postNetworkCreate`. Any normalization would have to happen in the CLI instead.
Yes, agreed - we'll need to drop the shorthand and insist on having an unspecified address. That's a bit of a pity, but we'd already done that for dual-stack networks anyway, so perhaps this actually makes it a bit more consistent.
Also agree about the tests finding the gateway address without needing a restart - those tests will need an update.
As Albin mentioned, I'm looking at some refactoring to make it easier to deal with the daemon restart issue. Should have a PR ready soon.
Resolved the merge conflicts (including backing out the normalization logic) and both integration and unit tests should all be passing now.
Missed one other integration test failure in `TestDaemonDefaultBridgeIPAM_UserBr` that's fixed now.
https://github.com/moby/moby/pull/51134 is merged now ... so the change needed here is to check for an unspecified address in the subnet in `ipamConfig` (as well as a missing subnet in the config), and grab the previously allocated subnet from `ipamInfo` if there is one.
That should slot in here ... https://github.com/moby/moby/blob/a04b1aee35649645a350aefab4235db451ce22d2/daemon/libnetwork/network.go#L1561-L1566
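A rough sketch of that check (the stand-in types and the `effectiveSubnet` helper are guesses based on the discussion, not the actual network.go code):

```go
package main

import (
	"fmt"
	"net/netip"
)

// Stand-ins for the libnetwork structs: the user's desired IPAM state and
// the state of a previously allocated subnet.
type IpamConf struct{ PreferredPool string }
type IpamInfo struct{ Pool netip.Prefix }

// effectiveSubnet picks the subnet to request from IPAM on restart: if the
// configured subnet is missing or size-only (unspecified address), reuse the
// previously allocated subnet so the network keeps the same range (and
// gateway) across daemon restarts; otherwise honor the configured subnet.
func effectiveSubnet(cfg IpamConf, prev *IpamInfo) string {
	if cfg.PreferredPool != "" {
		if p, err := netip.ParsePrefix(cfg.PreferredPool); err != nil || !p.Addr().IsUnspecified() {
			return cfg.PreferredPool // concrete subnet: honor it
		}
	}
	if prev != nil && prev.Pool.IsValid() {
		return prev.Pool.String() // size-only or missing: keep the old subnet
	}
	return cfg.PreferredPool // nothing allocated yet: let IPAM choose
}

func main() {
	prev := &IpamInfo{Pool: netip.MustParsePrefix("172.20.0.0/24")}
	fmt.Println(effectiveSubnet(IpamConf{PreferredPool: "0.0.0.0/24"}, prev)) // 172.20.0.0/24
}
```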
Ideally, I think we should have an integration test that configures a network with just a prefix length, restarts the daemon, and checks that the same subnet came back - something like `TestPortMappingRestore`.
Updated that block and added a new integration test that validates stable subnets for v4, v6, and dual stack networks on restart.
Excellent, thank you! I'll take a final look through in the morning, but I think it's in good shape.
cc @dperny - want to take a look as well?
I'm assuming the remaining failing tests are likely noise or an issue with the GitHub runners, as they seem to be due to disk capacity issues in an unrelated integration test.
Yes, test failures look unrelated to this change. As you say, it seems a test runner is low on space. Strange to have hit the same thing on the first two runs, but I've given it another kick.
```
=== FAIL: amd64.docker.docker.integration.build TestBuildWithHugeFile (34.60s)
    build_test.go:514: assertion failed: string "{\"stream\":\"Step 1/2 : FROM busybox\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e 1c35c4412082\\n\"}\r\n{\"stream\":\"Step 2/2 : RUN for g in $(seq 0 8); do dd if=/dev/urandom of=rnd bs=1K count=1 seek=$((1024*1024*g)) status=none; done \\u0026\\u0026 ls -la rnd \\u0026\\u0026 du -sk rnd\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in 93b70daf87fb\\n\"}\r\n{\"stream\":\"-rw-r--r-- 1 root root 8589935616 Oct 10 08:16 rnd\\n\"}\r\n{\"stream\":\"36\\trnd\\n\"}\r\n{\"stream\":\" ---\\u003e Removed intermediate container 93b70daf87fb\\n\"}\r\n{\"errorDetail\":{\"message\":\"write /rnd: no space left on device\"},\"error\":\"write /rnd: no space left on device\"}\r\n" does not contain "Successfully built"
```
Can I trouble someone to re-run the one failing unit test?
Looks like it's going to keep failing ... lots of stuff is broken today, but definitely unrelated to this PR.
Yeah, us-east-1 going down does tend to have some severe knock-on effects ☹️.
@danegsta You have one conflict to resolve, and we should be able to merge your PR afterward.