moby icon indicating copy to clipboard operation
moby copied to clipboard

Proposal: Allow hinting to ipam a preferred subnet size from default pool

Open danegsta opened this issue 7 months ago • 3 comments

- What I did

Added a new preferred_default_address_pool_size ipam option to hint a preferred subnet size for networks allocated from the default pool. This is treated as optional and invalid values (or values that would result in a larger subnet than configured for a default pool) are ignored.

Implements #50107

- How I did it

Added support for a new ipam option in the defaultipam implementation that, if present, will pass a preferred subnet size to addrSpace.allocatePredefinedPool and may result in a smaller subnet being allocated if specified.

- How to verify it

I've added unit tests for the new behavior, but the logic can be manually verified by passing the preferred_default_address_pool_size ipam option to docker network create with a smaller subnet size than would be provided by the default pool configurations (i.e. --ipam-opt preferred_default_address_pool_size=24).

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

danegsta avatar May 30 '25 17:05 danegsta

One thing I haven't done in this PR is account for the case where only a subset of the available network pools may be able to fulfill the given preferred network size. For example, imagine pool A has a default size of /24 and pool B has a default size of /20 and the user requests a preferred size of /20. It likely makes sense to attempt to attempt to allocate an address from pool B first (even if pool A isn't fully allocated) before falling back to a /24 network from pool A if there's no available range in pool B.

If a preferred network size is requested, it's likely worth ordering the available pools by default pool size before attempting to allocate a subnet to ensure that subnets that can fulfill the preferred size (or are closes to the preferred size) get precedence.

danegsta avatar May 30 '25 18:05 danegsta

Thank you! It's perfectly fine for things to be incomplete, or even "throw away code". Having something to look at, or to give it a spin, and to have the discussion never hurts!

thaJeztah avatar May 30 '25 18:05 thaJeztah

I ended up adding a step to sort the pools if a preferred size is specified (with the idea being to prefer pools that are able to accommodate the preferred size when allocating a network).

My current design treats the default (or configured) pool prefix sizes as hard limits; the resulting network can have a smaller address range than the default, but never larger. For example, if a pool was configured with a default size of 24 and a request was made to create a network with a preferred prefix size of 16, any network from the pool would still be limited to a 24 prefix.

One other possible approach would be to honor the preferred prefix size as long as it was valid for the underlying base network for a given pool. For example, take a default pool configured with a base network of 172.17.0.0/12 with a default allocation size of 24. If a preferred size of 20 was given, that could be honored as it still fits within the overall /12 base subnet. In that case only the base prefix size would need to be considered when evaluating networks, not the default allocation size.

danegsta avatar May 30 '25 20:05 danegsta

Just found a previous PR/discussion related to address pool sizes. So, linking it here, but it looks like that work stalled in April last year ...

  • https://github.com/moby/moby/pull/47737#issuecomment-2078055376

robmry avatar Jul 25 '25 10:07 robmry

Just found a previous PR/discussion related to address pool sizes. So, linking it here, but it looks like that work stalled in April last year ...

I need to do some work to get this ready to move out of draft; I've been mostly heads down on an upcoming release at work, but should have time again to devote to this proposal now.

danegsta avatar Jul 25 '25 17:07 danegsta

I'm suspicious there'll still be a few changes warranted to this PR, but I think it's at a place where it's at ready for actual review.

danegsta avatar Sep 18 '25 23:09 danegsta

but it looks like that work stalled in April last year ...

Author of #47737 here, I was not given clear instructions or guidance on the next steps. There are unresolved arguments on whether "non network-savvy users can easily read and identify /22 subnets" and that's pretty much why it stalled. On my side all required code changes have been done and all I was doing was awaiting input.

iBug avatar Sep 19 '25 12:09 iBug

but it looks like that work stalled in April last year ...

Author of #47737 here, I was not given clear instructions or guidance on the next steps. There are unresolved arguments on whether "non network-savvy users can easily read and identify /22 subnets" and that's pretty much why it stalled. On my side all required code changes have been done and all I was doing was awaiting input.

I'd argue the two changes are complementary; I don't think anything in this PR invalidates the benefit of more reasonable defaults that increase the number of simultaneous networks available to the average Docker user. I see this PR as being more beneficial for developer tooling that interacts with Docker networking (coincidentally exactly what I work on, but I also recognize it's a more niche scenario). My main motivation with this PR is that, if I know that even 1024 unique addresses is more than a given network will require, I'd rather be a good steward of a developer's system resources and request a smaller range such as /24 (or even /25).

danegsta avatar Sep 19 '25 16:09 danegsta

[We discussed this PR during today's libnet maintainers call.

We all agreed that the --subnet-based UX is better than forcing users to specify an --ipam-opt.

Regarding API representation, we think that it's fine to put unspecified prefixes in IPAM.Config.Subnet. Older daemons will accept these values, but will fail to attach any container to such networks (because they'll try to allocate / assign the 0.0.0.1 address). There's not much we can do about that — clients will need to discover whether the daemon support this feature (by checking the API version).

We considered putting the subnet size in an IPAM option at the API level — i.e. while still allowing users to specify unspecified prefixes through --subnet — but this would require the CLI / Compose to translate subnet values into IPAM options. Since this would be a new CLI / Compose behavior, older versions would send these as IPAM.Config.Subnet, so that wouldn't solve the initial compatibility issue (i.e. consumers would have to check CLI / Compose version to discover whether they support that translation, plus they'd have to check the API version to know whether the daemon will honor it).

Regarding this comment:

My current design treats the default (or configured) pool prefix sizes as hard limits; the resulting network can have a smaller address range than the default, but never larger. For example, if a pool was configured with a default size of 24 and a request was made to create a network with a preferred prefix size of 16, any network from the pool would still be limited to a 24 prefix.

We're rather erring on the side of making the subnet size a mandatory requirement the Engine has to honor, instead of an hint that the Engine may decide to ignore. The reason is that we're considering https://github.com/moby/moby/pull/47737 as a next step, and we'd like to take that PR even further by making the default subnet size a global config parameter instead of a per address pool parameter as is currently the case.

akerouanton avatar Sep 30 '25 17:09 akerouanton

[We discussed this PR during today's libnet maintainers call.

We all agreed that the --subnet-based UX is better than forcing users to specify an --ipam-opt.

Regarding API representation, we think that it's fine to put unspecified prefixes in IPAM.Config.Subnet. Older daemons will accept these values, but will fail to attach any container to such networks (because they'll try to allocate / assign the 0.0.0.1 address). There's not much we can do about that — clients will need to discover whether the daemon support this feature (by checking the API version).

We considered putting the subnet size in an IPAM option at the API level — i.e. while still allowing users to specify unspecified prefixes through --subnet — but this would require the CLI / Compose to translate subnet values into IPAM options. Since this would be a new CLI / Compose behavior, older versions would send these as IPAM.Config.Subnet, so that wouldn't solve the initial compatibility issue (i.e. consumers would have to check CLI / Compose version to discover whether they support that translation, plus they'd have to check the API version to know whether the daemon will honor it).

Regarding this comment:

My current design treats the default (or configured) pool prefix sizes as hard limits; the resulting network can have a smaller address range than the default, but never larger. For example, if a pool was configured with a default size of 24 and a request was made to create a network with a preferred prefix size of 16, any network from the pool would still be limited to a 24 prefix.

We're rather erring on the side of making the subnet size a mandatory requirement the Engine has to honor, instead of an hint that the Engine may decide to ignore. The reason is that we're considering #47737 as a next step, and we'd like to take that PR even further by making the default subnet size a global config parameter instead of a per address pool parameter as is currently the case.

Sounds good, the current implementation is fairly close to this already; I'm treating the subnet size as mandatory (when specified) and honoring it via --subnet (works for multiple subnets including a mix of IPv4 and v6). The only real change would be removing the --ipam-opt which is trivial and would only require removing a single if block and updating a few tests. My main motivation for including the --ipam-opt was primarily to make my life easier integrating this feature with developer tooling that I work on, but it's nothing that I can't work around via checking for supported versions.

danegsta avatar Sep 30 '25 18:09 danegsta

@akerouanton with the latest commits, I believe the PR is at a place where it matches the desired behavior.

danegsta avatar Sep 30 '25 18:09 danegsta

I missed a unit test that was validating RequestPool with "all address" subnet requests (0.0.0.0/x, ::/x) and expecting the previous behavior (try to process it as an explicit subnet request). Updated the test to account for the new default pool allocation behavior.

danegsta avatar Oct 01 '25 17:10 danegsta

Could you squash your commits please? It'd make it easier to review from an IDE. At a quick glance, I think there's no upfront refactoring, so it should probably go into a single commit.

akerouanton avatar Oct 02 '25 09:10 akerouanton

Could you squash your commits please? It'd make it easier to review from an IDE. At a quick glance, I think there's no upfront refactoring, so it should probably go into a single commit.

Done!

danegsta avatar Oct 02 '25 16:10 danegsta

@danegsta FYI we looked at your PR today with @robmry but we found that when the daemon restarts it picks a different subnet but keeps the previous gateway. This is wrong, but it doesn't come from your PR — it comes from this code path:

https://github.com/moby/moby/blob/8d4cb6e07168790c8eb4d976218ca70a3c31649d/daemon/libnetwork/controller.go#L774-L780

We should probably delete that code, and update the following method (or its caller) to pick either []*IpamConf (i.e. user's desired state) when no subnet was allocated beforehand, or reuse []*IpamInfo when there's one:

https://github.com/moby/moby/blob/8d4cb6e07168790c8eb4d976218ca70a3c31649d/daemon/libnetwork/network.go#L1532

(And we should probably remove all that pointer nonsense along the way.)

@robmry will submit a separate PR to fix that.

akerouanton avatar Oct 07 '25 17:10 akerouanton

@danegsta FYI we looked at your PR today with @robmry but we found that when the daemon restarts it picks a different subnet but keeps the previous gateway. This is wrong, but it doesn't come from your PR — it comes from this code path:

https://github.com/moby/moby/blob/8d4cb6e07168790c8eb4d976218ca70a3c31649d/daemon/libnetwork/controller.go#L774-L780

We should probably delete that code, and update the following method (or its caller) to pick either []*IpamConf (i.e. user's desired state) when no subnet was allocated beforehand, or reuse []*IpamInfo when there's one:

https://github.com/moby/moby/blob/8d4cb6e07168790c8eb4d976218ca70a3c31649d/daemon/libnetwork/network.go#L1532

(And we should probably remove all that pointer nonsense along the way.)

@robmry will submit a separate PR to fix that.

Looks like there are some some merge conflicts due to daemon/network.go and integration/daemon/daemon_linux_test.go updates, but nothing that look particularly awkward to merge my changes against.

danegsta avatar Oct 07 '25 17:10 danegsta

Actually, I take that back; one implication of the upstream changes is the daemon won't be able to treat /24 style subnet requests as implicit 0.0.0.0/N or ::/N prefixes anymore as IPAMConfig.Subnet is being marshaled directly to a netip.Prefix in networkRouter.postNetworkCreate. Any normalization would have to happen in the CLI instead.

danegsta avatar Oct 07 '25 18:10 danegsta

Actually, I take that back; one implication of the upstream changes is the daemon won't be able to treat /24 style subnet requests as implicit 0.0.0.0/N or ::/N prefixes anymore as IPAMConfig.Subnet is being marshaled directly to a netip.Prefix in networkRouter.postNetworkCreate. Any normalization would have to happen in the CLI instead.

Yes, agreed - we'll need to drop the shorthand and insist on having an unspecified address. That's a bit of a pity, but we'd already done that for dual-stack networks anyway, so perhaps this actually makes it a bit more consistent.

Also agree about the tests finding the gateway address without needing a restart - those tests will need an update.

As Albin mentioned, I'm looking at some refactoring to make it easier to deal with the daemon restart issue. Should have a PR ready soon.

robmry avatar Oct 07 '25 18:10 robmry

Resolved the merge conflicts (including backing out the normalization logic) and both integration and unit tests should all be passing now.

danegsta avatar Oct 07 '25 19:10 danegsta

Missed one other integration test failure in TestDaemonDefaultBridgeIPAM_UserBr that's fixed now.

danegsta avatar Oct 07 '25 21:10 danegsta

https://github.com/moby/moby/pull/51134 is merged now ... so the change needed here is to check for the unspecified address in the subnet in ipamConfig (as well as a missing subnet in the config), and grab the previously allocated subnet from ipamInfo if there is one.

That should slot in here ... https://github.com/moby/moby/blob/a04b1aee35649645a350aefab4235db451ce22d2/daemon/libnetwork/network.go#L1561-L1566

Ideally, I think we should have an integration test that configures a network with just a prefix length, restarts the daemon, and checks the same subnet came back - something like TestPortMappingRestore.

robmry avatar Oct 08 '25 16:10 robmry

Updated that block and added a new integration test that validates stable subnets for v4, v6, and dual stack networks on restart.

danegsta avatar Oct 08 '25 17:10 danegsta

Updated that block and added a new integration test that validates stable subnets for v4, v6, and dual stack networks on restart.

Excellent, thank you! I'll take a final look through in the morning, but I think it's in good shape.

robmry avatar Oct 08 '25 18:10 robmry

cc @dperny - want to take a look as well?

corhere avatar Oct 09 '25 18:10 corhere

I'm assuming the remaining failing tests are likely noise or an issue with the github agent as they seem to be due to disk capacity issues on an unrelated integration test.

danegsta avatar Oct 10 '25 16:10 danegsta

Yes, test failures look unrelated to this change. As you say, it seems a test runner is low on space. Strange to have hit the same thing on the first two runs, but I've given it another kick.

=== FAIL: amd64.docker.docker.integration.build TestBuildWithHugeFile (34.60s)
    build_test.go:514: assertion failed: string "{\"stream\":\"Step 1/2 : FROM busybox\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e 1c35c4412082\\n\"}\r\n{\"stream\":\"Step 2/2 : RUN for g in $(seq 0 8); do dd if=/dev/urandom of=rnd bs=1K count=1 seek=$((1024*1024*g)) status=none; done  \\u0026\\u0026 ls -la rnd \\u0026\\u0026 du -sk rnd\"}\r\n{\"stream\":\"\\n\"}\r\n{\"stream\":\" ---\\u003e Running in 93b70daf87fb\\n\"}\r\n{\"stream\":\"-rw-r--r--    1 root     root     8589935616 Oct 10 08:16 rnd\\n\"}\r\n{\"stream\":\"36\\trnd\\n\"}\r\n{\"stream\":\" ---\\u003e Removed intermediate container 93b70daf87fb\\n\"}\r\n{\"errorDetail\":{\"message\":\"write /rnd: no space left on device\"},\"error\":\"write /rnd: no space left on device\"}\r\n" does not contain "Successfully built"

robmry avatar Oct 11 '25 05:10 robmry

Can I trouble someone to re-run the one failing unit test?

danegsta avatar Oct 20 '25 17:10 danegsta

Can I trouble someone to re-run the one failing unit test?

Looks like it's going to keep failing ... lots of stuff is broken today, but definitely unrelated to this PR.

robmry avatar Oct 20 '25 18:10 robmry

Can I trouble someone to re-run the one failing unit test?

Looks like it's going to keep failing ... lots of stuff is broken today, but definitely unrelated to this PR.

Yeah, us-east-1 going down does tend to have some severe knock-on effects ☹️.

danegsta avatar Oct 20 '25 18:10 danegsta

@danegsta You have one conflict to resolve, and we should be able to merge your PR afterward.

akerouanton avatar Oct 22 '25 15:10 akerouanton