
Generate split subnets on-demand and properly shift bytes of IPv6 subnets

akerouanton opened this pull request 3 years ago • 19 comments

I bundled these two commits together because the second one is easier to test with the changes made in the first. I can still split this PR into two parts and rewrite the unit test of the second commit so it works against the current master branch. Let me know if you prefer that (e.g. if you want to cherry-pick this change into older releases, or because the diff is too big to review at once).

- 1st commit

This commit resolves #40275 by implementing a custom iterator named NetworkSplitter. It splits a set of NetworkToSplit into smaller subnets on demand, through its Get method.

Prior to this change, the list of NetworkToSplit was split into smaller subnets when ConfigLocalScopeDefaultNetworks or ConfigGlobalScopeDefaultNetworks was called, or when the ipamutils package was loaded. When one of the Config functions was called with an IPv6 network to split into small subnets, all the available memory was consumed. For instance, fd00::/8 split into /96 subnets would take ~5*10^27 bytes.

Although this change trades memory consumption for computation cost, the NetworkSplitter is used by the libnetwork/ipam package in such a way that it only has to compute the address of the next subnet. When the NetworkSplitter reaches the end of its NetworkToSplit list, it's reset by libnetwork/ipam only if some subnets were released beforehand. In that case, the ipam package might iterate over all the subnets before finding an available one. This is the worst case, but it should not be very impactful: either there are many subnets (more than one host can handle), or there are few and a full iteration over the NetworkSplitter is fast.
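
To make that concrete, here's a minimal Go sketch of such an on-demand iterator. All names are illustrative only (they mirror the description above, not the actual libnetwork/ipamutils code), and big.Int stands in for the fixed-width arithmetic of the real implementation:

package main

import (
	"fmt"
	"math/big"
	"net"
)

// splitter is a hypothetical stand-in for the NetworkSplitter described
// above: it hands out subnets one at a time instead of materializing the
// whole list up front.
type splitter struct {
	base    *net.IPNet // pool to split, e.g. fd00::/8
	size    int        // prefix length of the generated subnets
	ordinal *big.Int   // index of the next subnet to hand out
}

func newSplitter(base *net.IPNet, size int) *splitter {
	return &splitter{base: base, size: size, ordinal: new(big.Int)}
}

// Get computes only the next subnet, so memory use stays constant
// regardless of how many subnets the pool contains.
func (s *splitter) Get() (*net.IPNet, bool) {
	baseLen, addrLen := s.base.Mask.Size()
	// Number of subnets is 2^(size-baseLen); big.Int avoids overflow
	// even for IPv6 pools with more than 2^64 subnets.
	total := new(big.Int).Lsh(big.NewInt(1), uint(s.size-baseLen))
	if s.ordinal.Cmp(total) >= 0 {
		return nil, false // pool exhausted
	}
	// Next subnet address = base + (ordinal << (addrLen - size)).
	addr := new(big.Int).SetBytes(s.base.IP)
	addr.Add(addr, new(big.Int).Lsh(s.ordinal, uint(addrLen-s.size)))
	ip := addr.FillBytes(make(net.IP, len(s.base.IP)))
	s.ordinal.Add(s.ordinal, big.NewInt(1))
	return &net.IPNet{IP: ip, Mask: net.CIDRMask(s.size, addrLen)}, true
}

// Reset rewinds the iterator so released subnets can be handed out again.
func (s *splitter) Reset() { s.ordinal.SetInt64(0) }

func main() {
	_, pool, _ := net.ParseCIDR("fd00::/8")
	s := newSplitter(pool, 96)
	for i := 0; i < 3; i++ {
		subnet, _ := s.Get()
		fmt.Println(subnet) // fd00::/96, fd00::1:0:0/96, fd00::2:0:0/96
	}
}

Since Get derives each subnet from a single counter, handing one out is O(1) in memory no matter how large the pool is; only the post-Reset scan over released subnets can degrade to a full iteration, as described above.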

- 2nd commit

Prior to this change, IPv6 subnet generation was broken because of a bitwise shift overflowing uint64.

To solve this issue, addIntToIP has been changed to take an addend and compute two ordinals, applied respectively to the lower 64 bits and the upper 64 bits of IPv6 addresses.

Nothing changes for IPv4 (i.e. upperOrdinal is always 0).
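
As an illustration of that upper/lower split, here is a hedged Go sketch (the function name and signature are hypothetical, not the actual addIntToIP): the 128-bit address is treated as two 64-bit halves so no single shift can overflow a uint64.

package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// addOrdinalToIP6 adds the given subnet ordinal to an IPv6 base address
// by splitting the 128-bit addition into an upper and a lower 64-bit half.
func addOrdinalToIP6(ip net.IP, ordinal uint64, prefixLen uint) net.IP {
	out := make(net.IP, net.IPv6len)
	copy(out, ip.To16())

	hi := binary.BigEndian.Uint64(out[:8])
	lo := binary.BigEndian.Uint64(out[8:])

	shift := 128 - prefixLen
	var loAdd, hiAdd uint64
	if shift >= 64 {
		// The increment only touches the upper 64 bits; handling this
		// case separately avoids the uint64 overflow mentioned above.
		hiAdd = ordinal << (shift - 64)
	} else {
		loAdd = ordinal << shift
		// Bits shifted past the lower half spill into the upper half
		// (in Go, shifting a uint64 by 64 yields 0, so shift == 0 is safe).
		hiAdd = ordinal >> (64 - shift)
	}

	newLo := lo + loAdd
	if newLo < lo { // carry from the lower half into the upper half
		hiAdd++
	}
	binary.BigEndian.PutUint64(out[:8], hi+hiAdd)
	binary.BigEndian.PutUint64(out[8:], newLo)
	return out
}

func main() {
	base := net.ParseIP("2a01:db8::")
	// 3rd /120 subnet under the base: 2a01:db8:: + (2 << 8)
	fmt.Println(addOrdinalToIP6(base, 2, 120)) // 2a01:db8::200
}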

- How to verify it

I added unit tests to confirm these changes are okay. Moreover, I tried to run dockerd with the following config:

{
	"log-level": "info",
	"ipv6": true,
	"experimental": true,
	"ip6tables": true,
	"fixed-cidr-v6": "2a01:e34:xxxx:yyyy::/64",
	"default-address-pools": [
		{ "base": "172.17.0.0/16", "size": 16 },
		{ "base": "172.18.0.0/16", "size": 16 },
		{ "base": "2a01:e34:xxxx:yyyy::/64", "size": 120 }
	]
}

When starting dockerd v20.10.10 with this config, the daemon crashes with this panic message: runtime error: makeslice: cap out of range. A patched version of dockerd starts without crashing.

- Description for the changelog

  • Generate split subnets on-demand so that creating small IPv6 subnets from big subnets no longer consumes all the available memory (fixes #40275)
  • Properly shift bytes of IPv6 subnets (fixes #42801)

akerouanton avatar Nov 19 '21 02:11 akerouanton

Hey, awesome that somebody is giving IPv6 some love 👍 Any chance this fixes #41438 as well?

Xyaren avatar Nov 25 '21 22:11 Xyaren

@Xyaren I still need a bit of time to test it properly but yeah, it should fix #41438 as (I believe) this is a consequence of the uint64 overflow :wink: Thanks for mentioning this issue :slightly_smiling_face:

akerouanton avatar Nov 26 '21 11:11 akerouanton

@thaJeztah PTAL :slightly_smiling_face: (I don't know who else I should ping. Maybe arkodg?)

akerouanton avatar Nov 29 '21 00:11 akerouanton

So... about 5 months later... Is anyone able to look at this, please?

Xyaren avatar Apr 29 '22 16:04 Xyaren

@akerouanton looks like some failures in a test (may need some extra options in it);

=== Failed
=== FAIL: github.com/docker/docker/daemon/config TestDaemonConfigurationMergeDefaultAddressPools/empty_config_file (0.00s)
    config_test.go:155: assertion failed: cannot handle unexported field at {[]*ipamutils.NetworkToSplit}[0].ipVersion:
        	"github.com/docker/docker/libnetwork/ipamutils".NetworkToSplit
        consider using a custom Comparer; if you control the implementation of type, you can also consider using an Exporter, AllowUnexported, or cmpopts.IgnoreUnexported
    --- FAIL: TestDaemonConfigurationMergeDefaultAddressPools/empty_config_file (0.00s)
=== FAIL: github.com/docker/docker/daemon/config TestDaemonConfigurationMergeDefaultAddressPools/config_file (0.00s)
    config_test.go:165: assertion failed: cannot handle unexported field at {[]*ipamutils.NetworkToSplit}[0].ipVersion:
        	"github.com/docker/docker/libnetwork/ipamutils".NetworkToSplit
        consider using a custom Comparer; if you control the implementation of type, you can also consider using an Exporter, AllowUnexported, or cmpopts.IgnoreUnexported
    --- FAIL: TestDaemonConfigurationMergeDefaultAddressPools/config_file (0.00s)

thaJeztah avatar Jun 07 '22 20:06 thaJeztah

@thaJeztah Ah, looks like a bad rebase. Should be fixed now.

akerouanton avatar Jun 11 '22 22:06 akerouanton

@thaJeztah Any chance to have another look at this? Would be really nice to have that fixed!

max-wittig avatar Sep 07 '22 12:09 max-wittig

@thaJeztah @cpuguy83 Friendly ping 😄 . We've added IPv6 networking support to the gitlab-runner, but we have no way to use it, as IPv6 networking in Docker is still broken when used in this way.

Is there any way I can support the efforts here?

max-wittig avatar Oct 10 '22 07:10 max-wittig

@akerouanton Would you mind implementing these small requested changes? It would really help us to use Docker with IPv6 🥺

max-wittig avatar Nov 01 '22 15:11 max-wittig

@akerouanton Looks like the linter caught a typo;

libnetwork/ipamutils/utils.go:160:19: `primarly` is a misspelling of `primarily` (misspell)
// This method is primarly used to duplicate NetworkSplitter in tests.
                  ^

And there's one test failure (haven't looked whether it's a legit failure, or a test that needs updating);

=== FAIL: github.com/docker/docker/libnetwork/ipamutils TestInitPoolWithIPv6 (0.00s)
    utils_test.go:154: assertion failed: <nil> (string) != fd00:0:ffff:ffff:ffff:ffff::/96 (string)
    utils_test.go:158: assertion failed: <nil> (string) != fd00:0:ffff:ffff:ffff:ffff::/96 (string)

thaJeztah avatar Dec 23 '22 15:12 thaJeztah

@thaJeztah The failing test looks legit but I haven't had time to debug it. I'll fix it later and ping you when it's done 😉

akerouanton avatar Dec 23 '22 15:12 akerouanton

@thaJeztah I'm not sure why the integration-cli job is failing; I can't reproduce this error on my computer. Could you take a look, please?

akerouanton avatar Dec 28 '22 16:12 akerouanton

Hm... not sure; would have to check why it fails. I can give it a kick (in case there's some flakiness in that test, or (e.g.) some other daemon was still running).

Copied the error below (in case we need to revisit)

=== Failed
=== FAIL: amd64.integration-cli TestDockerNetworkSuite/TestDockerNetworkIPAMMultipleNetworks (15.30s)
    docker_cli_network_unix_test.go:588: assertion failed: 
        Command:  /usr/local/cli/docker network create test3
        ExitCode: 1
        Error:    exit status 1
        Stdout:   
        Stderr:   Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network
        
        
        Failures:
        ExitCode was 1 expected 0
        Expected no error
    docker_cli_network_unix_test.go:46: [d17053f569bde] daemon is not started
    --- FAIL: TestDockerNetworkSuite/TestDockerNetworkIPAMMultipleNetworks (15.30s)

=== FAIL: amd64.integration-cli TestDockerNetworkSuite (769.49s)

thaJeztah avatar Dec 28 '22 17:12 thaJeztah

It looks like a legitimate test failure. In addition, the commits are unsigned and you have verification turned on. Could you make sure your commits are signed after you solve the test issue?

neersighted avatar Jan 03 '23 15:01 neersighted

@akerouanton some test failures are known/from a stale head. Can you please rebase onto master?

neersighted avatar Jan 09 '23 15:01 neersighted

@neersighted Already did that before my last force-push :wink:

akerouanton avatar Jan 09 '23 15:01 akerouanton

Hmm, odd we're still seeing the buildkit-related ones then:

    --- FAIL: TestIntegration/TestMountRWCache/worker=dockerd/frontend=client (1.87s)
    --- FAIL: TestIntegration/TestMountRWCache/worker=dockerd/frontend=builtin (1.47s)

I might have to do some digging later, after the other test failures are resolved, I guess.

neersighted avatar Jan 09 '23 15:01 neersighted

I think all the failing tests are flaky. From my standpoint, this PR is ready for review. Could you maybe restart the failing job, to see if those errors happen again?

akerouanton avatar Jan 09 '23 16:01 akerouanton

@neersighted I just rebased onto master, hopefully the CI will be green this time.

akerouanton avatar Jan 12 '23 11:01 akerouanton

Is this PR still active? The bug is still causing massive headaches in https://github.com/moby/moby/issues/42801 and https://github.com/moby/moby/issues/41438. And it seems it only requires rebasing?

TobTobXX avatar Apr 10 '23 13:04 TobTobXX

So, after some more testing, this PR makes it clear that the way we check for overlapping subnets is quite naive (it works well for IPv4, but not for IPv6), and/or that the current Subnetter is the wrong abstraction.

For instance, if a user specifies a default address pool with the base CIDR 10.0.0.0/8 and a subnet size of 30 in their daemon.json, and sets the bip config param to 10.0.0.1/8, then when the user wants to create a new custom network the daemon has to iterate over all 2^(30-8) subnets (~4 million) before returning an error. This takes ~0.5s (on my computer), so the issue is not really noticeable. Unfortunately, with IPv6 the worst case is so much bigger that it becomes unusable: with a /64 address pool split into /120 subnets, the current implementation has to iterate over 2^56 possible subnets (~7 * 10^16) before returning an error (extrapolating from the IPv4 timing, that's on the order of centuries).

$ cat /etc/docker/daemon.json
{
	"log-level": "info",
	"bip": "10.0.0.1/8",
	"fixed-cidr": "10.0.0.0/8",
	"default-address-pools": [
		{ "base": "10.0.0.0/8", "size": 30 }
	]
}
$ docker network create testnet1

Now, another case where the same bug appears. Two default address pools are defined: the first yields a single subnet and matches the bip/fixed-cidr params; the other is an arbitrary address pool (e.g. 10.0.0.0/8 with size 24). If the user creates a network with a subnet matching the 2nd address pool's CIDR and then tries to create a 3rd network with no subnet specified, all subnets from the 2nd address pool are enumerated before (*addrSpace).allocatePredefinedPool() returns an error.

$ cat /etc/docker/daemon.json
{
	"bip": "172.19.0.1/24",
	"fixed-cidr": "172.19.0.0/24",
	"default-address-pools": [
		{ "base": "172.19.0.0/24", "size": 24 },
		{ "base": "10.0.0.0/8", "size": 24 }
	]
}
$ docker network create --subnet=10.0.0.0/8 testnet1
$ docker network create testnet2

Since the addrSpace is locked during this "sometimes almost endless" loop (in (*addrSpace).allocatePredefinedPool()), the network controller can neither request nor release addresses and subnets in the meantime.

I can only come up with these two options to fix this issue:

  • Disallow bip/fixed-cidr from overlapping with any of the default address pools, and disallow custom networks with a user-defined subnet from colliding with the default address pools. We might break things if we do so, so it's probably not a good idea.
  • Make the Subnetter aware of which subnets are currently in use (see the sketch below).
    • Unfortunately, just moving its internal offset to skip custom subnets won't be enough, since the whole Subnetter might be reset (and it's currently always reset at startup because of the way the default bridge is restored). Once reset, we're back to square one.
    • The other option is to merge the Subnetter into addrSpace, making it able to yield new subnets based on both the configured default address pools and user-supplied custom subnets. This poses the question of which data structure to use for that purpose. I can't think of anything other than LC/LPC/radix tries, but I'm wondering if a simpler data structure could help solve this issue.
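
For illustration, here's a minimal Go sketch (hypothetical names, not libnetwork code) of the overlap predicate such an in-use-aware Subnetter would need: two aligned CIDR blocks overlap exactly when one contains the other's base address, so a naive linear scan over the in-use set works, and the trie mentioned above would only replace that scan with something faster.

package main

import (
	"fmt"
	"net"
)

// Two CIDR blocks overlap iff one contains the other's base address
// (CIDR blocks are aligned, so partial overlap is impossible).
func overlaps(a, b *net.IPNet) bool {
	return a.Contains(b.IP) || b.Contains(a.IP)
}

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	// Subnets currently in use: the default bridge (bip/fixed-cidr)
	// and a user-created network, as in the scenarios above.
	inUse := []*net.IPNet{
		mustCIDR("172.19.0.0/24"),
		mustCIDR("10.0.0.0/8"),
	}
	for _, candidate := range []string{"10.0.0.0/24", "192.168.0.0/16"} {
		c, free := mustCIDR(candidate), true
		for _, used := range inUse {
			if overlaps(c, used) {
				fmt.Printf("%s overlaps %s\n", c, used)
				free = false
			}
		}
		if free {
			fmt.Printf("%s is free\n", c)
		}
	}
}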

akerouanton avatar Apr 14 '23 14:04 akerouanton

Earlier response

For instance, if a user specifies a default address pool with the base CIDR 10.0.0.0/8 and a subnet size of 30 in their daemon.json, and sets the bip config param to 10.0.0.1/8, then when the user wants to create a new custom network the daemon has to iterate over all 2^(30-8) subnets (~4 million) before returning an error.

Realistically, that many subnets are not going to be used. The current defaults AFAIK have a much smaller number of subnets available in the pool? (IIRC, this is poorly documented though)

The issue here though is:

  • Address pool with /8 base to be subdivided into /30 subnets.
  • Any other network setting an explicit subnet that overlaps (the /8).

Only one of those configurations is valid to support at a time, and could be detected when starting the daemon?

  • If the /8 is already reserved for the docker0 bridge, then an address pool base of /8 or smaller would be invalid before even considering size.
  • If the address pools come first, you've already reserved the entire /8 subnet; again, size can be ignored since it's not relevant, and assigning a /8 for docker0 should fail.

Same logic applies for IPv6.


If the user creates a network with a subnet matching the 2nd address pool's CIDR and then tries to create a 3rd network with no subnet specified, all subnets from the 2nd address pool are enumerated before (*addrSpace).allocatePredefinedPool() returns an error.

This is a bit more interesting. But same concern: it's not that pragmatic to iterate through unallocated pools like that, when realistically the pools already convey reservation of the base that is split into subnets of the given size.

This check is only important when creating a new network, correct? Networks that are created persist their config with the assigned subnet, and pragmatically, how large does the number of assigned subnets get? It isn't likely to be that expensive to track with a hashmap, is it?

EDIT: I might have misunderstood, is this about:

  • docker network create --subnet 10.0.0.0/8 (not 10.0.0.0/24?)
  • docker network create (implicitly tries to assign a 10.0.0.0/24?)

Doesn't really change my feedback though either way.


Disallow bip/fixed-cidr from overlapping with any of the default address pools, and disallow custom networks with a user-defined subnet from colliding with the default address pools. We might break things if we do so, so it's probably not a good idea.

Why is it not viable to track which subnets are actually in use? The pool just defines blocks of address space that can be assigned a subnet of the given size when one is not explicitly requested.

The other option is to merge the Subnetter into addrSpace, making it able to yield new subnets based on both the configured default address pools and user-supplied custom subnets. This poses the question of which data structure to use for that purpose. I can't think of anything other than LC/LPC/radix tries, but I'm wondering if a simpler data structure could help solve this issue.

Seems we're on the same page :grinning: :+1:

Kubernetes apparently uses bitmaps. Roaring compressed bitmaps were adopted as an improvement. There is some recent related work on k8s, so I think they still prefer that approach.

There's also a nice article (APNIC) investigating a few data structures for working with IPs, although I'm not sure how well any of them handle range queries to detect overlapping subnets :sweat_smile:


I put something together with some Rust code; the data structure isn't simpler, but I deferred that to existing libraries.

impl SubnetTree for SubnetManager<IPv6> {
  fn add(&mut self, cidr_block: &str) {
    let subnet = Self::subnet_from_range(cidr_block);

    if let Some(overlapped) = self.tree.search(subnet).next() {
      println!("{} overlaps IP range with subnet {}", cidr_block, overlapped.data);
    } else {
      self.tree.insert(subnet, cidr_block.to_string());
      println!("Added subnet {}", cidr_block);
    }
  }
}

The main logic is mostly that, and it can be used like this:

fn main() {
  let mut subnets = SubnetManager::<IPv6>::new();
  let existing = ["fd00:feed:face:cafe::/64", "fd00:feed:face:beef::/64"];

  // These add without issue:
  for cidr_block in existing {
    subnets.add(cidr_block);
  }

  // These will fail due to overlap:
  subnets.add("fd00:feed:face:beef::/48");
  subnets.add("fd00:feed:face::/48");
}

It works with IPv4 as well, although I just used two different trees for that. Each tree's dimensionality is set by the number of octets (4 for IPv4, 16 for IPv6), which AFAIK simplifies the range queries (I've not done any measurements).


UPDATE Nov 2023: Looks like a PR similar to the overlap check in my above example was implemented here: https://github.com/moby/moby/pull/46755

polarathene avatar Apr 17 '23 15:04 polarathene

I'm going to close this PR in favor of https://github.com/moby/moby/pull/47768. As commented here, this implementation is a dead end.

akerouanton avatar Apr 26 '24 16:04 akerouanton