libnetwork
libnetwork copied to clipboard
bitseq: Unselected() does not check datastore
The problem
When calling RequestAddress
on an IPAM, the ErrNoAvailableIPs
error can be raised even if there are remaining addresses when some conditions are met.
This comes from the h.Unselected() <= 0
condition beeing checked on the bitmask without synchronizing it with the datastore first (ipam/allocator.go:534).
How to reproduce
The bug was found when using a swarm legacy cluster with a consul datastore (Docker version 17.05.0-ce).
- Create an overlay network small with a small enough IP range.
- On a first node, start as much containers as the network allows it.
- On a second node, try to start a container on this network => "no available IPv4 addresses". Now the local bitseq is synchronised with consul. The IPAM's
Alllocator
is such thata.addresses[k].Unselected() == 0
,k
beeing the network subnet - On node n°1, stop one (or more) of the containers. The bitmask in the consul datastore is now correctly updated, but node n°2 was not notified of this in any way.
- On node n °2, try to start a container again => same error message. We still have
a.addresses[k].Unselected() == 0
and no call to consul is made to refresh the bitmask. From this point this is impossible to start any container on the network using this node. Restarting the docker daemon fixes the problem.
I'll submit a (naive) patch to handle this problem soon.
@abhinandanpb can you take a look?
For information, I came up with a pretty straightforward workaround here: fc84371
It simply consists in removing checks on handle.Unselected() == 0
. The ErrNoAvailableIPs
will eventually be raised by handle.set(...)
when needed after fetching fresh data from the store. There must be a minor performance hit but functionally it seems to be the same (I couldn't manage to run the test suite of the project so I'm just assuming this).
I'm not making a PR now because:
- I suppose there must be a better and less brutal way to fix this issue
- I don't really know how test this (although when building the docker daemon with my fork of libnetwork the bug is fixed)
@jlacoline thanks for reporting the issue and also following up with the pausible clause. On the outset it does look like a bug. We do need to check with datastore to get the first hand information. Allow me some time to take another look and suggest a fix if you are interested in raising the PR.
ping @abhi
Hello @abhi @fcrisciani Are there news about this bug ? Thanks in advance