moby/libnetwork/netutils/utils_linux.go
Albin Kerouanton 9d288b5b43 libnet/i/defaultipam: introduce a linear allocator
The previous allocator was subnetting address pools eagerly
when the daemon started, and would then just iterate over that
list whenever RequestPool was called. This was leading to high
memory usage whenever IPv6 pools were configured with a target
subnet size much longer than the pool's prefix size.

For instance: pool = fd00::/8, target size = /64 -- 2 ^ (64-8)
subnets would be generated upfront. This would take approx.
9 * 10^18 bits -- way too much for any computer in 2024.

Another noteworthy issue: the previous implementation was allocating
a subnet, and then in another layer was checking whether the
allocation was conflicting with some 'reserved networks'. If so,
the allocation would be retried, and so on. To make it worse,
'reserved networks' would be recomputed on every iteration. This is
highly inefficient as there could be 'reserved networks' that fully
overlap a given address pool (or many!).

To fix this issue, a new field `Exclude` is added to `RequestPool`.
It's up to each driver to take it into account. Since we don't know
whether this retry loop is useful for some remote IPAM driver, it's
reimplemented bug-for-bug directly in the remote driver.

The new allocator uses a linear-search algorithm. It takes advantage
of all lists (predefined pools, allocated subnets and reserved
networks) being sorted and logically combines 'allocated' and
'reserved' through a 'double cursor' to iterate on both lists at the
same time while preserving the total order. At the same time, it
iterates over 'predefined' pools and looks for the first empty space
that would be a good fit.

Currently, the size of the allocated subnet is still dictated by
each 'predefined' pool. We should consider hardcoding that size
instead, and letting users specify what subnet size they want. This
wasn't possible before as the subnets were generated upfront. This
new allocator should be able to deal with this easily.

The method used for static allocation has been updated to make sure
the ascending order of 'allocated' is preserved. It's bug-for-bug
compatible with the previous implementation.

One consequence of this new algorithm is that we don't keep track
of where the last allocation happened; we just allocate the first
free subnet we find.

For instance, with the following sequence:

- Allocate: 10.0.1.0/24, 10.0.2.0/24 ; Deallocate: 10.0.1.0/24 ;
Allocate.

Before, the 3rd allocation would yield 10.0.3.0/24. Now, it yields
10.0.1.0/24 once again.

As it doesn't change the semantics of the allocator, there's no
reason to worry about that.
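
The difference in ordering can be reproduced with a toy model. This sketch is
not the real allocator: slot `n` merely stands for 10.0.n.0/24, `nextFit` is a
simplified assumption about the previous behaviour (resume after the last
allocation, no wrap-around), and all names are hypothetical.

```go
package main

import "fmt"

// allocator is a toy model: slot n stands for the subnet 10.0.n.0/24.
type allocator struct {
	used map[int]bool
	last int // only used by nextFit
}

// firstFit returns the lowest free slot, like the new allocator does.
func (a *allocator) firstFit() int {
	for n := 1; ; n++ {
		if !a.used[n] {
			a.used[n] = true
			return n
		}
	}
}

// nextFit resumes the search right after the last allocated slot
// (simplified model of the previous behaviour).
func (a *allocator) nextFit() int {
	for n := a.last + 1; ; n++ {
		if !a.used[n] {
			a.used[n] = true
			a.last = n
			return n
		}
	}
}

// run replays: allocate, allocate, free the first subnet, allocate again.
// It returns the subnet yielded by the 3rd allocation.
func run(alloc func() int, a *allocator) string {
	first := alloc()
	alloc()
	delete(a.used, first)
	return fmt.Sprintf("10.0.%d.0/24", alloc())
}

func main() {
	before := &allocator{used: map[int]bool{}}
	fmt.Println("before:", run(before.nextFit, before))

	now := &allocator{used: map[int]bool{}}
	fmt.Println("now:   ", run(now.firstFit, now))
}
```

In both cases the returned subnet is free at the time of the call; only the
order in which free subnets are handed out differs.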

Finally, about 'reserved networks'. The heuristics we use are
now properly documented. It was discovered that we don't check
routes for IPv6 allocations -- this can't be changed because
there's no such thing as on-link routes for IPv6.

(Kudos to Rob Murray for coming up with the linear-search idea.)

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
2024-05-23 08:24:51 +02:00


//go:build linux

// Network utility functions.

package netutils

import (
	"net/netip"
	"os"
	"slices"

	"github.com/docker/docker/libnetwork/internal/netiputil"
	"github.com/docker/docker/libnetwork/ns"
	"github.com/docker/docker/libnetwork/resolvconf"
	"github.com/docker/docker/libnetwork/types"
	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
)

// InferReservedNetworks returns a list of network prefixes that seem to be
// used by the system and that would likely break it if they were assigned to
// some Docker networks. It uses two heuristics to build that list:
//
// 1. Nameservers configured in /etc/resolv.conf ;
// 2. On-link routes ;
//
// That 2nd heuristic was originally not limited to on-links -- all non-default
// routes were checked (see [1]). This proved to be not ideal at best and
// highly problematic at worst:
//
//   - VPN software and appliances doing split tunneling might push a small set
//     of routes for large, aggregated prefixes to avoid maintenance and
//     potential issues whenever a new subnet comes into use on the internal
//     network. However, not all subnets from these aggregates might be in use.
//   - For full tunneling, especially when implemented with OpenVPN, the
//     situation is even worse as the host might end up with the two following
//     routes: 0.0.0.0/1 and 128.0.0.0/1. They are functionally
//     indistinguishable from a default route, yet the Engine was treating them
//     differently. With those routes, there was no way to use dynamic subnet
//     allocation at all. (see 'def1' on [2])
//   - A subnet covered by the default route can be used, or not. Same for
//     non-default and non-on-link routes. The type of route says little about
//     the availability of subnets it covers, except for on-link routes as they
//     specifically define what subnet the current host is part of.
//
// The 2nd heuristic was modified to be limited to on-link routes in PR #42598
// (first released in v23.0, see [3]).
//
// If these heuristics fail to detect an overlap, users should change their
// daemon config to remove that overlapping prefix from `default-address-pools`.
// If a prefix is found to overlap but users still want it associated with a
// Docker network, they can rely on static allocation.
//
// For IPv6, the 2nd heuristic isn't applied as there's no such thing as
// on-link routes for IPv6.
//
// [1]: https://github.com/moby/libnetwork/commit/56832d6d89bf0f9d5280849026ee25ae4ae5f22e
// [2]: https://community.openvpn.net/openvpn/wiki/Openvpn23ManPage
// [3]: https://github.com/moby/moby/pull/42598
func InferReservedNetworks(v6 bool) []netip.Prefix {
	var reserved []netip.Prefix

	// We don't really care if os.ReadFile fails here. It either doesn't exist,
	// or we can't read it for some reason.
	if rc, err := os.ReadFile(resolvconf.Path()); err == nil {
		reserved = slices.DeleteFunc(resolvconf.GetNameserversAsPrefix(rc), func(p netip.Prefix) bool {
			return p.Addr().Is6() != v6
		})
	}

	if !v6 {
		reserved = append(reserved, queryOnLinkRoutes()...)
	}

	slices.SortFunc(reserved, netiputil.PrefixCompare)
	return reserved
}

// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Only IPv4 prefixes are returned as there's no such thing as on-link
// routes for IPv6.
func queryOnLinkRoutes() []netip.Prefix {
	routes, err := ns.NlHandle().RouteList(nil, netlink.FAMILY_V4)
	if err != nil {
		return nil
	}

	var prefixes []netip.Prefix
	for _, route := range routes {
		if route.Dst != nil && route.Scope == netlink.SCOPE_LINK {
			if p, ok := netiputil.ToPrefix(route.Dst); ok {
				prefixes = append(prefixes, p)
			}
		}
	}

	return prefixes
}

// GenerateIfaceName returns an interface name using the passed in
// prefix and the length of random bytes. It makes up to three attempts
// to find a name that no existing interface uses.
func GenerateIfaceName(nlh *netlink.Handle, prefix string, len int) (string, error) {
	linkByName := netlink.LinkByName
	if nlh != nil {
		linkByName = nlh.LinkByName
	}
	for i := 0; i < 3; i++ {
		name, err := GenerateRandomName(prefix, len)
		if err != nil {
			return "", err
		}
		// A "link not found" error means the name is free to use.
		_, err = linkByName(name)
		if err != nil {
			if errors.As(err, &netlink.LinkNotFoundError{}) {
				return name, nil
			}
			return "", err
		}
	}
	return "", types.InternalErrorf("could not generate interface name")
}