ec2metaproxy
Deal with network=host containers
An idea from https://github.com/kubernetes/kubernetes/issues/14226#issuecomment-184328220:
For network=host containers, it could look up the connection in /proc/net/tcp/* instead and match that against the /proc/*/fd/ symlinks, like lsof does. That's expensive, unless there's a way in iptables to munge the source IP/port to reduce the search space... loopback has this whole 127.0.0.0/8 range, after all. I'm not going to propose LD_PRELOAD or similar hacks. :-)
Another idea would be to only search in /proc directories where we know that a) there's a container and, ideally, b) it's a network=host container. Maybe this would be feasible only if ec2metaproxy were a library, as in #5.
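To make the /proc lookup concrete, here is a rough Go sketch of what lsof-style matching might look like: find the row in /proc/net/tcp whose local address equals the request's source IP and port, take its socket inode, then scan the /proc/<pid>/fd symlinks for `socket:[inode]`. The names `tcpEntry`, `parseTCPLine`, `findInode` and `pidsForInode` are made up for illustration and are not part of ec2metaproxy. It assumes IPv4, a little-endian host, and enough privileges to read other processes' fd directories.

```go
package main

import (
	"bufio"
	"encoding/hex"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// tcpEntry holds the fields of one /proc/net/tcp row that we care about.
// (Hypothetical type, for illustration only.)
type tcpEntry struct {
	LocalIP   net.IP
	LocalPort uint16
	Inode     string
}

// parseTCPLine parses a single data row of /proc/net/tcp (IPv4 only).
func parseTCPLine(line string) (tcpEntry, error) {
	f := strings.Fields(line)
	if len(f) < 10 {
		return tcpEntry{}, fmt.Errorf("short line: %q", line)
	}
	addr := strings.Split(f[1], ":")
	if len(addr) != 2 {
		return tcpEntry{}, fmt.Errorf("bad local address: %q", f[1])
	}
	raw, err := hex.DecodeString(addr[0])
	if err != nil {
		return tcpEntry{}, err
	}
	if len(raw) != 4 {
		return tcpEntry{}, fmt.Errorf("not an IPv4 address: %q", addr[0])
	}
	port, err := strconv.ParseUint(addr[1], 16, 16)
	if err != nil {
		return tcpEntry{}, err
	}
	// Assumption: little-endian host, so the address bytes are reversed.
	ip := net.IPv4(raw[3], raw[2], raw[1], raw[0])
	return tcpEntry{LocalIP: ip, LocalPort: uint16(port), Inode: f[9]}, nil
}

// findInode returns the socket inode whose local address is ip:port.
func findInode(path string, ip net.IP, port uint16) (string, error) {
	file, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer file.Close()
	sc := bufio.NewScanner(file)
	sc.Scan() // skip the header row
	for sc.Scan() {
		e, err := parseTCPLine(sc.Text())
		if err != nil {
			continue
		}
		if e.LocalPort == port && e.LocalIP.Equal(ip) {
			return e.Inode, nil
		}
	}
	return "", fmt.Errorf("no socket with local address %s:%d", ip, port)
}

// pidsForInode walks every /proc/<pid>/fd symlink. This is the expensive part.
// A PID appears once per matching file descriptor.
func pidsForInode(inode string) []string {
	want := "socket:[" + inode + "]"
	fds, _ := filepath.Glob("/proc/[0-9]*/fd/*")
	var pids []string
	for _, fd := range fds {
		if target, err := os.Readlink(fd); err == nil && target == want {
			pids = append(pids, strings.Split(fd, "/")[2])
		}
	}
	return pids
}

func main() {
	// 127.0.0.1:8080 stands in for the source address of an incoming request.
	inode, err := findInode("/proc/net/tcp", net.IPv4(127, 0, 0, 1), 8080)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("inode", inode, "pids", pidsForInode(inode))
}
```

Mapping the PID to a container (for example via /proc/<pid>/cgroup) would still be a separate step, and restricting the glob to PIDs known to belong to network=host containers is where the second idea above would cut the cost.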
This is the major shortcoming of the proxy. There are two issues that need to be resolved to make it work. The first is what you mention, mapping the request source port to a container. The second issue is how to do it and still allow the proxy to connect to the real metadata service. Somehow you have to configure iptables to only re-route non-ec2metaproxy packets. Otherwise you get an infinite loop.
I welcome ideas on how to resolve this. I learned enough about iptables to write the current rules for the proxy, so I don't have a lot of expertise there. Maybe the metadata proxy can run with its own network bridge? That would make deployment a bit more complex.
The performance may or may not be a big issue. The AWS SDKs will cache the credentials until they expire, so you should only be paying the price once an hour for a container. I guess it depends on what the container is doing.
A few rough ideas:
- iptables has support for uid-owner and gid-owner (and, on older kernels only, cmd-owner). We could match the command name where that is still available and mark packets. Or perhaps the admin can run the proxy as its own user. Either way we might be able to whitelist that traffic.
- Bind to a port in a range outside the standard ephemeral range (e.g. 1 to 1023), then connect from that port to 169.254.169.254:80. A whitelist entry lets that range pass through. A bit more work, especially dealing with arcane socket stuff. See https://idea.popcount.org/2014-04-03-bind-before-connect/
- Run an additional bridge as you suggest
As to performance: the cloudprovider in Kubernetes fetches metadata fairly frequently. Even if it's not hitting the credential endpoint, it's still going to go through the proxy. It would be nice to expose stats on traffic levels (by endpoint, preferably, as well as roles, errors, etc.) so that administrators and developers can have a better idea of what's happening behind the scenes.