
Inconsistency between ENI allocated IPs and OS configuration

Open · lbernail opened this issue 6 years ago · 16 comments

We are seeing an issue that seems to happen regularly: some pods have no network connectivity.

After looking into the configuration, it turns out that when this happens we are in the following situation:

  • pod sandbox configured properly (veth and ipvlan interfaces, as well as the proper routing configuration)
  • IP of the pod not associated with the ENI, so its traffic is dropped by the VPC (see the check sketched below)
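
For reference, a quick way to confirm the mismatch is to compare the pod's address against the ENI's assigned private IPs. A sketch, assuming the aws CLI is available on the node and with $POD_PID and $ENI_ID as hand-filled placeholders:

# Inside the pod's network namespace: the ipvlan interface still carries the IP
nsenter --net=/proc/$POD_PID/ns/net ip -4 addr show

# On the AWS side: list the private IPs actually assigned to the ENI
aws ec2 describe-network-interfaces \
    --network-interface-ids $ENI_ID \
    --query 'NetworkInterfaces[].PrivateIpAddresses[].PrivateIpAddress'

# If the pod's IP shows up in the first listing but not the second,
# the VPC silently drops the pod's traffic.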

After looking into the logs we found the following:

  • CloudTrail shows a call to unassociate the IP address from the ENI (which seems to indicate that the CNI plugin was called with DELETE), but the routes and iptables rules are still there
  • the sandbox itself is not deleted. We found some errors in the kubelet logs, though we're not sure if this is related:
failed to remove pod init container "consul-template": failed to get container status "371295090acf33795fe5badb07063021cace4fcff719cd13effc6ff2b5136f70": rpc error: code = Unknown desc = Error: No such container: 371295090acf33795fe5badb07063021cace4fcff719cd13effc6ff2b5136f70; Skipping pod "alerting-metric-evaluator-anomaly-0_datadog(4c15f7d2-5783-11e8-903a-02fc6d7aa9b8)"
  • kubelet tries to restart containers in the same sandbox, which fails because the pod has no network connectivity (which the init container requires)

Any idea what could trigger this situation? Our current setup uses Docker, kubelet 1.10, and the latest version of the CNI plugin.

I think skipDeallocation could probably help, but I'd like to understand exactly what is happening.

I also wonder if more verbose logging could help in this kind of situation (for instance, logging ADD/DELETE calls with their parameters).
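
One way to get that visibility today, without patching the plugin, is to shim the plugin binary with a wrapper script that logs the standard CNI environment variables before exec'ing the real binary. A hypothetical sketch, assuming the real plugin has been moved aside to cni-ipvlan-vpc-k8s-main.real:

#!/bin/sh
# Hypothetical debugging shim installed as /opt/cni/bin/cni-ipvlan-vpc-k8s-main.
# CNI_COMMAND (ADD/DEL), CNI_CONTAINERID, CNI_NETNS and CNI_ARGS are set by the
# runtime per the CNI spec; stdin/stdout pass through to the real plugin untouched.
echo "$(date -Is) cmd=$CNI_COMMAND id=$CNI_CONTAINERID netns=$CNI_NETNS args=$CNI_ARGS" \
    >> /var/log/cni-calls.log
exec /opt/cni/bin/cni-ipvlan-vpc-k8s-main.real "$@"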

lbernail avatar May 15 '18 13:05 lbernail

Just to confirm, since you're not using skipDeallocation, I assume you are also not using the "GC" mechanism?

I think we hit a case of this ourselves during a mass scale-up this week, and I agree more logging around the actual plugin invocations would be beneficial here.

theatrus avatar May 15 '18 14:05 theatrus

I've been seeing the same behavior. I haven't checked the CloudTrail logs (for some reason, that idea hadn't occurred to me). This happens quite frequently for us, though: for every 30-40 pods we launch, 2-3 of them get stuck in this state. @theatrus - are you able to share the configuration (CNI config, OS, anything about the number of subnets, etc.) that you run this in successfully?

gdearment avatar May 15 '18 14:05 gdearment

We run with skipDeallocation set to true and the GC stack cleaning up unused IPs every 5 minutes with a 15-minute expiration.

@paulnivin identified some race conditions in free-IP detection (a PR is in the works), which can be exacerbated by not using skipDeallocation. When IPAM allocates a new IP address, it's possible for a second IPAM run to grab that same address, which means both runs of the ipvlan plugin attempt to use it. Only one will succeed, and that same IP address will then be released by the teardown of the failed run.

Using skipDeallocation avoids the IP being freed back to AWS immediately (it is instead marked as free in the file "registry"), which should sidestep the issue until the racing patch can be added (that patch also uses the registry for this reason).
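
For anyone following along, a minimal sketch of where the flag goes, assuming it sits in the ipam block of the CNI config alongside the plugin's other IPAM options (subnet tags, security groups, etc. omitted here; see the README for the full layout):

"ipam": {
    "type": "cni-ipvlan-vpc-k8s-ipam",
    "skipDeallocation": true
}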

theatrus avatar May 15 '18 14:05 theatrus

I don't think the GC feature is documented in the README. Would it be possible for you to add it?

gdearment avatar May 15 '18 14:05 gdearment

@theatrus Yes, we are not using the GC mechanism (I have only seen references to it in the code and in the CLI tool). We'll try with skipDeallocation set to true.

Just to understand the race condition better: when the CNI plugin runs at the same time for two pods and both grab the same address, one will fail (that makes sense). What triggers the teardown? An explicit DELETE from the runtime on the failed sandbox?

lbernail avatar May 15 '18 14:05 lbernail

@lbernail That is correct: the ipvlan plugin fails with an "address already in use" error, and the sandbox is then torn down. Since both sandboxes "have" the same IP, the IP is released from AWS even though it's actually in use elsewhere.

@gdearment Docs will be added shortly; I forgot to follow up when adding the initial support here. The quick summary: in order to free IPs, you need to run the -tool binary from a systemd timer (or cron). We run it every 5 minutes with the following unit:

[Unit]
Description=Garbage collect IPs unused for 15 minutes

[Service]
Type=oneshot
ExecStart=/usr/local/bin/cni-ipvlan-vpc-k8s-tool registry-gc --free-after=15m
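
Since the service is Type=oneshot, it needs a companion timer unit to actually fire every 5 minutes. A sketch, assuming the pair is named cni-ipvlan-vpc-k8s-gc.service / cni-ipvlan-vpc-k8s-gc.timer and enabled with systemctl enable --now cni-ipvlan-vpc-k8s-gc.timer:

[Unit]
Description=Run registry GC every 5 minutes

[Timer]
# fire 5 minutes after boot, then 5 minutes after each run
OnBootSec=5min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target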

Note that a race condition is still possible here, but it should not lead to a fatal condition and can be safely retried. We're also tracking a third, much rarer variant, where GC releases an in-use IP, but we have only seen it happen once under stress testing.

theatrus avatar May 15 '18 14:05 theatrus

@theatrus Is there any risk in running with skipDeallocation and no GC? Since IP addresses will be marked as free in the local registry it should not be an issue, right? They will still be reused (the only limit being that we will allocate more private IPs on the ENI than needed).

lbernail avatar May 15 '18 16:05 lbernail

No issue, except that IPs are never returned to Amazon until the node terminates.
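
If you want to keep an eye on how many addresses pile up per ENI in that mode, a quick check with the aws CLI (the instance ID is a placeholder to fill in):

aws ec2 describe-network-interfaces \
    --filters Name=attachment.instance-id,Values=$INSTANCE_ID \
    --query 'NetworkInterfaces[].{eni: NetworkInterfaceId, assignedIps: length(PrivateIpAddresses)}' \
    --output table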

theatrus avatar May 15 '18 17:05 theatrus

Thanks!

Should we have seen the "address already in use" error somewhere? I found nothing in the kubelet logs.

lbernail avatar May 15 '18 17:05 lbernail

OK, running with skipDeallocation=true has really helped (no issues related to this all week). I wonder if we could default this flag to true until the race condition has been fixed. What do you think?

lbernail avatar May 20 '18 08:05 lbernail

Hey @theatrus or @paulnivin - do you have any details or a PR to solve the race? We are running with skipDeallocation but we still see periodic errors. They resolve automatically, but they create event noise that complicates monitoring. 😞

roboll avatar Jun 18 '18 19:06 roboll

We're shipping an RC later this week that I'm hopeful will address the issue you've been hitting -- it's part of a refactor in conjunction with our move to k8s 1.10. Will keep you updated.

paulnivin avatar Jun 19 '18 18:06 paulnivin

That's great news, thanks!

lbernail avatar Jun 19 '18 20:06 lbernail

@lbernail Now that https://github.com/lyft/cni-ipvlan-vpc-k8s/pull/47 has been merged, can you give master a spin and see if this resolves the race you've been seeing?

paulnivin avatar Jun 22 '18 22:06 paulnivin

@paulnivin Sure. I'll just need to rebase our nodeport fix and update our CNI configuration to move to chained execution. I'll probably have this deployed on a test cluster Monday; I'll let you know how it goes.
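
For reference, "chained execution" here means moving from a standalone plugin config to a .conflist whose plugins run in sequence. A sketch of the shape, based on the README's example (interface names, tags, and security group IDs are placeholders):

{
    "cniVersion": "0.3.1",
    "name": "cni-ipvlan-vpc-k8s",
    "plugins": [
        {
            "cniVersion": "0.3.1",
            "type": "cni-ipvlan-vpc-k8s-main",
            "mode": "l2",
            "master": "eth0",
            "ipam": {
                "type": "cni-ipvlan-vpc-k8s-ipam",
                "interfaceIndex": 1,
                "subnetTags": { "kubernetes_kubelet": "true" },
                "secGroupIds": [ "sg-0123456789abcdef0" ]
            }
        },
        {
            "cniVersion": "0.3.1",
            "type": "cni-ipvlan-vpc-k8s-unnumbered-ptp",
            "hostInterface": "eth0",
            "containerInterface": "veth0",
            "ipMasq": true
        }
    ]
}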

lbernail avatar Jun 23 '18 09:06 lbernail

Initial testing looks good; we are going to deploy to a larger cluster.

lbernail avatar Jun 25 '18 14:06 lbernail