calico Add ability to set MTU of Calico OpenStack nets

Description

Add an annotation to the etcd datamodel used to describe WorkloadEndpoints for Calico OpenStack VMs that specifies the MTU of the OpenStack network.

This effectively adds non-default (>1500) MTU support to Calico OpenStack.

Related issues/PRs

none

Todos

[ ] Tests
[ ] Documentation
[ ] Release note

Release Note

networking-calico and thus calico-dhcp-agent now respect the MTU property of OpenStack networks. Upon requesting a DHCP lease, an option for interface-mtu is provided that corresponds to that of the OpenStack network in which the port resides.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

docs-pr-required: This change requires a change to the documentation that has not been completed yet.
docs-completed: This change has all necessary documentation completed.
docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

release-note-required: This PR has user-facing changes. Most PRs should have this label.
release-note-not-required: This PR has no user-facing changes.

Other optional labels:

cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Aug 02 '22 15:08 tj90241

There are some issues with this as-is:

MTU is a network trait in OpenStack, but we're annotating WorkloadEndpoints, which are really more like ports. But without creating a new datamodel resource type, I'm not sure there is anywhere better for this.
(partially due to the above), an MTU change for a network follows an eventually consistent model by way of the resource syncer. The resource syncer has to enumerate over all WorkloadEndpoints upon an MTU change, which can take upwards of an hour or so on a large production cluster. During this time, it's also possible that dnsmasq is reconfigured several times if there are any on_endpoint_set events or anything that triggers asynchronous action on WorkloadEndpoints -- because the MTU is captured in the WorkloadEndpoints and they may all be in different states... it can lead to ping-ponging the MTU back and forth until all the WorkloadEndpoints have the same MTU. I do not have a clever way to overcome this limitation as of now, except for maybe only changing the MTU once in some arbitrary window of time or something... which seems equally hacky.

EDIT: but what if the MTU is only allowed to change once per snapshot or compaction cycle? That should be pretty stable and play well with the resource synced.

Aug 02 '22 15:08 tj90241

There is a really nasty side effect with, at least, Cirros images here... the default route is not making it to the instance anymore if the instance's taps are at 1500 MTU and the OpenStack network is raised to 9000 MTU. You can reproduce this by e.g.:

Creating a network with 1500 MTU
Launching instances
Changing the network to 9000 MTU
Observing that dnsmasq restarts, the VMs are offered interface-mtu option of 9k
Instances still get a lease but lack a default route

One must also use this option with care - it's not possible to live-migrate VMs once the network MTU changes wrt how they are presently configured... libvirt/qemu abort out at the end of migration and fail to move the VM. One must hard reboot all VMs in the network to get them to observe a change.

Live Migration failure: unsupported configuration:
Target network card MTU 9000 does not match source 1500:
libvirt.libvirtError: unsupported configuration:
Target network card MTU 9000 does not match source 1500

Aug 02 '22 20:08 tj90241

I'm a bit baffled as I don't understand what changed, but the review comments addressed the issue with the instances no longer getting a default route in Cirros when changing the MTU on instances which were already built.

I also noticed that dhcp_agent was doing a try/except to import neutron_constants for different OpenStack releases, but this mechanism already existed in the compat module. So I removed the try/except and just imported constants from the compat module in the latest patch.

Aug 03 '22 22:08 tj90241

/sem-approve

Aug 05 '22 08:08 nelljerram

whoops, forgot to update the unit tests... will do that today.

Rather, I think the result of the changes here is that when an endpoint is created or changed in some other way, it will then pick up the current MTU for its network. Is that right?

Certainly, new ports created immediately after an MTU change work exactly as you say. Added a more elaborate example of my concern below.

I suppose it could be that Neutron sends an update for all ports to the ML2 driver, when a network property is changed, but I didn't think it did that,

It does not -- perhaps I didn't explain myself well, or there's some aspect of etcd that I'm missing here. The challenge is unique to Calico in this prospective:

Let's suppose we change the MTU of the OpenStack network that has a large number of pre-existing ports. As per above, any new ports created immediately after this change work just like we expect. However, the next time that the ResourceSyncer runs, it will observe that the WorkloadEndpoints for existing ports all change as a result of the MTU annotation itself being added or changed, and begin rewriting each WorkloadEndpoint serially, one at a time.

When there are thousands or more of ports/WorkloadEndpoints in the cluster, this operation takes a very long time to complete. During this time, I expect that e.g., if a calico-dhcp-agent is bounced, it would observe that the MTU of some WorkloadEndpoints is set to a stale value, and for others it is the "current" value. As such, I don't think the MTU value that we update dnsmasq with would be deterministic -- it's the last port that gets processed by calico-dhcp-agent when _update_dnsmasq is called which determines the MTU used -- this may be a stale value?

It's quite likely that during this scenario, the ResourceSyncer will soon stumble on another port which is on the same host and again cause dnsmasq to be reloaded, I think, though -- so it's not a terribly huge concern. And, of course eventually, the ResourceSyncer will enumerate over all WorkloadEndpoints. However, an operator may not be aware of the above and can be caught by surprise.

For us, this is perfectly acceptable.

If my thinking above is right: Yes, I think this should be documented somewhere. Additionally, in the future, adding some kind of resource to the datamodel that represents an OpenStack network where MTU could be captured would make this slightly prickly in that it would allow the ResourceSyncer to make an update to a single resource, which all calico-dhcp-agents could observe at the same time.

Secondly, the issue with live migrating when the network MTU has changed but the endpoint not yet updated. Do you think this is a point that you can live with?

Yes -- on this front, I'm working on a Nova patch that retains the MTU in the libvirt XML during migration instead of it adapting to whatever the network currently has. Hopefully upstream will accept it -- if not, we'll still make use of it, but others may stumble upon this.

Finally the issue with CirrOS - I suppose you'll monitor to check that it really has gone permanently away now. In any case it feels like something we could dig into further if needed.

Yes, I've begun putting this change on our test clusters and we intend to begin phasing this feature in imminently as well. I do not see it being a problem now and will be prepared to deal with it if it comes up.

Aug 05 '22 13:08 tj90241

@neiljerram - could you please have a look at the CI tests? Looking closer now, a lot of the errors seem unrelated...

On another note, I have some further updates:

I was able to reproduce the aforementioned Cirros issue again. It occurs under a unique set of circumstances that can only hold true if one makes code changes to nova (see next bullet)...
Upstream does not support live-migration of a libvirt-hosted instance if any source VIF has an MTU which does not match that of the Neutron network (or, if the MTU of the VIF would change during live migration). They have actually turned down the patch I pushed which would have allowed for the instance to still migrate with a MTU that does not match the Neutron network. We still have a way to do this internally I think, but it goes outside of the scope of the code changes here.
However, we probably need a release note that says one cannot live-migrate instances without hard rebooting them if and when the associated Neutron network MTU is changed. A future version of nova will hopefully have an explicit error on this, ref this WIP PS: https://review.opendev.org/c/openstack/nova/+/852367/1

Aug 09 '22 19:08 tj90241

Thanks @tj90241 for the above notes. Of course you are right about traversing all endpoints during a resync, and I'm sorry that I forgot about that!

It looks like there are two non-trivial next steps here:

I think we should look at passing MTU to the DHCP agent by implementing the network hooks in the ML2 driver and adding a network object to the datamodel. That would allow an MTU change to be signaled immediately to the agent, and without having to update all of the ports on the network.
As you've noted, there appear to be unrelated problems in the CI - perhaps we haven't fully pinned the OpenStack code that we are testing against?

I'd like to dig into these more, but it's a matter of scheduling, so please let's have a (parallel) conversation about that...

Aug 11 '22 09:08 nelljerram