Occasional flakiness of routerinterface-create-minimal on caracal
We're occasionally seeing flakiness in the kuttl test routerinterface-create-minimal on caracal.
I've attached the logs from a failing CI run that exhibits the problem.
The immediate cause is that we fail to delete the Subnet because it still has a port:
2025-04-29T13:53:23Z ERROR Reconciler error {"controller": "subnet", "controllerGroup": "openstack.k-orc.cloud", "controllerKind": "Subnet", "Subnet": {"name":"subnet-create-full-v6-gateway","namespace":"kuttl-test-internal-sawfly"}, "namespace": "kuttl-test-internal-sawfly", "name": "subnet-create-full-v6-gateway", "reconcileID": "f29215d6-9c49-4a73-832a-167b43b36e97", "error": "Expected HTTP response code [202 204] when accessing [DELETE http://10.1.0.25:9696/networking/v2.0/subnets/bf454990-cc17-4c3f-b3a4-cdf1a145d626], but got 409 instead: {\"NeutronError\": {\"type\": \"SubnetInUse\", \"message\": \"Unable to complete operation on subnet bf454990-cc17-4c3f-b3a4-cdf1a145d626: One or more ports have an IP allocation from this subnet.\", \"detail\": \"\"}}\nExpected HTTP response code [202 204] when accessing [DELETE http://10.1.0.25:9696/networking/v2.0/subnets/bf454990-cc17-4c3f-b3a4-cdf1a145d626], but got 409 instead: {\"NeutronError\": {\"type\": \"SubnetInUse\", \"message\": \"Unable to complete operation on subnet bf454990-cc17-4c3f-b3a4-cdf1a145d626: One or more ports have an IP allocation from this subnet.\", \"detail\": \"\"}}"}
Looking at the devstack logs (contained in the journal above), we can see that this port is the router interface port for this subnet. We can also see (below) that we did remove this interface, which should have deleted the port, but only after ORC had already hit the error above while attempting to delete the Subnet.
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: DEBUG neutron.tests.unit.plugins.ml2.drivers.mechanism_logger [req-fc5ff0e6-afac-4247-9866-0e38ec06231a req-76ad446f-7f80-4914-80ce-fa97bb381454 demo demo] delete_port_postcommit:
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[network_id], 0b9467ff-9ddc-4423-8504-235786c4ee96 -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[tenant_id], e1ba7fbf5dce4d9396e5f7b888e668f4 -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[revision_number], 2 -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[device_owner], network:router_interface -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[status], ACTIVE -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[allowed_address_pairs], [] -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[description], -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[updated_at], 2025-04-29T13:54:02Z -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[binding:profile], {} -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[port_security_enabled], False -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[name], -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[tags], [] -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[admin_state_up], True -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[device_id], c5e3931c-f257-428d-bf6e-f982772ee95f -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[project_id], e1ba7fbf5dce4d9396e5f7b888e668f4 -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[binding:vif_details], {} -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[fixed_ips], [{'subnet_id': 'bb4d9946-6045-4b23-9647-674b6cd13da0', 'ip_address': '192.168.155.1'}] -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[binding:vif_type], unbound -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[security_groups], [] -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[created_at], 2025-04-29T13:53:59Z -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[id], 339826e2-510a-4f10-a85d-4a36049da6c4 -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[binding:host_id], -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[extra_dhcp_opts], [] -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[binding:vnic_type], normal -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[mac_address], fa:16:3e:e4:fa:71 -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: key[standard_attr_id], 144 -> None
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: {{(pid=60603) _log_diff_call /opt/stack/neutron/neutron/tests/unit/plugins/ml2/drivers/mechanism_logger.py:59}}
Apr 29 13:54:12 fv-az1780-303 neutron-server[60603]: DEBUG neutron.api.rpc.handlers.resources_rpc [req-d683bc4e-4890-4273-b8be-0f20cc9436a1 req-092efd87-2bbe-4b3f-8988-fcce68225817 service neutron] Pushing event updated for resources: {'Port': ['ID=bcb47cb5-9694-4989-a5e9-c7d952fc22ea,revision_number=6']} {{(pid=60603) push /opt/stack/neutron/neutron/api/rpc/handlers/resources_rpc.py:237}}
...
Apr 29 13:54:13 fv-az1780-303 neutron-server[60603]: INFO neutron.wsgi [req-fc5ff0e6-afac-4247-9866-0e38ec06231a req-76ad446f-7f80-4914-80ce-fa97bb381454 demo demo] 172.18.0.2 "PUT /networking/v2.0/routers/c5e3931c-f257-428d-bf6e-f982772ee95f/remove_router_interface HTTP/1.1" status: 200 len: 484 time: 4.4341605
This could be a dependency bug in the Subnet controller.
In combination with https://github.com/k-orc/openstack-resource-controller/pull/375, if this happens again we should be able to tie the deletion of the RouterInterface in OpenStack to a reconcile id in ORC. This might give a further clue about what's going on here.
https://github.com/k-orc/openstack-resource-controller/pull/379 may also be useful.
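Independent of the root cause, one possible mitigation on the Subnet side would be to treat Neutron's 409 SubnetInUse response as a transient ordering problem and requeue, rather than surfacing it as a reconcile error. Below is a minimal sketch of that idea, not ORC's actual code: `neutronClient`, `DeleteSubnet` and `isSubnetInUse` are placeholder names for illustration only.

```go
package subnet

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// neutronClient is a stand-in for whatever client ORC uses to talk to Neutron.
type neutronClient interface {
	DeleteSubnet(ctx context.Context, id string) error
}

// isSubnetInUse is a hypothetical helper that would inspect the OpenStack error
// for an HTTP 409 carrying Neutron's "SubnetInUse" error type.
func isSubnetInUse(err error) bool {
	// Implementation depends on the SDK's error types; omitted here.
	return false
}

type reconciler struct {
	neutron neutronClient
}

// deleteSubnet sketches the idea: a SubnetInUse conflict means a port (here, the
// router interface port) still holds an allocation from this subnet, so requeue
// and wait for the RouterInterface controller to remove it instead of erroring.
func (r *reconciler) deleteSubnet(ctx context.Context, id string) (ctrl.Result, error) {
	err := r.neutron.DeleteSubnet(ctx, id)
	switch {
	case err == nil:
		return ctrl.Result{}, nil
	case isSubnetInUse(err):
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	default:
		return ctrl.Result{}, err
	}
}
```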
Failed on dalmatian in https://github.com/k-orc/openstack-resource-controller/actions/runs/14793143091/job/41534425329?pr=377. I'll take a look at the failure next week, unless someone beats me to it.
I've looked at the failure, and I think the OpenStack logs reported above are unrelated. We got a conflict when deleting the subnet due to leftover router ports on it, because we have a transient dependency on the external subnet that is not represented in ORC. This is a consequence, not the cause.
Taking a look at this failed run as an example, we have:
- the routerinterface waiting for an OpenStack resource to be deleted
- the imported router waiting for routerinterface finalizer to be removed
- the imported subnet waiting for routerinterface finalizer to be removed
The logs show the following events:
2025-05-02T16:48:17Z LEVEL(-3) Reconciling router interface delete {"controller": "router_interface", "controllerGroup": "openstack.k-orc.cloud", "controllerKind": "Router", "Router": {"name":"routerinterface-router-import","namespace":"kuttl-test-supreme-gar"}, "namespace": "kuttl-test-supreme-gar", "name": "routerinterface-router-import", "reconcileID": "4295542f-5f46-431b-854b-ac0664426788", "name": "routerinterface-create-minimal"}
2025-05-02T16:48:17Z LEVEL(-4) Will delete router interface {"controller": "router_interface", "controllerGroup": "openstack.k-orc.cloud", "controllerKind": "Router", "Router": {"name":"routerinterface-router-import","namespace":"kuttl-test-supreme-gar"}, "namespace": "kuttl-test-supreme-gar", "name": "routerinterface-router-import", "reconcileID": "4295542f-5f46-431b-854b-ac0664426788", "name": "routerinterface-create-minimal", "subnetID": "0fe85f50-250e-4187-ad4a-791220fa8d09", "portID": "77bfb537-582b-4309-9a01-35cb2a700b8b"}
2025-05-02T16:48:17Z LEVEL(-4) Deleting router interface {"controller": "router_interface", "controllerGroup": "openstack.k-orc.cloud", "controllerKind": "Router", "Router": {"name":"routerinterface-router-import","namespace":"kuttl-test-supreme-gar"}, "namespace": "kuttl-test-supreme-gar", "name": "routerinterface-router-import", "reconcileID": "4295542f-5f46-431b-854b-ac0664426788", "name": "routerinterface-create-minimal"}
[...]
2025-05-02T16:48:19Z LEVEL(-5) Reconcile done, requeueing after 1s {"controller": "router_interface", "controllerGroup": "openstack.k-orc.cloud", "controllerKind": "Router", "Router": {"name":"routerinterface-router-import","namespace":"kuttl-test-supreme-gar"}, "namespace": "kuttl-test-supreme-gar", "name": "routerinterface-router-import", "reconcileID": "4295542f-5f46-431b-854b-ac0664426788"}
[...]
Then on the next reconcile:
2025-05-02T16:48:20Z LEVEL(-3) Not reconciling interfaces for not-Available router {"controller": "router_interface", "controllerGroup": "openstack.k-orc.cloud", "controllerKind": "Router", "Router": {"name":"routerinterface-router-import","namespace":"kuttl-test-supreme-gar"}, "namespace": "kuttl-test-supreme-gar", "name": "routerinterface-router-import", "reconcileID": "39d6c48b-90e5-4b7e-8bd4-a689a15a2979"}
And then we're stuck, as it won't reconcile again. We're hitting a race condition where the OpenStack router backing the imported Router (on which we still held a finalizer) was deleted between router interface reconciles. We can work around it in our tests by manually deleting the created router interface before letting kuttl mass-delete the namespace.
That said, the whole test is weird and I'm not sure it makes sense at all. Should ORC really create router interfaces for unmanaged resources? It feels wrong.
Oh, nice sleuthing.
I think we do need the ability to add an interface to an unmanaged router. This handles the case where we are adding a managed subnet to a common, unmanaged router. This exact situation is why somebody contributed support for unmanaged routers in CAPO, for example.
I think this might be an opportunity for robustification. I wonder if there's any way the RouterInterface controller can identify that the Router is not available because it's gone, rather than because it isn't ready yet. Perhaps we're being too aggressive in filtering RouterInterface reconciles?
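For illustration, a sketch of what that robustification could look like: when tearing down a router interface, a "not found" from Neutron (router or interface already gone) could be treated as success so the finalizer can still be removed, instead of blocking on a Router that will never become Available again. Names here (`RemoveRouterInterface`, `isNotFound`) are placeholders, not ORC's actual code.

```go
package routerinterface

import "context"

// neutronClient is a stand-in for the client used to call Neutron's
// remove_router_interface API.
type neutronClient interface {
	RemoveRouterInterface(ctx context.Context, routerID, subnetID string) error
}

// isNotFound is a hypothetical helper that would check for an HTTP 404,
// i.e. the router (or the interface) no longer exists in OpenStack.
func isNotFound(err error) bool {
	// Implementation depends on the SDK's error types; omitted here.
	return false
}

type reconciler struct {
	neutron neutronClient
}

// removeInterface sketches the idea: during deletion, a missing router means
// there is nothing left to detach, so treat it as success and allow the
// finalizer to be removed rather than waiting for the Router to become
// Available, which it never will if it was deleted out of band.
func (r *reconciler) removeInterface(ctx context.Context, routerID, subnetID string) error {
	err := r.neutron.RemoveRouterInterface(ctx, routerID, subnetID)
	if err != nil && !isNotFound(err) {
		return err
	}
	return nil
}
```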
@mdbooth I think https://github.com/k-orc/openstack-resource-controller/pull/423 will fix the deadlock situation. PTAL when you get a chance.