flaky: FAIL TestE2E/OIDC http_route_with_oidc_authentication
ref https://github.com/envoyproxy/gateway/actions/runs/10002832453/job/27649851258#step:6:56104
cc @zhaohuabing
@zhaohuabing https://github.com/envoyproxy/gateway/actions/runs/10035889940/job/27733139881?pr=3925
another one, https://github.com/envoyproxy/gateway/actions/runs/10041716495/job/27750963358?pr=3929
This is now the number one flaky test.
Looks like it's an xDS sequencing issue: sometimes the listeners arrive before the clusters they reference.
We had this issue before and it was fixed in https://github.com/envoyproxy/go-control-plane/pull/752 and https://github.com/envoyproxy/go-control-plane/pull/801. However, it seems there are still some edge cases that have not been covered.
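For reference, a minimal sketch of how a control plane pushes both resource types through the go-control-plane snapshot cache (this is not Envoy Gateway's actual xDS server code; the function and variable names are illustrative). With ADS enabled on the cache, go-control-plane is expected to respond with clusters before listeners, which is the ordering those two PRs address:

```go
package main

import (
	"context"

	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// updateSnapshot writes clusters and listeners into one snapshot. With the ADS
// flag set on the cache, go-control-plane should send the CDS response before
// the LDS response, so the OAuth2 filter's token cluster exists by the time
// the listener referencing it is processed.
func updateSnapshot(ctx context.Context, c cachev3.SnapshotCache, nodeID, version string,
	clusters, listeners []types.Resource) error {
	snap, err := cachev3.NewSnapshot(version, map[resourcev3.Type][]types.Resource{
		resourcev3.ClusterType:  clusters,
		resourcev3.ListenerType: listeners,
	})
	if err != nil {
		return err
	}
	return c.SetSnapshot(ctx, nodeID, snap)
}

func main() {
	// ads=true ties all resource types to a single, ordered ADS stream.
	cache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)
	_ = updateSnapshot(context.Background(), cache, "example-node", "1", nil, nil)
}
```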
The cluster keycloak_gateway-conformance-infra_80 was received at "2024-09-30T03:38:17.438Z", while the listener gateway-conformance-infra/same-namespace/http with the OAuth2 filter was received at "2024-09-30T03:38:17.433Z":
{
  "version_info": "128f48574c7597a14171b86d0843a3b9e4260fa85903ac200b688f95b485cffc",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "keycloak_gateway-conformance-infra_80",
    .....
    "last_updated": "2024-09-30T03:38:17.438Z"
  },
  ....
  "error_state": {
    "failed_configuration": {
      "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
      "name": "gateway-conformance-infra/same-namespace/http",
      "last_update_attempt": "2024-09-30T03:38:17.433Z",
      "details": "OAuth2 filter: unknown cluster 'keycloak_gateway-conformance-infra_80' in config. Please specify which cluster to direct OAuth requests to."
Full log: https://productionresultssa12.blob.core.windows.net/actions-results/0755865f-f088-4383-a22d-ac6aff0f6840/workflow-job-run-5846ad87-d8c0-5b26-6b06-bd348efdf943/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-09-30T04%3A20%3A04Z&sig=ZvSAbYbTBuaOqRsHjjBABxDkcTjPt9pRlhuym7JmaAw%3D&ske=2024-09-30T15%3A31%3A14Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-09-30T03%3A31%3A14Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-08-04&sp=r&spr=https&sr=b&st=2024-09-30T04%3A09%3A59Z&sv=2024-08-04
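For anyone reproducing this, a rough sketch of how the timestamps above can be pulled out of a config_dump (the admin address/port is an assumption, and the nesting keys are taken from the excerpt above, so adjust as needed):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// resourceName digs the name out of a config_dump entry, using the nesting
// seen in the excerpt above ("cluster" and "failed_configuration" both wrap a
// "name" field).
func resourceName(entry map[string]any) string {
	if s, ok := entry["name"].(string); ok {
		return s
	}
	for _, key := range []string{"cluster", "listener", "failed_configuration"} {
		if child, ok := entry[key].(map[string]any); ok {
			if s, ok := child["name"].(string); ok {
				return s
			}
		}
	}
	return ""
}

// walk prints every last_updated / last_update_attempt timestamp next to the
// resource it belongs to, so the cluster and listener arrival order is easy
// to compare.
func walk(v any) {
	switch n := v.(type) {
	case map[string]any:
		for _, key := range []string{"last_updated", "last_update_attempt"} {
			if ts, ok := n[key].(string); ok {
				fmt.Printf("%-22s %-55s %s\n", key, resourceName(n), ts)
			}
		}
		for _, child := range n {
			walk(child)
		}
	case []any:
		for _, child := range n {
			walk(child)
		}
	}
}

func main() {
	// The admin port is an assumption; point this at wherever the Envoy
	// admin interface is exposed.
	resp, err := http.Get("http://localhost:19000/config_dump")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var dump map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&dump); err != nil {
		panic(err)
	}
	walk(dump)
}
```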
[2024-10-29 03:25:51.043][1][warning][config] [source/extensions/config_subscription/grpc/delta_subscription_state.cc:276] delta config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) gateway-conformance-infra/same-namespace/http: OAuth2 filter: unknown cluster 'keycloak_gateway-conformance-infra_80' in config. Please specify which cluster to direct OAuth requests to.
[2024-10-29 03:25:51.043][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138] gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) gateway-conformance-infra/same-namespace/http: OAuth2 filter: unknown cluster 'keycloak_gateway-conformance-infra_80' in config. Please specify which cluster to direct OAuth requests to.
LISTENER ACCESS LOG - 0
LISTENER ACCESS LOG - 0
[2024-10-29 03:26:53.838][1][warning][http] [source/common/http/async_client_impl.cc:256] the buffer size limit (64KB) for async client retries has been exceeded.
[2024-10-29 03:28:13.845][1][warning][http] [source/common/http/async_client_impl.cc:256] the buffer size limit (64KB) for async client retries has been exceeded.
[2024-10-29 03:30:53.849][1][warning][http] [source/common/http/async_client_impl.cc:256] the buffer size limit (64KB) for async client retries has been exceeded.
Yes, sometimes the cluster arrives after the listener configuration, which is why we get this error.
@zhaohuabing is this because we moved to the backendRef model and the resource may get reconciled later?
Should we skip creating the filter until the IR destination is populated?
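A hypothetical sketch of the guard suggested above (the types and field names here are made up for illustration and are not the actual Envoy Gateway IR): build the OIDC filter only once the destination it points at has endpoints; otherwise leave it off and let the next reconcile add it.

```go
package main

// Illustrative stand-ins for the IR types; not the real Envoy Gateway IR.
type Destination struct {
	Name      string
	Endpoints []string
}

type OIDCFilter struct {
	TokenCluster string
}

// buildOIDCFilter returns nil while the destination has not been populated,
// so the generated listener never references a cluster that does not exist.
func buildOIDCFilter(dest *Destination) *OIDCFilter {
	if dest == nil || len(dest.Endpoints) == 0 {
		// Destination not reconciled yet: skip the filter for this snapshot.
		return nil
	}
	return &OIDCFilter{TokenCluster: dest.Name}
}

func main() {
	// No endpoints yet, so the filter is skipped; a later reconcile with
	// endpoints present would attach it.
	_ = buildOIDCFilter(&Destination{Name: "keycloak_gateway-conformance-infra_80"})
}
```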
@arkodg I remember this flakiness existed before the backendRef model.
I did some debugging in go-control-plane and found that both the clusters and listeners were actually in the xDS cache when this error happened, but sometimes the ACK of the previous cluster response came later, so the clusters were not sent to the client while the listeners were. It looks like a bug in the go-control-plane delta xDS implementation.
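For readers unfamiliar with the delta protocol, the ACK referred to here is simply the next DeltaDiscoveryRequest carrying the nonce of the response being acknowledged; per the debugging above, the server held back the next CDS response until that ACK arrived, while the LDS response went out. A minimal sketch of what such an ACK looks like (field values are illustrative):

```go
package main

import (
	"fmt"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

func main() {
	// The ACK for a delta CDS response: same type URL, ResponseNonce set to
	// the nonce of the response being acknowledged, and no error detail.
	ack := &discoveryv3.DeltaDiscoveryRequest{
		Node:          &corev3.Node{Id: "example-node"},
		TypeUrl:       "type.googleapis.com/envoy.config.cluster.v3.Cluster",
		ResponseNonce: "1", // nonce from the previous DeltaDiscoveryResponse
	}
	fmt.Println(ack.GetTypeUrl(), ack.GetResponseNonce())
}
```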
Found an issue in go-control-plane: https://github.com/envoyproxy/go-control-plane/issues/448
@zhaohuabing can you share the full configuration of the keycloak_gateway-conformance-infra_80 cluster? (The full log link shared above requires permission, and I cannot open it.)
Ordering of ADS resources (https://github.com/envoyproxy/go-control-plane/pull/752) does not guarantee that a cluster is warmed up before a listener is if warming up the cluster requires asynchronous operations (e.g. resolving DNS names, or fetching EDS). Consequently a listener may be initialized before the cluster is ready even when go-control-plane sends the listener configuration after the cluster configuration.
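For context, the clusters that need asynchronous warm-up are DNS- and EDS-based ones. Below is a minimal go-control-plane sketch of a STRICT_DNS cluster like the one the OAuth2 filter points at (the cluster Envoy Gateway actually generates may be EDS-based; the address here is illustrative):

```go
package main

import (
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

// keycloakCluster builds a STRICT_DNS cluster. Envoy keeps such a cluster in
// "warming" until the DNS lookup (or, for an EDS cluster, the first EDS
// response) completes, so even when CDS is delivered before LDS the cluster
// may not yet be ready when the listener referencing it is processed.
func keycloakCluster() *clusterv3.Cluster {
	return &clusterv3.Cluster{
		Name:                 "keycloak_gateway-conformance-infra_80",
		ConnectTimeout:       durationpb.New(10 * time.Second),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_STRICT_DNS},
		LoadAssignment: &endpointv3.ClusterLoadAssignment{
			ClusterName: "keycloak_gateway-conformance-infra_80",
			Endpoints: []*endpointv3.LocalityLbEndpoints{{
				LbEndpoints: []*endpointv3.LbEndpoint{{
					HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
						Endpoint: &endpointv3.Endpoint{
							Address: &corev3.Address{
								Address: &corev3.Address_SocketAddress{
									SocketAddress: &corev3.SocketAddress{
										Address:       "keycloak.gateway-conformance-infra.svc",
										PortSpecifier: &corev3.SocketAddress_PortValue{PortValue: 80},
									},
								},
							},
						},
					},
				}},
			}},
		},
	}
}

func main() {
	_ = keycloakCluster()
}
```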
@jparklab the log has been deleted. There is a similar issue in JWT auth: https://github.com/envoyproxy/gateway/issues/4791
> Ordering of ADS resources (https://github.com/envoyproxy/go-control-plane/pull/752) does not guarantee that a cluster is warmed up before a listener is if warming up the cluster requires asynchronous operations (e.g. resolving DNS names, or fetching EDS). Consequently a listener may be initialized before the cluster is ready even when go-control-plane sends the listener configuration after the cluster configuration.
It sounds like an Envoy bug to me: if the cluster warm-up is asynchronous, then Envoy should not report errors and reject listeners that reference a not-yet-ready cluster when it receives the xDS listener configuration; instead, it should check for the cluster when a request is sent.
Hey @zhaohuabing, can we close this, since it was fixed with https://github.com/envoyproxy/envoy/issues/40735? Thanks for fixing this upstream!