flaky: FAIL TestE2E/OIDC http_route_with_oidc_authentication

Open shawnh2 opened this issue 1 year ago • 4 comments

ref https://github.com/envoyproxy/gateway/actions/runs/10002832453/job/27649851258#step:6:56104

shawnh2 avatar Jul 19 '24 06:07 shawnh2

cc @zhaohuabing

shawnh2 avatar Jul 20 '24 06:07 shawnh2

@zhaohuabing https://github.com/envoyproxy/gateway/actions/runs/10035889940/job/27733139881?pr=3925

zirain avatar Jul 22 '24 09:07 zirain

another one, https://github.com/envoyproxy/gateway/actions/runs/10041716495/job/27750963358?pr=3929

zirain avatar Jul 22 '24 14:07 zirain

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Aug 21 '24 16:08 github-actions[bot]

This is now the top flaky test.

zirain avatar Sep 13 '24 00:09 zirain

Looks like it's an xDS sequencing issue. Sometimes the listeners arrive before the clusters they reference.

We had this issue before, and it was fixed in https://github.com/envoyproxy/go-control-plane/pull/752 and https://github.com/envoyproxy/go-control-plane/pull/801. However, it seems there are still some edge cases that are not covered.
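For illustration, here is a minimal Go sketch of the ordering those fixes aim for: resources that others depend on (clusters, then endpoints) go out before listeners and routes. The type URLs are the standard xDS v3 ones; `sendPriority` and `orderForSend` are hypothetical names for this sketch and are not go-control-plane APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// Type URLs for the four core xDS v3 resource types.
const (
	clusterType  = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
	endpointType = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
	listenerType = "type.googleapis.com/envoy.config.listener.v3.Listener"
	routeType    = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
)

// sendPriority is a hypothetical table: lower values are sent first, so a
// listener is never sent ahead of a cluster queued in the same batch.
var sendPriority = map[string]int{
	clusterType:  0,
	endpointType: 1,
	listenerType: 2,
	routeType:    3,
}

// orderForSend returns a sorted copy of typeURLs with dependencies first.
// Unknown type URLs sort after all known ones.
func orderForSend(typeURLs []string) []string {
	out := append([]string(nil), typeURLs...)
	rank := func(u string) int {
		if p, ok := sendPriority[u]; ok {
			return p
		}
		return len(sendPriority)
	}
	sort.SliceStable(out, func(i, j int) bool { return rank(out[i]) < rank(out[j]) })
	return out
}

func main() {
	pending := []string{listenerType, routeType, clusterType, endpointType}
	for _, u := range orderForSend(pending) {
		fmt.Println(u)
	}
}
```

Sorting a single batch like this only helps when both resource types are ready to be sent at the same moment; it says nothing about responses that are sent at different times.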

The cluster keycloak_gateway-conformance-infra_80 was received at "2024-09-30T03:38:17.438Z", while the listener gateway-conformance-infra/same-namespace/http with the OAuth2 filter was received at "2024-09-30T03:38:17.433Z":

{
  "version_info": "128f48574c7597a14171b86d0843a3b9e4260fa85903ac200b688f95b485cffc",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "keycloak_gateway-conformance-infra_80",
    .....
  "last_updated": "2024-09-30T03:38:17.438Z"
},

....
"error_state": {
  "failed_configuration": {
    "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
    "name": "gateway-conformance-infra/same-namespace/http",
  "last_update_attempt": "2024-09-30T03:38:17.433Z",
  "details": "OAuth2 filter: unknown cluster 'keycloak_gateway-conformance-infra_80' in config. Please specify which cluster to direct OAuth requests to."

Full log: https://productionresultssa12.blob.core.windows.net/actions-results/0755865f-f088-4383-a22d-ac6aff0f6840/workflow-job-run-5846ad87-d8c0-5b26-6b06-bd348efdf943/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-09-30T04%3A20%3A04Z&sig=ZvSAbYbTBuaOqRsHjjBABxDkcTjPt9pRlhuym7JmaAw%3D&ske=2024-09-30T15%3A31%3A14Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-09-30T03%3A31%3A14Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-08-04&sp=r&spr=https&sr=b&st=2024-09-30T04%3A09%3A59Z&sv=2024-08-04
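As a quick check of the two timestamps quoted above, the following Go sketch parses them and reports how far ahead of the cluster update the listener update was attempted. `listenerLead` is a hypothetical helper written for this comparison only; the timestamp strings are copied from the config dump excerpt.

```go
package main

import (
	"fmt"
	"time"
)

// listenerLead reports how long before the cluster's last_updated time the
// listener's last_update_attempt happened. A positive result means the
// listener was processed first.
func listenerLead(clusterUpdated, listenerAttempt string) (time.Duration, error) {
	c, err := time.Parse(time.RFC3339Nano, clusterUpdated)
	if err != nil {
		return 0, fmt.Errorf("parse cluster timestamp: %w", err)
	}
	l, err := time.Parse(time.RFC3339Nano, listenerAttempt)
	if err != nil {
		return 0, fmt.Errorf("parse listener timestamp: %w", err)
	}
	return c.Sub(l), nil
}

func main() {
	lead, err := listenerLead("2024-09-30T03:38:17.438Z", "2024-09-30T03:38:17.433Z")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("listener attempted %v before the cluster update\n", lead)
}
```

With these inputs the listener attempt precedes the cluster update by 5 ms, which is consistent with the "unknown cluster" rejection.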

zhaohuabing avatar Sep 30 '24 04:09 zhaohuabing

2024-10-29T03:31:25.1473630Z         [2024-10-29 03:25:51.043][1][warning][config] [source/extensions/config_subscription/grpc/delta_subscription_state.cc:276] delta config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) gateway-conformance-infra/same-namespace/http: OAuth2 filter: unknown cluster 'keycloak_gateway-conformance-infra_80' in config. Please specify which cluster to direct OAuth requests to.
2024-10-29T03:31:25.1473940Z         
2024-10-29T03:31:25.1478408Z         [2024-10-29 03:25:51.043][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138] gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) gateway-conformance-infra/same-namespace/http: OAuth2 filter: unknown cluster 'keycloak_gateway-conformance-infra_80' in config. Please specify which cluster to direct OAuth requests to.
2024-10-29T03:31:25.1478528Z         
2024-10-29T03:31:25.1478738Z         LISTENER ACCESS LOG - 0
2024-10-29T03:31:25.1478970Z         LISTENER ACCESS LOG - 0
2024-10-29T03:31:25.1480113Z         [2024-10-29 03:26:53.838][1][warning][http] [source/common/http/async_client_impl.cc:256] the buffer size limit (64KB) for async client retries has been exceeded.
2024-10-29T03:31:25.1481467Z         [2024-10-29 03:28:13.845][1][warning][http] [source/common/http/async_client_impl.cc:256] the buffer size limit (64KB) for async client retries has been exceeded.
2024-10-29T03:31:25.1483395Z         [2024-10-29 03:30:53.849][1][warning][http] [source/common/http/async_client_impl.cc:256] the buffer size limit (64KB) for async client retries has been exceeded.

zirain avatar Oct 29 '24 03:10 zirain

Yes, sometimes the cluster arrives after the listener configuration, which is why we get this error.

zhaohuabing avatar Oct 29 '24 04:10 zhaohuabing

@zhaohuabing is this because we moved to the backendRef model and the resource may get reconciled later? Should we skip creating the filter until the IR destination is populated?

arkodg avatar Oct 29 '24 18:10 arkodg

@arkodg I remember this flake existed before the backendRef model. I did some debugging in go-control-plane and found that both the clusters and the listeners were actually in the xDS cache when this error happened, but sometimes the ACK of the previous cluster response arrived late, so the clusters were not sent to the client while the listeners were. Looks like a bug in the go-control-plane delta xDS implementation.
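The race described above can be illustrated with a toy model in Go. This is a deliberately simplified sketch and not go-control-plane code: `toyServer`, `flush`, and `ack` are hypothetical names, and the assumed rule is only that a resource type with an un-ACKed response cannot receive another response yet.

```go
package main

import "fmt"

// toyServer is a simplified, hypothetical model of a per-type xDS stream.
// It is NOT how go-control-plane is implemented.
type toyServer struct {
	pending     map[string]bool // type -> an unsent update is in the cache
	awaitingAck map[string]bool // type -> previous response not yet ACKed
	sent        []string        // order in which updates reached the client
}

func newToyServer() *toyServer {
	return &toyServer{
		pending:     map[string]bool{},
		awaitingAck: map[string]bool{},
	}
}

// flush sends every pending type whose previous response was ACKed,
// always trying clusters before listeners.
func (s *toyServer) flush() {
	for _, t := range []string{"cluster", "listener"} {
		if s.pending[t] && !s.awaitingAck[t] {
			s.sent = append(s.sent, t)
			s.pending[t] = false
			s.awaitingAck[t] = true
		}
	}
}

// ack records the client's ACK for a type and retries sending.
func (s *toyServer) ack(t string) {
	s.awaitingAck[t] = false
	s.flush()
}

func main() {
	s := newToyServer()
	s.awaitingAck["cluster"] = true // an earlier cluster response is still un-ACKed
	s.pending["cluster"] = true     // the new cluster is in the cache
	s.pending["listener"] = true    // so is the listener that references it
	s.flush()                       // only the listener can go out
	s.ack("cluster")                // the late ACK finally releases the cluster
	fmt.Println(s.sent)             // prints "[listener cluster]"
}
```

Even though `flush` tries clusters first, the listener still wins whenever the cluster stream is blocked on an ACK, which matches the symptom in the logs.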

zhaohuabing avatar Oct 30 '24 01:10 zhaohuabing

Found an issue in go-control-plane: https://github.com/envoyproxy/go-control-plane/issues/448

zhaohuabing avatar Mar 14 '25 06:03 zhaohuabing

@zhaohuabing can you share the full configuration of keycloak_gateway-conformance-infra_80 cluster? (the full log link shared requires permission, and I cannot open it) Ordering of ADS resources (https://github.com/envoyproxy/go-control-plane/pull/752) does not guarantee that a cluster is warmed up before a listener is if warming up the cluster requires asynchronous operations (e.g. resolving DNS names, or fetching EDS). Consequently a listener may be initialized before the cluster is ready even when go-control-plane sends the listener configuration after the cluster configuration.

jparklab avatar Mar 14 '25 17:03 jparklab

@jparklab the log has been deleted. There is a similar issue in JWT auth: https://github.com/envoyproxy/gateway/issues/4791

zhaohuabing avatar May 27 '25 03:05 zhaohuabing

> Ordering of ADS resources (https://github.com/envoyproxy/go-control-plane/pull/752) does not guarantee that a cluster is warmed up before a listener is if warming up the cluster requires asynchronous operations (e.g. resolving DNS names, or fetching EDS). Consequently a listener may be initialized before the cluster is ready even when go-control-plane sends the listener configuration after the cluster configuration.

This sounds like an Envoy bug to me. If cluster warming is asynchronous, Envoy should not report an error and reject a listener that references a missing cluster when it receives the listener over xDS. Instead, it should check for the cluster when a request is sent.
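To make the difference between the two behaviors concrete, here is a toy Go sketch contrasting a check at configuration time with a check at request time. Everything in it (`toyProxy`, `addListenerEager`, `addListenerLazy`, `handleRequest`) is hypothetical and only models the idea; it does not reflect Envoy's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnknownCluster = errors.New("unknown cluster")

// toyProxy is a hypothetical stand-in for a proxy's config state.
type toyProxy struct {
	clusters  map[string]bool   // cluster name -> known to the proxy
	listeners map[string]string // listener name -> cluster its filter references
}

func newToyProxy() *toyProxy {
	return &toyProxy{clusters: map[string]bool{}, listeners: map[string]string{}}
}

// addListenerEager rejects the listener if the cluster is unknown when the
// listener config arrives (the behavior seen in the logs).
func (p *toyProxy) addListenerEager(name, cluster string) error {
	if !p.clusters[cluster] {
		return fmt.Errorf("listener %q: %w %q", name, errUnknownCluster, cluster)
	}
	p.listeners[name] = cluster
	return nil
}

// addListenerLazy always accepts the listener and defers the cluster check.
func (p *toyProxy) addListenerLazy(name, cluster string) {
	p.listeners[name] = cluster
}

// handleRequest looks the cluster up only when a request needs it.
func (p *toyProxy) handleRequest(listener string) error {
	cluster, ok := p.listeners[listener]
	if !ok {
		return fmt.Errorf("no such listener %q", listener)
	}
	if !p.clusters[cluster] {
		return fmt.Errorf("request on %q: %w %q", listener, errUnknownCluster, cluster)
	}
	return nil
}

func main() {
	p := newToyProxy()
	fmt.Println("eager:", p.addListenerEager("http", "keycloak")) // rejected

	p.addListenerLazy("http", "keycloak") // accepted despite missing cluster
	fmt.Println("before cluster:", p.handleRequest("http"))

	p.clusters["keycloak"] = true // the cluster arrives later
	fmt.Println("after cluster:", p.handleRequest("http"))
}
```

In the lazy variant a late cluster only causes failed requests during the gap, while in the eager variant the listener update is rejected outright and has to be resent.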

zhaohuabing avatar May 27 '25 04:05 zhaohuabing

Hey @zhaohuabing, can we close this, since it was fixed with https://github.com/envoyproxy/envoy/issues/40735? Thanks for fixing this upstream!

arkodg avatar Sep 10 '25 01:09 arkodg