envoy
envoy copied to clipboard
ext_proc: gcc test flakes: IpVersionsClientTypeDeferredProcessing/ExtProcIntegrationTest.GetAndSetTrailersIncorrectlyOnResponse/IPv6_GoogleGrpc_NoDeferredProcessing
This flaked for me in a CI: https://dev.azure.com/cncf/envoy/_build/results?buildId=165561&view=logs&j=8bf29878-a4cc-50f7-4e84-2255e6fd4065&t=150a3762-112b-5713-0d28-af1eb6c2fbf4
2024-03-20T19:10:36.7571536Z [ RUN ] IpVersionsClientTypeDeferredProcessing/ExtProcIntegrationTest.GetAndSetTrailersIncorrectlyOnResponse/IPv6_GoogleGrpc_NoDeferredProcessing
2024-03-20T19:10:36.7572847Z test/extensions/filters/http/ext_proc/ext_proc_integration_test.cc:252: Failure
2024-03-20T19:10:36.7573591Z Value of: response.waitForEndStream()
2024-03-20T19:10:36.7574178Z Actual: false (Timed out waiting for end stream
2024-03-20T19:10:36.7574832Z )
2024-03-20T19:10:36.7575165Z Expected: true
2024-03-20T19:10:36.7575524Z Stack trace:
2024-03-20T19:10:36.7576086Z 0x5594d9350370: Envoy::ExtProcIntegrationTest::verifyDownstreamResponse()
2024-03-20T19:10:36.7577124Z 0x5594d92c450e: Envoy::ExtProcIntegrationTest_GetAndSetTrailersIncorrectlyOnResponse_Test::TestBody()
2024-03-20T19:10:36.7578166Z 0x5594dd083bed: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
2024-03-20T19:10:36.7579047Z 0x5594dd07e89d: testing::internal::HandleExceptionsInMethodIfSupported<>()
2024-03-20T19:10:36.7579736Z 0x5594dd065128: testing::Test::Run()
2024-03-20T19:10:36.7580250Z 0x5594dd065b4b: testing::TestInfo::Run()
2024-03-20T19:10:36.7580762Z ... Google Test internal frames ...
2024-03-20T19:10:36.7581046Z
2024-03-20T19:10:36.7582221Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_defer_processing_backedup_streams to: true
2024-03-20T19:10:36.7583968Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
2024-03-20T19:10:36.7585663Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
2024-03-20T19:10:36.7613937Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_immediate_response_use_filter_mutation_rule to: true
2024-03-20T19:10:36.7616082Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_send_header_raw_value to: true
2024-03-20T19:10:36.7617753Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
2024-03-20T19:10:36.7619474Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_defer_processing_backedup_streams to: true
2024-03-20T19:10:36.7621225Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
2024-03-20T19:10:36.7622934Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
2024-03-20T19:10:36.7624732Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_immediate_response_use_filter_mutation_rule to: true
2024-03-20T19:10:36.7626586Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_send_header_raw_value to: true
2024-03-20T19:10:36.7628342Z [external/com_google_absl/absl/flags/internal/flag.cc : 113] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
2024-03-20T19:10:36.7630916Z [ FAILED ] IpVersionsClientTypeDeferredProcessing/ExtProcIntegrationTest.GetAndSetTrailersIncorrectlyOnResponse/IPv6_GoogleGrpc_NoDeferredProcessing, where GetParam() = ((4-byte object <01-00 00-00>, 4-byte object <01-00 00-00>), false) (10547 ms)
Response timeouts. This test is flaky. Retest should be able to get the CI pass.
Overall, ext_proc_integration_test.cc is growing too big. We probably should start to convert some of the test cases to unit test and move them to filter_test.cc. @tyxia @stevenzzzz
We should not consider this state to be acceptable. We should use simulated time in tests, where possible, to make tests not be brittle to time. If that is impossible we should allow much more generous timeouts.
I am thinking we probably can first try to use timeout value larger than TestUtility::DefaultTimeout for the trailer case, to see if it helps to de-flake.
Regarding converting to unit test. I personally like integration test (and over unit test :)), as it tests the code in a closer-to-production and more E2E way. If we have concern with its size, we probably can refactor it into multiple integration test.
Another case: https://dev.azure.com/cncf/envoy/_build/results?buildId=168324&view=logs&j=8c169225-0ae8-53bd-947f-07cb81846cb5&t=d1a98671-b7ba-5fbf-f06c-ff337c010df4&l=1021
ext_proc has side stream process which adds round trip latency for a request. This probably is the reason why these ext_proc tests are frequently flaky due to timeout. As a first remedy, we can certainly increase the timeout to reduce the probability of the flakiness
Long term, for tests only need to verify ext_proc filter behavior/state machine, actually filter_test.cc: one example, https://github.com/envoyproxy/envoy/blob/313b6fb7cf0f806e74a2d42c93e7c1fcccce2391/test/extensions/filters/http/ext_proc/filter_test.cc#L587 can test most of them. Tests in this file using simulated timer. We very rarely see flakiness in this test.
@tyxia yes, integration tests is end-to-end and close to production. The draw back is it has to create an Envoy instance, which is slow and expensive for tests not necessarily need it.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Still happening (in tsan). https://github.com/envoyproxy/envoy/actions/runs/14583047690/job/40903497197
[ RUN ] IpVersionsClientTypeDeferredProcessing/ExtProcIntegrationTest.GetAndCloseStream/3
test/integration/fake_upstream.cc:812: Failure
Failed
Please don't waitForHttpConnection with a 5s timeout if failure is expected
Stack trace:
0x55a5c5965c38: Envoy::FakeUpstream::waitForHttpConnection()
0x55a5c40f6c1b: Envoy::ExtProcIntegrationTest::handleUpstreamRequest()
0x55a5c40f52e2: Envoy::ExtProcIntegrationTest::testGetAndCloseStream()
0x55a5c3ec0130: Envoy::ExtProcIntegrationTest_GetAndCloseStream_Test::TestBody()
0x55a5c3ec01dd: Envoy::ExtProcIntegrationTest_GetAndCloseStream_Test::TestBody()
0x55a5cce1406d: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
0x55a5ccde571b: testing::internal::HandleExceptionsInMethodIfSupported<>()
0x55a5ccdc9c69: testing::Test::Run()
0x55a5ccdcaab2: testing::TestInfo::Run()
... Google Test internal frames ...
test/extensions/filters/http/ext_proc/ext_proc_integration_test.cc:340: Failure
Value of: fake_upstreams_[0]->waitForHttpConnection(*dispatcher_, fake_upstream_connection_)
Actual: false (Timed out waiting for new connection.)
Expected: true