IpVersions/WrrLocalityEdsIntegrationTest.AddRemoveLocality/1 test flake

Open ggreenway opened this issue 2 months ago • 10 comments

This occurred in CI on main (https://github.com/envoyproxy/envoy/actions/runs/18883242829/job/53891878709)

[ RUN      ] IpVersions/WrrLocalityEdsIntegrationTest.AddRemoveLocality/1
[2025-10-28 17:43:08.953][4380][critical][assert] [test/integration/http_integration.cc:573] assert failure: 0. Details: Timed out waiting for new connection.
[2025-10-28 17:43:08.953][4380][error][envoy_bug] [./source/common/common/assert.h:38] stacktrace for envoy bug
[2025-10-28 17:43:09.006][4380][error][envoy_bug] [./source/common/common/assert.h:43] #0 Envoy::HttpIntegrationTest::waitForNextUpstreamRequest() [0xaaaab45300b4]
[2025-10-28 17:43:09.009][4380][error][envoy_bug] [./source/common/common/assert.h:43] #1 Envoy::Extensions::LoadBalancingPolicies::WrrLocality::(anonymous namespace)::WrrLocalityEdsIntegrationTest::sendRequestsAndTrackUpstreamUsage() [0xaaaab45109f8]
[2025-10-28 17:43:09.011][4380][error][envoy_bug] [./source/common/common/assert.h:43] #2 Envoy::Extensions::LoadBalancingPolicies::WrrLocality::(anonymous namespace)::WrrLocalityEdsIntegrationTest_AddRemoveLocality_Test::TestBody() [0xaaaab45134bc]
[2025-10-28 17:43:09.014][4380][error][envoy_bug] [./source/common/common/assert.h:43] #3 testing::internal::HandleExceptionsInMethodIfSupported<>() [0xaaaab59fd628]
[2025-10-28 17:43:09.016][4380][error][envoy_bug] [./source/common/common/assert.h:43] #4 testing::Test::Run() [0xaaaab59fd4c4]
[2025-10-28 17:43:09.018][4380][error][envoy_bug] [./source/common/common/assert.h:43] #5 testing::TestInfo::Run() [0xaaaab59fe690]
[2025-10-28 17:43:09.020][4380][error][envoy_bug] [./source/common/common/assert.h:43] #6 testing::TestSuite::Run() [0xaaaab59ff990]
[2025-10-28 17:43:09.023][4380][error][envoy_bug] [./source/common/common/assert.h:43] #7 testing::internal::UnitTestImpl::RunAllTests() [0xaaaab5a0fd10]
[2025-10-28 17:43:09.025][4380][error][envoy_bug] [./source/common/common/assert.h:43] #8 testing::internal::HandleExceptionsInMethodIfSupported<>() [0xaaaab5a0f48c]
[2025-10-28 17:43:09.030][4380][error][envoy_bug] [./source/common/common/assert.h:43] #9 testing::UnitTest::Run() [0xaaaab5a0f298]
[2025-10-28 17:43:09.032][4380][error][envoy_bug] [./source/common/common/assert.h:43] #10 Envoy::TestRunner::runTests() [0xaaaab536ede4]
[2025-10-28 17:43:09.035][4380][error][envoy_bug] [./source/common/common/assert.h:43] #11 main [0xaaaab536dc8c]
[2025-10-28 17:43:09.035][4380][error][envoy_bug] [./source/common/common/assert.h:43] #12 __libc_start_main [0xffff88611e10]
[2025-10-28 17:43:09.035][4380][critical][backtrace] [./source/server/backtrace.h:129] Caught Aborted, suspect faulting address 0x6c0000111c
[2025-10-28 17:43:09.035][4380][critical][backtrace] [./source/server/backtrace.h:113] Backtrace (use tools/stack_decode.py to get line numbers):
[2025-10-28 17:43:09.035][4380][critical][backtrace] [./source/server/backtrace.h:114] Envoy version: 0/1.37.0-dev/test/RELEASE/BoringSSL
Execution result: https://mordenite.cluster.engflow.com/actions/executions/ChCNM6SiTitdNJMUanTHTSNWEgdkZWZhdWx0GiUKIJuD9kOZzkZPGhPoxBT2eUlDigRCog5CzDJTYWF4q9DGEKcD
================================================================================
INFO: From Testing //test/extensions/load_balancing_policies/wrr_locality:integration_test:
INFO: Found 1917 targets and 1524 test targets...
INFO: Elapsed time: 2152.269s, Critical Path: 1038.50s
INFO: 43890 processes: 17414 remote cache hit, 18448 internal, 1 local, 11 processwrapper-sandbox, 8016 remote.
INFO: Build completed, 1 test FAILED, 43890 total actions
//test/extensions/load_balancing_policies/wrr_locality:integration_test  FAILED in 58.3s

ggreenway avatar Oct 29 '25 15:10 ggreenway

Test was recently added in https://github.com/envoyproxy/envoy/pull/41689 by @efimki.

ggreenway avatar Oct 29 '25 15:10 ggreenway

cc @adisuissa I thought this was fixed, but I'm still seeing it (testing in a private repo).

phlax avatar Nov 26 '25 12:11 phlax

Thanks for the update. Can you paste the execution command and the entire log file?

adisuissa avatar Nov 26 '25 12:11 adisuissa

https://mordenite.cluster.engflow.com/invocations/default/5fb3647e-35e6-4ccd-9d6f-2084d127b4fe

https://mordenite.cluster.engflow.com/actions/executions/ChBwphAc8IZTJIgddP4dz-0bEgdkZWZhdWx0GiUKIIcSYTim0bKXZFBmmte_eg-C8xKQbWkZxD19Dg8MK8yzEJUD

phlax avatar Nov 26 '25 12:11 phlax

The exit code is 134, which corresponds to SIGABRT (128 + 6), and may or may not be an OOM.
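
For reference, a minimal sketch of how an exit code such as 134 maps back to a signal, assuming the usual shell convention that a status above 128 means "terminated by signal (status - 128)". The helper name `describe_exit_code` is hypothetical and is not part of Envoy or Bazel.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status (assumes the 128 + N convention)."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name for this signal number on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "an unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited with status {code}"


print(describe_exit_code(134))
```

On Linux, signal 6 is SIGABRT, which is what a failed assert raises via `abort()`. A kernel OOM kill normally shows up as SIGKILL (exit code 137) instead, so 134 on its own does not point to an OOM, although it does not rule out memory pressure as an indirect cause.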

phlax avatar Nov 26 '25 12:11 phlax

The failure seems to be in ClientSideWeightedRoundRobinXdsIntegrationTest, which is a different test from this one. I have a pending PR (#41878) that addresses other tests in this file. After it is merged, I'll continue with a different set of tests and deflake them.

adisuissa avatar Nov 26 '25 13:11 adisuissa

Here's a failure in this repo: https://github.com/envoyproxy/envoy/actions/runs/19676719086/job/56359807168#step:21:650

Again, it's a slightly different test, but I think it has the same cause.

phlax avatar Nov 26 '25 14:11 phlax

ClientSideWeightedRoundRobinXdsIntegrationTest is a different set of tests that must be refactored.

adisuissa avatar Nov 26 '25 14:11 adisuissa

The xDS test is still failing (it failed on the commit that fixes the EDS tests):

https://github.com/envoyproxy/envoy/actions/runs/19800652450/job/56727441586#step:21:620

phlax avatar Dec 08 '25 10:12 phlax

AFAICT this failure is in IpVersions/ClientSideWeightedRoundRobinXdsIntegrationTest.TwoClusters, and for some reason the trace doesn't have symbol information:

[ RUN      ] IpVersions/ClientSideWeightedRoundRobinXdsIntegrationTest.TwoClusters/1
#0 [0x55bb370b7f00]
#1 __restore_rt [0x7f5799509420]
#2 [0x55bb32b4c1a0]
#3 [0x55bb32b4ba1e]
#4 [0x55bb32b4cd61]
#5 [0x55bb32b52ac0]
#6 [0x55bb32b52b30]
#7 [0x55bb329e9143]
#8 [0x55bb387c32ac]
#9 [0x55bb3879b20d]
#10 [0x55bb3877be5c]
#11 [0x55bb3877c975]
#12 [0x55bb3877d2f6]
#13 [0x55bb3878e26e]
#14 [0x55bb387c682c]
#15 [0x55bb3879e16d]
#16 [0x55bb3878d902]
#17 [0x55bb36a55c53]
#18 [0x55bb36a5508e]
#19 [0x55bb36a5298a]
#20 __libc_start_main [0x7f5799327083]
external/bazel_tools/tools/test/collect_coverage.sh: line 166:   858 Aborted                 (core dumped) "$@"
--

I'll try to reproduce this locally.
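
As a rough aid for traces like the one above, here is a minimal sketch that pulls the frame numbers and raw addresses out of an unsymbolized backtrace so they can be passed to a symbolizer. The regular expression and the `parse_frames` helper are assumptions based only on the line format shown in this thread; they are not part of Envoy's tooling.

```python
import re

# Matches lines such as "#2 [0x55bb32b4c1a0]" or "#1 __restore_rt [0x7f5799509420]".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:(\S.*?)\s+)?\[(0x[0-9a-fA-F]+)\]$")


def parse_frames(trace: str):
    """Return (frame_index, symbol_or_None, address) for each backtrace line."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            index, symbol, address = match.groups()
            frames.append((int(index), symbol, int(address, 16)))
    return frames


sample = """\
#0 [0x55bb370b7f00]
#1 __restore_rt [0x7f5799509420]
#2 [0x55bb32b4c1a0]
"""

for index, symbol, address in parse_frames(sample):
    print(index, symbol or "<no symbol>", hex(address))
```

The extracted addresses are only meaningful together with the exact binary that produced them; for a position-independent executable, the load base would also have to be subtracted before symbolizing.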

efimki avatar Dec 08 '25 16:12 efimki