debug assertion failure: `assert failure: isIdleImpl()`

Title: debug assertion failure: assert failure: isIdleImpl()

Description: In our CI with debug builds, we are regularly seeing this assertion failure.

2022-08-04T22:11:20.554576Z	critical	envoy assert	assert failure: isIdleImpl().
2022-08-04T22:11:20.554714Z	critical	envoy backtrace	Caught Aborted, suspect faulting address 0x53900000018
2022-08-04T22:11:20.554729Z	critical	envoy backtrace	Backtrace (use tools/stack_decode.py to get line numbers):
2022-08-04T22:11:20.554739Z	critical	envoy backtrace	Envoy version: 01baf88a8942c273181c830b4307bfe70bd839c0/1.24.0-dev/Clean/DEBUG/BoringSSL
2022-08-04T22:11:20.646444Z	critical	envoy backtrace	#0: Envoy::SignalAction::sigHandler() [0x55787082aedc]
2022-08-04T22:11:20.646627Z	critical	envoy backtrace	#1: __restore_rt [0x7fe4307ab140]
2022-08-04T22:11:20.729793Z	critical	envoy backtrace	#2: Envoy::Tcp::ConnPoolImpl::deleteIsPending() [0x55786ee5e6e5]
2022-08-04T22:11:20.816228Z	critical	envoy backtrace	#3: Envoy::Event::DispatcherImpl::deferredDelete() [0x557870440645]
2022-08-04T22:11:20.901629Z	critical	envoy backtrace	#4: Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::tcpConnPoolIsIdle() [0x55786edd54b1]
2022-08-04T22:11:20.994375Z	critical	envoy backtrace	#5: Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::tcpConnPoolImpl()::$_24::operator()() [0x55786ee05891]
2022-08-04T22:11:21.111388Z	critical	envoy backtrace	#6: std::__1::__invoke<>() [0x55786ee0582d]
2022-08-04T22:11:21.222222Z	critical	envoy backtrace	#7: std::__1::__invoke_void_return_wrapper<>::__call<>() [0x55786ee057dd]
[2022-08-04T22:11:20.230Z] "- - -" 0 - - - "-" 1419 3114 320 - "-" "-" "-" "-" "10.244.0.26:18443" inbound|18443|| 127.0.0.6:38235 10.244.0.26:18443 10.244.0.33:37728 outbound_.18443_._.headless.echo1-3-39127.svc.cluster.local -
[2022-08-04T22:11:20.230Z] "- - -" 0 - - - "-" 1419 3114 320 - "-" "-" "-" "-" "10.244.0.26:18443" inbound|18443|| 127.0.0.6:46235 10.244.0.26:18443 10.244.0.33:37732 outbound_.18443_._.headless.echo1-3-39127.svc.cluster.local -
[2022-08-04T22:11:20.230Z] "- - -" 0 - - - "-" 1419 3114 322 - "-" "-" "-" "-" "10.244.0.26:18443" inbound|18443|| 127.0.0.6:60737 10.244.0.26:18443 10.244.0.33:37730 outbound_.18443_._.headless.echo1-3-39127.svc.cluster.local -
[2022-08-04T22:11:20.552Z] "- - -" 0 - - - "-" 1419 3114 12 - "-" "-" "-" "-" "10.244.0.26:18443" inbound|18443|| 127.0.0.6:56849 10.244.0.26:18443 10.244.0.33:37774 outbound_.18443_._.headless.echo1-3-39127.svc.cluster.local -
2022-08-04T22:11:21.334186Z	critical	envoy backtrace	#8: std::__1::__function::__alloc_func<>::operator()() [0x55786ee057ad]
2022-08-04T22:11:21.446329Z	critical	envoy backtrace	#9: std::__1::__function::__func<>::operator()() [0x55786ee04969]
2022-08-04T22:11:21.588074Z	critical	envoy backtrace	#10: std::__1::__function::__value_func<>::operator()() [0x557869d78252]
2022-08-04T22:11:21.707729Z	critical	envoy backtrace	#11: std::__1::function<>::operator()() [0x557869d78215]
2022-08-04T22:11:21.837349Z	critical	envoy backtrace	#12: Envoy::ConnectionPool::ConnPoolImplBase::checkForIdleAndNotify() [0x55786f362fcd]
2022-08-04T22:11:21.954837Z	critical	envoy backtrace	#13: Envoy::ConnectionPool::ConnPoolImplBase::checkForIdleAndCloseIdleConnsIfDraining() [0x55786f362c73]
2022-08-04T22:11:22.071136Z	critical	envoy backtrace	#14: Envoy::Tcp::ActiveTcpClient::~ActiveTcpClient() [0x55786f321016]
2022-08-04T22:11:22.211542Z	critical	envoy backtrace	#15: Envoy::Tcp::ActiveTcpClient::~ActiveTcpClient() [0x55786f3210f9]
2022-08-04T22:11:22.365065Z	critical	envoy backtrace	#16: std::__1::default_delete<>::operator()() [0x55786a10e6bc]
2022-08-04T22:11:22.489220Z	critical	envoy backtrace	#17: std::__1::unique_ptr<>::reset() [0x55786a10e63c]
2022-08-04T22:11:22.610566Z	critical	envoy backtrace	#18: Envoy::Event::DispatcherImpl::clearDeferredDeleteList() [0x55787043ca34]
2022-08-04T22:11:22.727033Z	critical	envoy backtrace	#19: Envoy::Event::DispatcherImpl::DispatcherImpl()::$_2::operator()() [0x5578704487e9]
2022-08-04T22:11:22.841456Z	critical	envoy backtrace	#20: std::__1::__invoke<>() [0x5578704487ad]
2022-08-04T22:11:22.949119Z	critical	envoy backtrace	#21: std::__1::__invoke_void_return_wrapper<>::__call<>() [0x55787044875d]
2022-08-04T22:11:23.038542Z	critical	envoy backtrace	#22: std::__1::__function::__alloc_func<>::operator()() [0x55787044872d]
2022-08-04T22:11:23.132226Z	critical	envoy backtrace	#23: std::__1::__function::__func<>::operator()() [0x557870447879]
2022-08-04T22:11:23.132367Z	critical	envoy backtrace	#24: std::__1::__function::__value_func<>::operator()() [0x557869d78252]
2022-08-04T22:11:23.132395Z	critical	envoy backtrace	#25: std::__1::function<>::operator()() [0x557869d78215]
2022-08-04T22:11:23.224456Z	critical	envoy backtrace	#26: Envoy::Event::SchedulableCallbackImpl::SchedulableCallbackImpl()::$_0::operator()() [0x557870798ab2]
2022-08-04T22:11:23.314165Z	critical	envoy backtrace	#27: Envoy::Event::SchedulableCallbackImpl::SchedulableCallbackImpl()::$_0::__invoke() [0x557870798a79]
2022-08-04T22:11:23.407180Z	critical	envoy backtrace	#28: event_process_active_single_queue [0x5578707ce10c]
2022-08-04T22:11:23.499323Z	critical	envoy backtrace	#29: event_process_active [0x5578707c8988]
2022-08-04T22:11:23.607093Z	critical	envoy backtrace	#30: event_base_loop [0x5578707c788c]
2022-08-04T22:11:23.710416Z	critical	envoy backtrace	#31: Envoy::Event::LibeventScheduler::run() [0x55787079696f]
2022-08-04T22:11:23.806877Z	critical	envoy backtrace	#32: Envoy::Event::DispatcherImpl::run() [0x557870441129]
2022-08-04T22:11:23.909969Z	critical	envoy backtrace	#33: Envoy::Server::WorkerImpl::threadRoutine() [0x55786ecdca1e]
2022-08-04T22:11:23.998286Z	critical	envoy backtrace	#34: Envoy::Server::WorkerImpl::start()::$_6::operator()() [0x55786ece6ce0]
2022-08-04T22:11:24.084979Z	critical	envoy backtrace	#35: std::__1::__invoke<>() [0x55786ece6c9d]
2022-08-04T22:11:24.181252Z	critical	envoy backtrace	#36: std::__1::__invoke_void_return_wrapper<>::__call<>() [0x55786ece6c4d]
2022-08-04T22:11:24.280837Z	critical	envoy backtrace	#37: std::__1::__function::__alloc_func<>::operator()() [0x55786ece6c1d]
2022-08-04T22:11:24.370161Z	critical	envoy backtrace	#38: std::__1::__function::__func<>::operator()() [0x55786ece5e19]
2022-08-04T22:11:24.370216Z	critical	envoy backtrace	#39: std::__1::__function::__value_func<>::operator()() [0x557869d78252]
2022-08-04T22:11:24.370226Z	critical	envoy backtrace	#40: std::__1::function<>::operator()() [0x557869d78215]
2022-08-04T22:11:24.458427Z	critical	envoy backtrace	#41: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::operator()() [0x557870c965cd]
2022-08-04T22:11:24.553854Z	critical	envoy backtrace	#42: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x557870c965a5]
2022-08-04T22:11:24.553982Z	critical	envoy backtrace	#43: start_thread [0x7fe43079fea7]

I don't know when this started, as we don't have regular runs with debug images (yet).

Repro steps: I don't have a minimal repro, but we hit it easily in our CI.

Example failures:

https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security-istiodremote_istio/1555312439514370048/artifacts/security-428b85071da243ee82d3fb/TestPassThroughFilterChain/DISABLE_with_PERMISSIVE/_test_context/echo1-2-83331-state2134015854/config-0/tproxy-v1-67ff8b6b87-8lkcl_istio-proxy.previous.log
https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security-istiodremote_istio/1555312439514370048/artifacts/security-428b85071da243ee82d3fb/TestPassThroughFilterChain/DISABLE_with_PERMISSIVE/_test_context/echo1-2-83331-state2134015854/config-0/b-v1-b894cccd9-mkjzf_istio-proxy.previous.log
https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-distroless_istio/1555312434737057792/artifacts/security-5a6e9dece7ce4accb167b8/TestReachability/global_mtls_strict/_test_context/echo1-3-39127-state3347915103/primary-0/headless-v1-6bccb45b85-mbb9w_istio-proxy.previous.log
https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security_istio/1555312437832454144/artifacts/security-74627bac803844689bbd7f/TestPassThroughFilterChain/DISABLE/_test_context/echo1-3-55147-state4248486318/primary-0/b-v1-7ddb945cdb-9m4x5_istio-proxy.previous.log
https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security_istio/1555312437832454144/artifacts/security-74627bac803844689bbd7f/TestPassThroughFilterChain/DISABLE/_test_context/echo1-3-55147-state4248486318/primary-0/tproxy-v1-59f9b9d764-rw5wl_istio-proxy.previous.log

Some core dumps:

https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security_istio/1555312437832454144/artifacts/security-74627bac803844689bbd7f/TestPassThroughFilterChain/DISABLE/_test_context/echo1-3-55147-state4248486318/primary-0/b-v1-7ddb945cdb-9m4x5_core.proxy.25
https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security-istiodremote_istio/1555312439514370048/artifacts/security-428b85071da243ee82d3fb/TestPassThroughFilterChain/DISABLE_with_PERMISSIVE/headless/to_tproxy/mtls/tcp-wl-only%28allow%29/_test_context/echo1-2-83331-state2821089782/config-0/tproxy-v1-67ff8b6b87-8lkcl_core.proxy.26
https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security-istiodremote_istio/1555312439514370048/artifacts/security-428b85071da243ee82d3fb/TestPassThroughFilterChain/DISABLE_with_PERMISSIVE/headless/to_tproxy/mtls/tcp-wl-only%28allow%29/_test_context/echo1-2-83331-state2821089782/config-0/tproxy-v1-67ff8b6b87-8lkcl_core.proxy.28
https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/40290/integ-security-istiodremote_istio/1555312439514370048/artifacts/security-428b85071da243ee82d3fb/TestPassThroughFilterChain/DISABLE_with_PERMISSIVE/headless/to_tproxy/mtls/tcp-wl-only%28allow%29/_test_context/echo1-2-83331-state2821089782/config-0/tproxy-v1-67ff8b6b87-8lkcl_core.proxy.31

Envoy build: istio/proxy 01baf88a8942c273181c830b4307bfe70bd839c0, which is envoyproxy/envoy a95d1b7f751e03c5033b84223c8b70acb786fd2b from Aug 3.

cc @kyessenov

howardjohn avatar Aug 05 '22 15:08 howardjohn

The stack seems to say that a pool is deferred-deleted while not idle. That's not expected; it could be related to the lifecycle of CDS/EDS updates.
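
Concretely, the invariant in play, as a minimal standalone paraphrase inferred from the symbols in the backtrace (an assumed shape, not the exact Envoy source): deferredDelete() hands the pool to the dispatcher, deleteIsPending() is invoked on it, and that asserts the pool is idle.

#include <cassert>
#include <list>

// Minimal paraphrase of the asserted invariant (assumed shape, not the exact
// Envoy source): a pool may only enter the deferred-delete list once it holds
// no streams or clients at all.
struct ConnPoolModel {
  std::list<int> pending_streams_, ready_clients_, busy_clients_, connecting_clients_;

  bool isIdle() const {
    // "Idle" = nothing pending, ready, busy, or connecting.
    return pending_streams_.empty() && ready_clients_.empty() &&
           busy_clients_.empty() && connecting_clients_.empty();
  }

  // Equivalent of deleteIsPending() in the backtrace: called when the
  // dispatcher queues the pool for deferred deletion.
  void deleteIsPending() { assert(isIdle()); }
};

int main() {
  ConnPoolModel pool;
  pool.deleteIsPending();               // fine: the pool is idle
  pool.pending_streams_.push_back(1);
  // pool.deleteIsPending();            // would abort: the reported failure
}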

kyessenov avatar Aug 08 '22 17:08 kyessenov

Anything I can do to get more info here?

howardjohn avatar Aug 08 '22 17:08 howardjohn

cc @alyssawilk. I cannot point to a reason beyond guessing it's related to the lifecycle of internal sockets. Maybe someone else has a better idea. The next step is to create a reproducer.

kyessenov avatar Aug 08 '22 18:08 kyessenov

This one is not using the internal listener at all; this is just mainline Istio from master.

howardjohn avatar Aug 08 '22 18:08 howardjohn

Sorry, back from vacation. Can you tell me a bit more about your setup? The most confusing part is that this looks like it's under void ConnPoolImplBase::checkForIdleAndNotify() { if (isIdleImpl()) { ... }, so we're verifying the pool is idle, we're calling the idle callbacks, we're deleting the pool, and somewhere in between it's no longer idle? Feel free to ping me internally if that's easier.
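
To make that window concrete, here is a minimal standalone model of the sequencing (an assumption about the failure mode, not the actual Envoy code): the first idleness check passes, an idle callback runs arbitrary code (in the real stack, tcpConnPoolIsIdle() -> deferredDelete() -> deleteIsPending()), and the re-check fails if anything in between put a stream back on the pool.

#include <cassert>
#include <functional>
#include <vector>

// Toy model of the suspected window (an assumption, not Envoy source).
struct PoolModel {
  int pending_streams_ = 0;
  std::vector<std::function<void()>> idle_callbacks_;

  bool isIdle() const { return pending_streams_ == 0; }

  void checkForIdleAndNotify() {
    if (isIdle()) {                    // first check passes
      for (auto& cb : idle_callbacks_) {
        cb();                          // idle callback runs arbitrary code
      }
    }
  }

  // Equivalent of the deleteIsPending() re-check reached via deferredDelete().
  void deleteIsPending() { assert(isIdle()); }
};

int main() {
  PoolModel pool;
  pool.idle_callbacks_.push_back([&pool] {
    pool.pending_streams_ = 1; // something re-enters the pool with a new stream
    pool.deleteIsPending();    // assertion failure, matching the reports
  });
  pool.checkForIdleAndNotify();
}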

alyssawilk avatar Aug 23 '22 17:08 alyssawilk

@alyssawilk I've seen that happen when the Host lifecycle is broken (e.g., two Host structures created at the same memory address with ORIGINAL_DST). But there's not enough detail in the stack to tell exactly.

kyessenov avatar Aug 23 '22 17:08 kyessenov

What info do you need? I can consistently reproduce it by running Istio tests in CI and locally, but I haven't gotten to a minimal Envoy-only config.

Here are some debug logs

logs-1.txt logs-2.txt

On the control plane side it looks like we push 2 CDS/EDS pairs back to back:

2022-08-24T14:34:00.701648Z     info    ads     CDS: PUSH for node:tproxy-v1-7797758b46-zzsw4.echo2 resources:370 size:273.0kB cached:333/353
2022-08-24T14:34:00.710138Z     info    ads     CDS: PUSH for node:tproxy-v1-7797758b46-r9254.echo1 resources:370 size:273.0kB cached:341/353
2022-08-24T14:34:00.732212Z     info    ads     EDS: PUSH for node:tproxy-v1-7797758b46-zzsw4.echo2 resources:261 size:45.0kB empty:0 cached:238/261
2022-08-24T14:34:00.739877Z     info    ads     EDS: PUSH for node:tproxy-v1-7797758b46-r9254.echo1 resources:261 size:45.0kB empty:0 cached:246/261
2022-08-24T14:34:00.759502Z     info    ads     LDS: PUSH for node:tproxy-v1-7797758b46-zzsw4.echo2 resources:250 size:1.0MB
2022-08-24T14:34:00.776893Z     info    ads     RDS: PUSH for node:tproxy-v1-7797758b46-zzsw4.echo2 resources:129 size:72.8kB cached:129/129
2022-08-24T14:34:00.777133Z     info    ads     NDS: PUSH for node:tproxy-v1-7797758b46-zzsw4.echo2 resources:1 size:3.0kB
2022-08-24T14:34:00.782627Z     info    ads     LDS: PUSH for node:tproxy-v1-7797758b46-r9254.echo1 resources:250 size:980.2kB
2022-08-24T14:34:00.802469Z     info    ads     RDS: PUSH for node:tproxy-v1-7797758b46-r9254.echo1 resources:129 size:72.8kB cached:129/129
2022-08-24T14:34:01.013185Z     info    ads     ADS: "10.244.0.48:51388" tproxy-v1-7797758b46-r9254.echo1-132 terminated

howardjohn avatar Aug 24 '22 14:08 howardjohn

For me, the most useful thing was to track down the host lifecycle: which cluster it is, and which Host is in it (network address and memory address). I think trace logging gives some of that, but full details require adding more printfs.
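
For illustration, a standalone sketch of the kind of printf meant here (a hypothetical probe, not an actual Envoy patch): log both the network address and the memory address on creation and destruction, so a reused heap address becomes visible in the logs.

#include <cstdio>
#include <memory>
#include <string>

// Hypothetical lifecycle probe (illustration only, not Envoy code): print the
// host's network address AND its heap address on creation/destruction.
struct HostProbe {
  std::string address;
  explicit HostProbe(std::string a) : address(std::move(a)) {
    std::printf("host created: %s @ %p\n", address.c_str(), static_cast<void*>(this));
  }
  ~HostProbe() {
    std::printf("host destroyed: %s @ %p\n", address.c_str(), static_cast<void*>(this));
  }
};

int main() {
  auto h1 = std::make_unique<HostProbe>("10.244.0.34:18083");
  h1.reset(); // first host destroyed
  auto h2 = std::make_unique<HostProbe>("10.244.0.99:18443");
  // If h2 prints the same %p as h1 did, anything keyed by the raw Host
  // pointer can no longer tell the two hosts apart.
}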

kyessenov avatar Aug 24 '22 15:08 kyessenov

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 23 '22 16:09 github-actions[bot]

This is still happening, not stale.

howardjohn avatar Sep 23 '22 16:09 howardjohn

John, if you want traction from Envoy maintainers, we need a self-contained example. This is such a core part of Envoy that it's impossible to tell what leads to it.

kyessenov avatar Sep 23 '22 16:09 kyessenov

We have it constantly reproducible on our test setup under live traffic. I've added a backtrace before the assertion:

[critical][backtrace] [external/envoy/source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[critical][backtrace] [external/envoy/source/server/backtrace.h:92] Envoy version: 55f6103/1.24.0-dev/Clean/DEBUG/BoringSSL
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #0: Envoy::ConnectionPool::ConnPoolImplBase::deleteIsPendingImpl() [0x564bf554aec5]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #1: Envoy::Tcp::ConnPoolImpl::deleteIsPending() [0x564bf4523c65]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #2: Envoy::Tcp::ConnPoolImpl::deleteIsPending() [0x564bf4524d90]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #3: Envoy::Event::DispatcherImpl::deferredDelete() [0x564bf75c4363]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #4: Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::tcpConnPoolIsIdle() [0x564bf442adc7]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #5: Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::tcpConnPoolImpl()::$_24::operator()() [0x564bf4446ab9]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #6: std::__invoke_impl<>() [0x564bf4446996]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #7: std::__invoke_r<>() [0x564bf4446946]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #8: std::_Function_handler<>::_M_invoke() [0x564bf44466b0]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #9: std::function<>::operator()() [0x564bf25265d7]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #10: Envoy::ConnectionPool::ConnPoolImplBase::checkForIdleAndNotify() [0x564bf5560f40]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #11: Envoy::ConnectionPool::ConnPoolImplBase::checkForIdleAndCloseIdleConnsIfDraining() [0x564bf5560b30]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #12: Envoy::Tcp::ActiveTcpClient::~ActiveTcpClient() [0x564bf54d4b3e]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #13: Envoy::Tcp::ActiveTcpClient::~ActiveTcpClient() [0x564bf54d5190]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #14: Envoy::Tcp::ActiveTcpClient::~ActiveTcpClient() [0x564bf54d51bd]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #15: std::default_delete<>::operator()() [0x564bf359cd34]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #16: std::__uniq_ptr_impl<>::reset() [0x564bf75da6ad]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #17: std::unique_ptr<>::reset() [0x564bf75d0b15]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #18: Envoy::Event::DispatcherImpl::clearDeferredDeleteList() [0x564bf75bdf7b]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #19: Envoy::Event::DispatcherImpl::DispatcherImpl()::$_2::operator()() [0x564bf75cc0be]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #20: std::__invoke_impl<>() [0x564bf75cc056]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #21: std::__invoke_r<>() [0x564bf75cc026]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #22: std::_Function_handler<>::_M_invoke() [0x564bf75cbe9b]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #23: std::function<>::operator()() [0x564bf25265d7]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #24: Envoy::Event::SchedulableCallbackImpl::SchedulableCallbackImpl()::$_0::operator()() [0x564bf7c955d9]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #25: Envoy::Event::SchedulableCallbackImpl::SchedulableCallbackImpl()::$_0::__invoke() [0x564bf7c955ac]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #26: event_process_active_single_queue [0x564bf82a3aa4]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #27: event_process_active [0x564bf8289ec2]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #28: event_base_loop [0x564bf8284125]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #29: Envoy::Event::LibeventScheduler::run() [0x564bf7c8ff7f]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #30: Envoy::Event::DispatcherImpl::run() [0x564bf75c5efb]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #31: Envoy::Server::WorkerImpl::threadRoutine() [0x564bf4276118]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #32: Envoy::Server::WorkerImpl::start()::$_6::operator()() [0x564bf4279f7e]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #33: std::__invoke_impl<>() [0x564bf4279f06]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #34: std::__invoke_r<>() [0x564bf4279eb6]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #35: std::_Function_handler<>::_M_invoke() [0x564bf4279c60]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #36: std::function<>::operator()() [0x564bf25265d7]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #37: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::operator()() [0x564bf838d9e9]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #38: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x564bf838d9bc]
[critical][backtrace] [external/envoy/source/server/backtrace.h:96] #39: start_thread [0x7fa9dddb8ea7]
[critical][assert] [external/envoy/source/common/conn_pool/conn_pool_base.cc:51] assert failure: isIdleImpl().

Any thoughts? cc @alyssawilk

I can add more debug prints to extract even more information; we have it happen a couple of times per day.

belyalov avatar Oct 20 '22 18:10 belyalov

@belyalov This looks like regular TCP proxy? Is there something else going on (CONNECT, xDS updates, ORIGINAL_DST) on the data path?

kyessenov avatar Oct 20 '22 18:10 kyessenov

Right, @kyessenov, it is a proxy with ORIGINAL_DST; simplified, it looks like this:

// Listener definition simplified
DefaultFilterChain: &listener.FilterChain{
  Name: "default",
  Filters: []*listener.Filter{
    // our extensions
    networkMetadataFilter(),
    networkContentFilteringFilter(),
    networkRatelimitFilter(),
    // standard tcp proxy
    TCPProxyFilter(params.OriginalDstCluster, params.TCPAccessLog),
  },
},

// TCP Proxy Filter
return &listener.Filter{
  Name: "envoy.filters.network.tcp_proxy",
  ConfigType: &listener.Filter_TypedConfig{
    TypedConfig: utils.MustMarshalAny(&tcp_proxy.TcpProxy{
      AccessLog: []*accesslogv3.AccessLog{
        tcpAccessLog(logFilename, false),
      },
      StatPrefix: "tcp_proxy",
      ClusterSpecifier: &tcp_proxy.TcpProxy_Cluster{
        Cluster: clusterName,
      },
      MaxConnectAttempts: wrapperspb.UInt32(2),
    }),
  },
}

// Cluster definition
cluster := &cluster.Cluster{
  Name: name,
  ClusterDiscoveryType: &cluster.Cluster_Type{
    Type: cluster.Cluster_ORIGINAL_DST,
  },
  CircuitBreakers: &cluster.CircuitBreakers{
    Thresholds: []*cluster.CircuitBreakers_Thresholds{
      {
        Priority: core.RoutingPriority_DEFAULT,
        MaxConnections: &wrapperspb.UInt32Value{
          Value: 16384 * 100,
        },
        TrackRemaining: true,
      },
      {
        Priority: core.RoutingPriority_HIGH,
        MaxConnections: &wrapperspb.UInt32Value{
          Value: 16384 * 100,
        },
        TrackRemaining: true,
      },
    },
  },
  LbPolicy:        cluster.Cluster_CLUSTER_PROVIDED,
  DnsLookupFamily: cluster.Cluster_V4_ONLY,
}

belyalov avatar Oct 20 '22 19:10 belyalov

Yeah, the stack trace is confusing me. Essentially it's if (isIdleImpl()) { doDeletionCode(); }, and doDeletionCode fails the isIdleImpl() check. Clearly there's some code between the original check and the final check that makes the statement no longer true, but I'm having trouble finding it via code inspection. Since you have a repro case, if I got you a patch with some increased logging, would you be able to apply it and get me logs?

alyssawilk avatar Oct 20 '22 20:10 alyssawilk

@alyssawilk I remember seeing this assertion violation when manually stress testing prior to https://github.com/envoyproxy/envoy/pull/22371. It went away for HTTP usage after that PR, but I see that this is still happening on 1.24. My guess is there is some pointer-address collision when synthesizing hosts rapidly. Having detailed info about when hosts get created in ORIGINAL_DST would be helpful.
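
A tiny standalone demo of such a collision (pure illustration; the map and its raw-pointer key are assumptions about the failure mode, not the Envoy code): a stale entry keyed by a dead host's address can alias a newly synthesized host, so idle/teardown bookkeeping can target the wrong pool.

#include <cstdio>
#include <map>
#include <memory>
#include <string>

// Toy demo of a pointer-address collision (assumed failure mode, not Envoy
// code): a pool map keyed by a raw Host pointer keeps a stale entry after the
// host dies; a new host allocated at the same address then aliases it.
struct Host { std::string address; };

int main() {
  std::map<const Host*, std::string> pool_for_host;

  auto a = std::make_unique<Host>(Host{"10.0.0.1:80"});
  pool_for_host[a.get()] = "pool-A (idle)";
  a.reset(); // host A destroyed, but its map entry is never erased

  auto b = std::make_unique<Host>(Host{"10.0.0.2:80"}); // often reuses A's address
  std::printf("entry found for new host: %s\n",
              pool_for_host.count(b.get()) ? pool_for_host[b.get()].c_str()
                                           : "(none)");
  // If the addresses collide, host B resolves to pool-A's stale entry.
}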

kyessenov avatar Oct 20 '22 20:10 kyessenov

If I got you a patch with some increased logging since you have a repro case would you be able to apply and get me logs?

yeah, absolutely, please go for it, happy to help, @alyssawilk

belyalov avatar Oct 20 '22 20:10 belyalov

OK, no patch yet (sorry, mostly focused on EnvoyCon), but I've at least got integration tests of basic original_dst cluster additions and removals for both TCP and HTTP, neither of which I think we had previously. @kyessenov, if you have ideas of how I could better tweak the tests I sent your way to repro the thrashing issues you were seeing, let me know; otherwise I'll try to add some sensible logging based on those tests and ship them over to @belyalov tomorrow.

alyssawilk avatar Oct 24 '22 20:10 alyssawilk

The thrashing issue was solved for HTTP in the merged PR (hosts are now reused more often). What made me suspect it was seeing identical memory addresses for hosts in the original_dst implementation; I just printed the host address there. I think that logging is already in place, actually.

kyessenov avatar Oct 26 '22 17:10 kyessenov

Oh awesome, thanks for adding the logging. Did you need both creation and deletion log points, or was creation alone sufficient?

alyssawilk avatar Oct 27 '22 19:10 alyssawilk

I think the deletion event is already printed by the cluster, as part of its host lifecycle management.

kyessenov avatar Oct 27 '22 20:10 kyessenov

In that case, @howardjohn, could you share full logs somewhere that @kyessenov and I have access to?

alyssawilk avatar Nov 01 '22 16:11 alyssawilk

logs1.txt logs2.txt logs3.txt

Here are 3 different instances. Let me know if you need something else

howardjohn avatar Nov 01 '22 21:11 howardjohn

Seems like it's in the TCP pool (log filtered to the connection):

2022-11-01T21:51:19.286022Z     trace   envoy main external/envoy/source/common/event/dispatcher_impl.cc:250    item added to deferred deletion list (size=3)   thread=45
2022-11-01T21:51:19.286040Z     trace   envoy filter external/envoy/source/common/network/listener_filter_buffer_impl.cc:95     onFileEvent: 1  thread=45
2022-11-01T21:51:19.286062Z     trace   envoy filter external/envoy/source/common/network/listener_filter_buffer_impl.cc:60     recv returned: 517      thread=45
2022-11-01T21:51:19.286081Z     trace   envoy filter external/envoy/source/extensions/filters/listener/http_inspector/http_inspector.cc:92      http inspector: http_parser parsed 0 chars, error code: 16      thread=45
2022-11-01T21:51:19.286090Z     trace   envoy filter external/envoy/source/extensions/filters/listener/http_inspector/http_inspector.cc:122     http inspector: done: false     thread=45
2022-11-01T21:51:19.286191Z     debug   envoy filter external/envoy/source/common/tcp_proxy/tcp_proxy.cc:199    [C720] new tcp proxy session    thread=45
2022-11-01T21:51:19.286229Z     trace   envoy connection external/envoy/source/common/network/connection_impl.cc:357    [C720] readDisable: disable=true disable_count=0 state=0 buffer_length=0        thread=45
2022-11-01T21:51:19.286279Z     debug   envoy filter external/envoy/source/common/tcp_proxy/tcp_proxy.cc:371    [C720] Creating connection to cluster InboundPassthroughClusterIpv4     thread=45
2022-11-01T21:51:19.286333Z     trace   envoy upstream external/envoy/source/common/upstream/original_dst_cluster.cc:53 Using existing host InboundPassthroughClusterIpv410.244.0.34:18083 10.244.0.34:18083.   thread=45
2022-11-01T21:51:19.286387Z     debug   envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:290 trying to create new connection thread=45
2022-11-01T21:51:19.286420Z     trace   envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:291 ConnPoolImplBase 0x55dc54b12280, ready_clients_.size(): 0, busy_clients_.size(): 0, connecting_clients_.size(): 0, connecting_stream_capacity_: 0, num_active_streams_: 0, pending_streams_.size(): 1 per upstream preconnect ratio: 1 thread=45
2022-11-01T21:51:19.286443Z     debug   envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:145 creating a new connection (connecting=0)        thread=45
2022-11-01T21:51:19.286545Z     debug   envoy connection external/envoy/source/common/network/connection_impl.cc:924    [C722] connecting to 10.244.0.34:18083  thread=45
2022-11-01T21:51:19.286624Z     debug   envoy connection external/envoy/source/common/network/connection_impl.cc:943    [C722] connection in progress   thread=45
2022-11-01T21:51:19.286640Z     trace   envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:131 not creating a new connection, shouldCreateNewConnection returned false.        thread=45
2022-11-01T21:51:19.286661Z     debug   envoy conn_handler external/envoy/source/server/active_tcp_listener.cc:147      [C720] new connection from 10.244.0.27:57714    thread=45
2022-11-01T21:51:19.286675Z     trace   envoy main external/envoy/source/common/event/dispatcher_impl.cc:250    item added to deferred deletion list (size=4)   thread=45
2022-11-01T21:51:19.286696Z     trace   envoy main external/envoy/source/common/event/dispatcher_impl.cc:125    clearing deferred deletion list (size=4)        thread=45
2022-11-01T21:51:19.286711Z     debug   envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:214 [C717] destroying stream: 0 remaining   thread=45
2022-11-01T21:51:19.286728Z     debug   envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:453 invoking idle callbacks - is_draining_for_deletion_=false       thread=45
2022-11-01T21:51:19.286748Z     trace   envoy upstream external/envoy/source/common/upstream/cluster_manager_impl.cc:1796       Idle pool, erasing pool for host InboundPassthroughClusterIpv410.244.0.34:18083 thread=45
2022-11-01T21:51:19.286758Z     critical        envoy assert external/envoy/source/common/conn_pool/conn_pool_base.cc:38        assert failure: isIdleImpl().   thread=45

kyessenov avatar Nov 03 '22 18:11 kyessenov

Looking at this, one thing I notice is that each upstream connection seems to trigger a new inbound connection. @howardjohn, can you talk me through your data flow here? My integration tests used original_dst with a port override, so the traffic didn't go through the Envoy twice; I wonder if that's why my test case didn't repro and there's some reentrant-style bug.

alyssawilk avatar Nov 03 '22 20:11 alyssawilk

I am not quite sure I follow. Wouldn't that mean there is a loop?

The request flow should just be app -> outbound sidecar -> inbound sidecar -> other app. The test is basically sending a bunch of different requests between client+server pairs. Looking more closely, though, there is one pair where the app can be both the client and the server: log1.txt. Note I am not sure the crash happens when the app is calling itself, just that at some point it does. log2 and log3 shouldn't involve an app calling itself, though.

howardjohn avatar Nov 03 '22 20:11 howardjohn

Sorry, by loop I meant: does the request traverse Envoy twice? It looked to me like in log1, creating an upstream connection on your Envoy resulted in creating a downstream connection on the same Envoy (unless I'm misreading?).

alyssawilk avatar Nov 07 '22 14:11 alyssawilk

Yes, in log1.txt the app can be calling itself. So we have app -> outbound envoy -> inbound envoy (the same envoy instance!) -> app (the same app!).

For the other logs, this shouldn't be happening.

howardjohn avatar Nov 07 '22 15:11 howardjohn

I closed https://github.com/envoyproxy/envoy/issues/24573 to consolidate here, but FYI there is another similar assertion failure, likely with the same root cause per Kuat.

howardjohn avatar Dec 16 '22 20:12 howardjohn

Bumping this thread:

I've added some debug prints, and it seems like it is failing on this assert:

https://github.com/envoyproxy/envoy/blob/1d8868797e4b2720e8681cba84191b33b730e3f8/source/common/conn_pool/conn_pool_base.cc#L39

[2023-07-21 17:26:37.490][16][critical][pool] [external/envoy/source/common/conn_pool/conn_pool_base.cc:44] !!! deleteIsPendingImpl() assertion: false 1 false
[2023-07-21 17:26:37.490][16][critical][pool] [external/envoy/source/common/conn_pool/conn_pool_base.cc:46] state: ConnPoolImplBase 0x5c0ee38aa00, ready_clients_.size(): 0, busy_clients_.size(): 0, connecting_clients_.size(): 1, connecting_stream_capacity_: 1, num_active_streams_: 0, pending_streams_.size(): 1 per upstream preconnect ratio: 1
[2023-07-21 17:26:37.490][16][critical][backtrace] [external/envoy/source/server/backtrace.h:104] Caught ASSERT_CONN_POOL, suspect faulting address 0x0
[2023-07-21 17:26:37.490][16][critical][backtrace] [external/envoy/source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2023-07-21 17:26:37.490][16][critical][backtrace] [external/envoy/source/server/backtrace.h:92] Envoy version: 7583b80/1.26.3/Clean/DEBUG/BoringSSL
[2023-07-21 17:26:37.529][16][critical][backtrace] [external/envoy/source/server/backtrace.h:96] #0: Envoy::ConnectionPool::ConnPoolImplBase::deleteIsPendingImpl() [0x563605997519]
[2023-07-21 17:26:37.565][16][critical][backtrace] [external/envoy/source/server/backtrace.h:96] #1: Envoy::Tcp::ConnPoolImpl::deleteIsPending() [0x563605371216]
[2023-07-21 17:26:37.605][16][critical][backtrace] [external/envoy/source/server/backtrace.h:96] #2: Envoy::Event::DispatcherImpl::deferredDelete() [0x563606443621]
[2023-07-21 17:26:37.648][16][critical][backtrace] [external/envoy/source/server/backtrace.h:96] #3: Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::tcpConnPoolIsIdle() [0x56360535eb47]

I'm not sure what it means; I'm going to spend a bit more time on understanding this.

Any help appreciated :)

belyalov avatar Jul 21 '23 18:07 belyalov