
original_dst: fix leak of hosts in pools


Signed-off-by: Kuat Yessenov [email protected]

Commit Message: Address memory leak in concurrent host additions in ORIGINAL_DST clusters.

Additional Description: This is an attempt to avoid a global mutex on the cluster. The idea is to collect synthetic hosts for the same address and manage them collectively. Unfortunately, another issue is present: GC might delete a host and post a drain request before the connection to that host is fully established, violating an assertion (GC sees an unset used_ flag right before chooseHost returns the host). Manual testing shows this does fix the original memory leak, but Envoy crashes under load due to the (non-release) assertion.

Risk Level: High
Testing: Manual so far; fixes the memory leak but another issue was uncovered.
Docs Changes: None
Release Notes: None
Fixes: #22218

kyessenov avatar Jul 22 '22 19:07 kyessenov
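For context, a minimal sketch of the "collect synthetic hosts for the same address and manage them collectively" idea from the description above. The names (Host, SyntheticHostBuckets) and the bare used_ flag are simplified stand-ins for illustration, not the actual Envoy classes this PR touches:

```cpp
// Sketch: bucket synthetic hosts per address so that concurrent additions of
// the same address are tracked together instead of shadowing (and leaking)
// each other.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Host {
  explicit Host(std::string address) : address_(std::move(address)) {}
  std::string address_;
  bool used_{true};  // stand-in for the real per-host "used" flag
};
using HostSharedPtr = std::shared_ptr<Host>;

class SyntheticHostBuckets {
public:
  // Record a host created by a worker. Multiple hosts for the same address
  // land in the same bucket rather than overwriting one another.
  void add(const HostSharedPtr& host) { buckets_[host->address_].push_back(host); }

  // Collectively drop every host for addresses that are no longer used.
  // Returns the hosts that should be drained.
  std::vector<HostSharedPtr> removeUnused() {
    std::vector<HostSharedPtr> removed;
    for (auto it = buckets_.begin(); it != buckets_.end();) {
      bool any_used = false;
      for (const auto& host : it->second) {
        any_used |= host->used_;
      }
      if (!any_used) {
        removed.insert(removed.end(), it->second.begin(), it->second.end());
        it = buckets_.erase(it);
      } else {
        for (const auto& host : it->second) {
          host->used_ = false;  // re-marked by the LB if the address is chosen again
        }
        ++it;
      }
    }
    return removed;
  }

private:
  std::map<std::string, std::vector<HostSharedPtr>> buckets_;
};
```

The point is only that duplicates created by racing workers end up in the same bucket, so cleanup can see and drain all of them instead of leaking the ones that lost the race.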

As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/22371 was opened by kyessenov.


CC @wbpcode @mattklein123. Need more thought on how to deal with racy "new connection" / "drain connection" requests.

kyessenov avatar Jul 22 '22 19:07 kyessenov

I cannot find what actually protects hosts from being removed/drained while connecting in other types of clusters. I found lifetimeCallbacks, which appear not to actually be used anywhere.

kyessenov avatar Jul 22 '22 22:07 kyessenov

@alyssawilk do you have any suggestions on how to properly handle host removals while connecting? With ORIGINAL_DST this can happen very rapidly, and I hit the race where drain is called before the connection is fully established.

kyessenov avatar Jul 26 '22 04:07 kyessenov

Without having looked at the code, would it make sense to count the host as in-use, or add some other guard against reaping, for some multiple (e.g. 2x) of the cluster connect timeout?

alyssawilk avatar Jul 27 '22 16:07 alyssawilk
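A rough sketch of that guard, assuming a hypothetical safeToReap helper and a plain steady_clock timestamp per host; Envoy's real code would use its TimeSource and the configured connect timeout:

```cpp
// Skip reaping any host created less than two connect timeouts ago, so an
// in-flight connection attempt cannot have its host removed underneath it.
#include <chrono>

struct Host {
  std::chrono::steady_clock::time_point created_at_;
  bool used_{false};
};

bool safeToReap(const Host& host, std::chrono::milliseconds connect_timeout,
                std::chrono::steady_clock::time_point now) {
  const auto grace = 2 * connect_timeout;  // the "some multiple (2)" above
  return !host.used_ && (now - host.created_at_) > grace;
}
```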

That should work as long as the connect timeout is O(s), not O(ms). I'm a bit worried that relying on the wall clock adds uncertainty (if the CPU is starved, we might hit GC and the timeout simultaneously). Do you know how EDS protects against immediate host removal?

kyessenov avatar Jul 27 '22 16:07 kyessenov

I thought with EDS we kept the hosts around until removal, even if unused? cc @adisuissa

alyssawilk avatar Jul 27 '22 17:07 alyssawilk

A couple of experiments:

  • I can't make EDS run fast enough to trigger this issue. It requires two quick updates within microseconds: one for a request to pick an upstream host and prepare the request, and a second that drains that upstream host.
  • This is not an issue with tcp_proxy, just HCM with an original_dst upstream.

kyessenov avatar Aug 03 '22 20:08 kyessenov

That should work as long as the connect timeout is O(s), not O(ms). I'm a bit worried that relying on the wall clock adds uncertainty (if the CPU is starved, we might hit GC and the timeout simultaneously). Do you know how EDS protects against immediate host removal?

The EDS cluster type maintains a host vector mainly for the LB to pick from, and each subsequent EDS update determines which hosts are kept alive (and which are deleted).

The original_dst cluster doesn't have that luxury, so a timer was used. Alternatively it could be weak_ptr style: a host map typed as map<TCP_ADDRESS, weak_ptr<Host>>, with a non-trivial host pointer deleter.

lambdai avatar Aug 04 '22 23:08 lambdai
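A minimal sketch of the weak_ptr-style map lambdai describes, with simplified stand-in types rather than Envoy's actual Host/cluster classes; the non-trivial part (noted in the comment) is that the real deleter would have to post the erase back to the main thread:

```cpp
// The cluster holds only weak references; the shared_ptr handed to workers
// carries a custom deleter that erases the map entry when the last worker
// reference goes away.
#include <map>
#include <memory>
#include <string>

struct Host {
  explicit Host(std::string address) : address_(std::move(address)) {}
  std::string address_;
};

class WeakHostMap {
public:
  // Returns an existing live host for the address, or creates a new one whose
  // deleter removes the entry. In real code the erase would need to be posted
  // to the main thread, and the map itself must outlive all hosts.
  std::shared_ptr<Host> getOrCreate(const std::string& address) {
    auto it = hosts_.find(address);
    if (it != hosts_.end()) {
      if (auto existing = it->second.lock()) {
        return existing;
      }
    }
    std::shared_ptr<Host> host(new Host(address), [this, address](Host* h) {
      hosts_.erase(address);
      delete h;
    });
    hosts_[address] = host;
    return host;
  }

private:
  std::map<std::string, std::weak_ptr<Host>> hosts_;
};
```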

I think the crash is actually due to re-use of memory addresses for hosts! Because we create so many synthetic hosts, addresses happen to be re-used:

[2022-08-10 19:04:14.599][25911][debug][upstream] [source/common/upstream/original_dst_cluster.cc:67] Created host 127.0.0.1:8001 0x147d7fd61be8.
[2022-08-10 19:04:14.599][25911][trace][pool] [source/common/conn_pool/conn_pool_base.cc:110] TryCreateNewConnections 0x147d7fd61be8
[2022-08-10 19:04:14.600][25904][trace][upstream] [source/common/upstream/original_dst_cluster.cc:201] Remove host 0x147d7fd61be8
[2022-08-10 19:04:14.600][25911][debug][pool] [source/common/conn_pool/conn_pool_base.cc:326] onUpstreamReady 0x147d7fd61be8
[2022-08-10 19:04:14.601][25911][trace][pool] [source/common/conn_pool/conn_pool_base.cc:110] TryCreateNewConnections 0x147d7fd61be8
[2022-08-10 19:04:14.603][25911][trace][pool] [source/common/conn_pool/conn_pool_base.cc:110] TryCreateNewConnections 0x147d7fd61be8
[2022-08-10 19:04:16.424][25912][debug][upstream] [source/common/upstream/original_dst_cluster.cc:67] Created host 127.0.0.1:8001 0x147d7f161be8.
[2022-08-10 19:04:16.424][25912][trace][pool] [source/common/conn_pool/conn_pool_base.cc:110] TryCreateNewConnections 0x147d7f161be8
[2022-08-10 19:04:16.424][25904][trace][upstream] [source/common/upstream/original_dst_cluster.cc:201] Remove host 0x147d7f161be8
[2022-08-10 19:04:16.425][25912][debug][pool] [source/common/conn_pool/conn_pool_base.cc:326] onUpstreamReady 0x147d7f161be8
[2022-08-10 19:04:16.426][25912][trace][upstream] [source/common/upstream/cluster_manager_impl.cc:1742] Erasing idle pool for host 0x147d7f161be8
[2022-08-10 19:04:16.426][25912][trace][upstream] [source/common/upstream/cluster_manager_impl.cc:1749] Pool container empty for host 0x147d7f161be8, erasing host entry

There are two hosts created within two seconds with the same memory address. I think this somehow corrupts some data structure, which then violates the draining invariant.

kyessenov avatar Aug 11 '22 02:08 kyessenov

Ok, I think the right way to fix this is as follows:

  • we keep this fix for the leak caused by concurrent addHost calls for the same address;
  • we remove the logic to forcefully drain on cleanup; this is surprising and contradicts https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/service_discovery#arch-overview-service-discovery-types-original-destination. Moreover, the crash I am seeing means that sometimes drain/tryNewConnection takes longer than creating two synthetic hosts at the same memory address, which messes up the internal maps keyed by &host;
  • we switch to removing hosts from the per-thread cluster on idle, like EDS; we recommend setting idle_timeout instead for HTTP;
  • we find a way to keep synthetic hosts in memory for the cleanup duration for GC purposes; I'm not quite sure of the best way, and weak_ptr alone is not enough. Do we even need this pass if we rely on the HTTP idle timeout?

kyessenov avatar Aug 11 '22 21:08 kyessenov

@lambdai @mattklein123 hope the plan above sounds good?

kyessenov avatar Aug 11 '22 21:08 kyessenov

I just spent some time paging this code back in. Sorry for the delay.

At a high level, I'm not sure all this atomic stuff is worthwhile. We already have a lock in there and grab it when mutating the host map, which actually requires copying the entire thing every time we add/remove a host. If we really have a super high rate of change, my suspicion is that the copying will dwarf the lock hold time for map changes. This is ultimately what we saw with DFP. If we don't have a high rate of change, I don't think lock versus copy matters. Perhaps we should just move to a global lock since it's much simpler, and then make it better later if it's really needed?

Now, per your analysis (nice work!), I agree that cleanup() is just broken and breaks a lot of other invariants. I agree we should just remove that functionality entirely and rely on idle timeouts to do the cleanups. Perhaps for back compat we could somehow read the cleanup interval and use it to auto-configure the HTTP idle timeout for HTTP cases? I'm not sure about TCP and other cases: how would that work so we avoid leaks? Do we need to keep it there, which seems bad? In general, though, I agree that if we remove cleanup() and rely on the existing cluster manager idle functionality, we don't need to deal with the reaping issue. (I haven't looked at the code in a while, but at its core, IIRC, there are maps in the cluster manager that use the raw pointer of a host as a key. It's possible you could figure out a way to use a weak_ptr somehow, but it would be tricky for sure.)

mattklein123 avatar Aug 12 '22 15:08 mattklein123
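For illustration, a sketch of the lock-plus-copy pattern being weighed here, with simplified stand-in types rather than Envoy's real host map; it shows why the full-map copy on every add/remove, rather than the lock hold time, is what dominates at a high rate of change:

```cpp
// Readers take an immutable snapshot; writers copy the whole map under a
// single mutex, mutate the copy, and publish it.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Host {
  explicit Host(std::string address) : address_(std::move(address)) {}
  std::string address_;
};
using HostMap = std::map<std::string, std::shared_ptr<Host>>;
using HostMapConstSharedPtr = std::shared_ptr<const HostMap>;

class LockedHostMap {
public:
  LockedHostMap() : map_(std::make_shared<const HostMap>()) {}

  // Readers get an immutable snapshot; safe to use after the lock is released.
  HostMapConstSharedPtr snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return map_;
  }

  // Writers copy the entire map, mutate the copy, and swap it in. The copy is
  // the dominant cost when hosts churn quickly.
  void addHost(const std::shared_ptr<Host>& host) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto updated = std::make_shared<HostMap>(*map_);
    (*updated)[host->address_] = host;
    map_ = updated;
  }

private:
  mutable std::mutex mutex_;
  HostMapConstSharedPtr map_;
};
```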

I could imagine short TCP connections could benefit from pre-constructed synthetic hosts. But it's not worth it IMHO: the cost of making a new host is small, and having weak_ptr handle deletion is a lot easier to reason about.

kyessenov avatar Aug 12 '22 16:08 kyessenov

But it's not worth it IMHO: the cost of making a new host is small, and having weak_ptr handle deletion is a lot easier to reason about.

Well, the issue is that the cluster itself stores the host with a strong pointer, so I don't think it would ever get deleted, unless you somehow make the cluster itself only store weak_ptrs, but I'm not sure that would work. I just don't recall enough of the details here without spending a bunch more time. If it would help, I'm happy to video chat to brainstorm solutions.

mattklein123 avatar Aug 12 '22 16:08 mattklein123

Or are you suggesting just not storing the hosts in the cluster at all for TCP since there are no conn pools? Maybe that would work but I'm not sure how we would switch on and off that behavior without it being annoying to configure.

mattklein123 avatar Aug 12 '22 16:08 mattklein123

Yes, you are right. PrioritySet holds a strong pointer, and no matter how we do the weak_ptr-to-host dance, it assumes too much about the number of active shared pointers held strongly. I don't really see how to register a drain event without an explicit callback, so maybe that's what's really needed by ORIGINAL_DST. At a high level it seems to make sense that for a worker to "borrow" a host means signaling both creation and destruction.

kyessenov avatar Aug 31 '22 18:08 kyessenov

I have a proposal to improve ORIGINAL_DST in this PR. It goes as follows:

  • We extend the cluster manager to keep a reference-counted token while it has active pools.
  • The ORIGINAL_DST cluster will keep a list of (host, host-token weak_ptr) pairs for the hosts it gives to the cluster manager.
  • When a host token expires, the host is deleted from the priority set on the main thread. This is done at both addHost and cleanup time. The cleanup is purely for bookkeeping when a particular address is no longer chosen.

I think the host token is generally useful even if we go with a locking solution. Without locks, I think it's still possible to create two connections to the same host on one worker (if addHost races for the same address and then one worker chooses another worker's host). Whether or not we start locking the host collection, we need to know when it is safe to delete them.

kyessenov avatar Sep 01 '22 21:09 kyessenov
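A sketch of the host-token proposal, using hypothetical names and a plain vector in place of Envoy's real data structures, with a main-thread pass standing in for posting to the dispatcher; the token's only job is to expire once the cluster manager drops its last pool for the host:

```cpp
// The cluster manager holds the token strongly while any pool for the host
// exists; the cluster keeps only a weak_ptr and treats an expired token as
// "safe to remove the host from the priority set".
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Host {
  explicit Host(std::string address) : address_(std::move(address)) {}
  std::string address_;
};
using HostSharedPtr = std::shared_ptr<Host>;

// Token kept alive by the cluster manager for as long as it has active pools.
struct HostToken {};
using HostTokenSharedPtr = std::shared_ptr<HostToken>;

class OriginalDstTracker {
public:
  // Called when a host is handed to the cluster manager; returns the token the
  // cluster manager should retain alongside its pools for this host.
  HostTokenSharedPtr track(HostSharedPtr host) {
    auto token = std::make_shared<HostToken>();
    tracked_.emplace_back(std::move(host), token);
    return token;
  }

  // Run on the main thread (at addHost and cleanup time in the proposal):
  // hosts whose token has expired have no pools left and can be removed.
  std::vector<HostSharedPtr> takeRemovable() {
    std::vector<HostSharedPtr> removable;
    for (auto it = tracked_.begin(); it != tracked_.end();) {
      if (it->second.expired()) {
        removable.push_back(std::move(it->first));
        it = tracked_.erase(it);
      } else {
        ++it;
      }
    }
    return removable;
  }

private:
  std::vector<std::pair<HostSharedPtr, std::weak_ptr<HostToken>>> tracked_;
};
```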

At a high level this seems like a good path forward, but I need to spend more time on this next week. I'm mostly OOO today. Can you fix CI and I can look on Tuesday? Thank you.

/wait

mattklein123 avatar Sep 02 '22 16:09 mattklein123

Thanks, this can wait. The CI breakage is mostly just the change in signature in load_balancer.h (which would be a slog to modify, so I wanted some confidence that we should do that). Currently, I'm mostly load testing with a full Envoy.

kyessenov avatar Sep 02 '22 16:09 kyessenov

@mattklein123 do you still want me to fix all the tests first? Do you think the interface change to load_balancer.h is good?

kyessenov avatar Sep 09 '22 17:09 kyessenov

Sorry I forgot about this. I will look more today.

mattklein123 avatar Sep 09 '22 17:09 mattklein123

Discussed this offline with @kyessenov and we iterated to a slightly different solution (but still similar) that will hopefully make this PR smaller and a bit more targeted. Thank you for iterating!

/wait

mattklein123 avatar Sep 09 '22 21:09 mattklein123