envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Merging implementations of Strict and Logical DNS

Open grnmeira opened this issue 1 year ago • 23 comments

Commit Message: Additional Description:

The main idea here is tackling #37145 (which is a follow up from #36353).

This approach simplifies the code merging the two current implementation of DNS cluster (logical and strict) by removing unnecessary logical DNS, as it can be treated as a special case of strict DNS.

According to the documentation:

When using strict DNS service discovery, Envoy will continuously and asynchronously resolve the specified DNS targets. Each returned IP address in the DNS result will be considered an explicit host in the upstream cluster. This means that if the query returns three IP addresses, Envoy will assume the cluster has three hosts, and all three should be load balanced to. If a host is removed from the result Envoy assumes it no longer exists and will drain traffic from any existing connection pools

Logical DNS uses a similar asynchronous resolution mechanism to strict DNS. However, instead of strictly taking the results of the DNS query and assuming that they comprise the entire upstream cluster, a logical DNS cluster only uses the first IP address returned when a new connection needs to be initiated. Thus, a single logical connection pool may contain physical connections to a variety of different upstream hosts. Connections are never drained, including on a successful DNS resolution that returns 0 hosts.

Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]

grnmeira avatar Feb 18 '25 13:02 grnmeira

As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/38483 was opened by grnmeira.

see: more, trace.

@markdroth if you're interested, as this is a follow up from #36353.

grnmeira avatar Feb 18 '25 16:02 grnmeira

/wait Looks like CI is failing?

RyanTheOptimist avatar Mar 18 '25 00:03 RyanTheOptimist

I'm taking over this temporarily, will take a look when I have time

Stevenjin8 avatar Mar 25 '25 13:03 Stevenjin8

A high level comment first: How does this change impact Envoy Happy-Eyeball support: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/connection_pooling#arch-overview-happy-eyeballs

yanjunxiang-google avatar Mar 25 '25 20:03 yanjunxiang-google

@yanjunxiang-google I'm moving this to my backlog, so it my be a while before I'll take a look. Sorry about that!

Stevenjin8 avatar Mar 28 '25 18:03 Stevenjin8

/wait

Code coverage for source/extensions/clusters/common is lower than limit of 96.6 (88.7)

ravenblackx avatar Apr 24 '25 16:04 ravenblackx

DnsCluster wasn't using LogicalHost, which impacted not only coverage, but also the use of Happy Eyeballs (@yanjunxiang-google). Hopefully with these most recent changes we shouldn't have these problems. I'm still testing it locally to make sure Happy Eyeballs is Happy 🙃

grnmeira avatar May 02 '25 12:05 grnmeira

Is this ready for maintainer review?

jmarantz avatar May 05 '25 13:05 jmarantz

@jmarantz yes, this is ready for maintainer review.

grnmeira avatar May 06 '25 16:05 grnmeira

LGTM modulo CI failure

yanjunxiang-google avatar May 08 '25 20:05 yanjunxiang-google

/wait

jmarantz avatar May 09 '25 15:05 jmarantz

/retest

grnmeira avatar May 09 '25 20:05 grnmeira

LGTM modulo CI failure

I think it was just some hiccups with integration tests. Should be good now.

grnmeira avatar May 12 '25 09:05 grnmeira

LGTM

yanjunxiang-google avatar May 14 '25 00:05 yanjunxiang-google

/assign-from @envoyproxy/senior-maintainers

kyessenov avatar May 14 '25 17:05 kyessenov

@envoyproxy/senior-maintainers assignee is @mattklein123

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/38483#issuecomment-2881021095 was created by @kyessenov.

see: more, trace.

@wbpcode Thanks for the thorough review. I tried to address your suggestions the best I could. I totally see the problem with logical hosts and the updates based on the first address returned in the response. I've created a unit test for that which can reproduce the issue and now is of course passing after implementing your suggestion.

grnmeira avatar Jun 05 '25 15:06 grnmeira

@wbpcode Thanks for the thorough review. I tried to address your suggestions the best I could. I totally see the problem with logical hosts and the updates based on the first address returned in the response. I've created a unit test for that which can reproduce the issue and now is of course passing after implementing your suggestion.

Thanks for the update. And I will take a look again at weekend or next monday.

wbpcode avatar Jun 06 '25 07:06 wbpcode

PTAL when you get a chance @wbpcode

KBaichoo avatar Jun 10 '25 19:06 KBaichoo

/retest

grnmeira avatar Jun 13 '25 15:06 grnmeira

/wait

wbpcode avatar Jun 14 '25 04:06 wbpcode

I'm working now on including the feature flag and restoring the previous implementations.

grnmeira avatar Jun 19 '25 10:06 grnmeira

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/38483 was synchronize by grnmeira.

see: more, trace.

I'll be unavailable for the next few days. There's still some details to be sorted in the tests mainly. I'll be back from the 30th of June.

grnmeira avatar Jun 22 '25 12:06 grnmeira

/wait-any

Waiting for addressing comments and conflicts

botengyao avatar Jun 25 '25 15:06 botengyao

Thanks for the update. There are lots of new commit? Is this ready for review? And we may prefer to land this after 1.35 is released (in next one week ideally).

wbpcode avatar Jul 07 '25 02:07 wbpcode

Thanks for the update. There are lots of new commit? Is this ready for review? And we may prefer to land this after 1.35 is released (in next one week ideally).

Yes, it was a significant change. I had to re-introduce the old implementations. So there were lots of changes mostly in tests.

grnmeira avatar Jul 07 '25 08:07 grnmeira

@wbpcode I had some issues with coverage (as I mistakenly hadn't parametrized some of the unit tests). Now CI is passing, we should be good for a review again. Thank you.

grnmeira avatar Jul 07 '25 12:07 grnmeira

gentle ping @wbpcode

botengyao avatar Jul 25 '25 18:07 botengyao