ztunnel icon indicating copy to clipboard operation
ztunnel copied to clipboard

optimize the function load_balance_for_hostname to avoid starting multiple DNS tasks

Open zhlsunshine opened this issue 2 years ago • 26 comments

optimize the function load_balance_for_hostname to avoid starting multiple background on-demand DNS tasks.

zhlsunshine avatar Aug 30 '23 11:08 zhlsunshine

😊 Welcome @zhlsunshine! This is either your first contribution to the Istio ztunnel repo, or it's been awhile since you've been here.

You can learn more about the Istio working groups, code of conduct, and contributing guidelines by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

istio-policy-bot avatar Aug 30 '23 11:08 istio-policy-bot

Hi @howardjohn & @hzxuzhonghu, is there any comment for this PR? thanks!

zhlsunshine avatar Sep 07 '23 10:09 zhlsunshine

Will take a look

hzxuzhonghu avatar Sep 07 '23 12:09 hzxuzhonghu

I donot fully understand the idea.

I think we just need CachedResolvedDNSMap which stores the hash map of hostname to SingleFlight

The single flight only need to store the Notify

Hi @hzxuzhonghu, I think I can give more explains here for these 2 AtomicBools: Although state is Arc<RwLock<ProxyState>>, all the reading and writing operations themselves for ProxyState are atomic just in the code block, such as read CachedResolvedDNSMap via get_cached_resolve_dns_for_hostname and write CachedResolvedDNSMap via set_cached_resolve_dns_for_hostname, so we need these 2 AtomicBools to make sure that only the first arrive request to write the hashmap, and others need to wait for the first completed.

zhlsunshine avatar Sep 08 '23 04:09 zhlsunshine

Hi @howardjohn & @hzxuzhonghu & @nmittler, is there anything that I can do to make it approved? Thanks!

zhlsunshine avatar Sep 18 '23 09:09 zhlsunshine

In addition, we can have a test for it

Hi @hzxuzhonghu, sure, I added a concurrency requests test named build_concurrency_requests_unknown_dest for this change!

zhlsunshine avatar Sep 19 '23 04:09 zhlsunshine

Hi @hzxuzhonghu & @howardjohn, I have removed 2 AtomicBool fields is_resolved_dns_running and is_resolved_dns_completed for SingleFlight and use another fields calling_index to record the request index, then trigger the DNS resolving task only when calling_index is 1 (the first request arriving). Could you please help to review it? Thanks!

zhlsunshine avatar Oct 06 '23 05:10 zhlsunshine

Hi @hzxuzhonghu, new submit for answering your question: when will we cleanup the map? I add initial_query and dns_refresh_rate for this map which is similar with the hashmap of by_hostname. But the refresh rate is 1 hour by default. Please let me know if you have better suggestion. Thanks.! ^_^

zhlsunshine avatar Oct 08 '23 03:10 zhlsunshine

Need to think about this, tbh i donot think this is a good way

Hi @hzxuzhonghu, there is no standard for the refresh rate, am I right? I also notice that the expire time is 60 seconds by default for Hashmap of by_hostname, should I set by following that?

zhlsunshine avatar Oct 08 '23 06:10 zhlsunshine

Hi @hzxuzhonghu, as discussing together, basically the ip addresses would be stored in by_hostname once the calling of resolve_on_demand_dns is completed, so it's unnecessary to refresh rate for the elemnts in sf_by_hostname.

zhlsunshine avatar Oct 08 '23 07:10 zhlsunshine

Hi @hzxuzhonghu, I have changed the notification mechanism from tokio::sync::Notify to Mutex and Condvar which can support multiple waiters and single notifier. And the basic logic is not changed.

zhlsunshine avatar Oct 08 '23 10:10 zhlsunshine

Hi @hzxuzhonghu any update? Thanks!

zhlsunshine avatar Oct 11 '23 01:10 zhlsunshine

Hi @nmittler, @howardjohn and @hzxuzhonghu, PTAL this new commits, thanks!

zhlsunshine avatar Oct 19 '23 04:10 zhlsunshine

Even though you switched to tokio structures, I still don't think this approach will work. You're still performing the wait within the ProxyState, which is itself locked by a mutex (see the call site). This will still cause thread blocking because we're using a regular sync.Mutex (which I don't think we should change).

Instead, perhaps find_desination should return a future? Then the caller can unlock the ProxyState mutex before waiting on the result.

Hi @nmittler, I changed the wait_for_notifying to an async function, and call it with await, I believe that wait_for_notifying would yield to executor once it is blocked, therefore, the whold thread would not be blocked, is it okay?

zhlsunshine avatar Oct 20 '23 03:10 zhlsunshine

Hi @howardjohn, @hzxuzhonghu and @nmittler, there is a change for this PR based on @nmittler 's suggestion, creating a new struct named DnsResolver to holds all DNS resolving relates to simplify the ProxyState, and DnsResolver is part feature of DemandProxyState. Could you please help to review it again? Thanks in advance!

zhlsunshine avatar Nov 03 '23 08:11 zhlsunshine

Hi @nmittler, as we discussed, it becomes more simplicity.

zhlsunshine avatar Nov 03 '23 15:11 zhlsunshine

@zhlsunshine I created https://github.com/istio/ztunnel/pull/722, which re-works this a bit based on some things we discussed. PTAL.

nmittler avatar Nov 03 '23 21:11 nmittler

@zhlsunshine could we step back for a moment and discuss the motivation for this PR? Are you seeing a real issue with DNS performance, specifically due to concurrent requests for the same host? I just want to make sure we're solving a real problem here.

nmittler avatar Nov 06 '23 13:11 nmittler

@zhlsunshine could we step back for a moment and discuss the motivation for this PR? Are you seeing a real issue with DNS performance, specifically due to concurrent requests for the same host? I just want to make sure we're solving a real problem here.

Hi @nmittler, frankly, I'm focusing on some validations of ambient mesh in CSP environment in recent, what I'm sure about is that the original code has the problem: there may be multiple DNS resolving for the same hostname in concurrent requests.
BTW, I have no more comment for your rework PR#722. And I also insists on that you should be a co-author of this PR thanks to your help and comments. ^_^

zhlsunshine avatar Nov 07 '23 06:11 zhlsunshine

@zhlsunshine just to be clear, you haven't actually seen performance problems in a deployed system? My concern is fundamentally about the complexity of the solution. I'd like to see that the problem is real before we optimize for it.

@howardjohn @hzxuzhonghu any thoughts?

nmittler avatar Nov 09 '23 16:11 nmittler

I agree with Nate

On Thu, Nov 9, 2023 at 10:14 AM Nathan Mittler @.***> wrote:

@zhlsunshine https://github.com/zhlsunshine just to be clear, you haven't actually seen performance problems in a deployed system? My concern is fundamentally about the complexity of the solution. I'd like to see that the problem is real before we optimize for it.

@howardjohn https://github.com/howardjohn @hzxuzhonghu https://github.com/hzxuzhonghu any thoughts?

— Reply to this email directly, view it on GitHub https://github.com/istio/ztunnel/pull/672#issuecomment-1804130734, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEYGXKQHA3WWYB256NJ5ATYDT6MTAVCNFSM6AAAAAA4EL76HWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBUGEZTANZTGQ . You are receiving this because you were mentioned.Message ID: @.***>

howardjohn avatar Nov 10 '23 01:11 howardjohn

@zhlsunshine just to be clear, you haven't actually seen performance problems in a deployed system? My concern is fundamentally about the complexity of the solution. I'd like to see that the problem is real before we optimize for it.

@howardjohn @hzxuzhonghu any thoughts?

Hi @nmittler, yes, I haven't. But multiple requests with the same hostname given in concurrency and whose resolved IP addresses are not in the DNS resolved cache, then this situation leads to the same number of times of DNS resolving operations for the same hostname, it's really waste, so this optimization should be valid even we do not test the performance for it.

zhlsunshine avatar Nov 10 '23 02:11 zhlsunshine

@zhlsunshine

Hi @nmittler, yes, I haven't. But multiple requests with the same hostname given in concurrency and whose resolved IP addresses are not in the DNS resolved cache, then this situation leads to the same number of times of DNS resolving operations for the same hostname, it's really waste, so this optimization should be valid even we do not test the performance for it.

I understand what you're trying to optimize for. What I'm suggesting is that it may not be a huge issue in practice. I'd rather not add this type of complexity until we actually have a customer that is being impacted.

nmittler avatar Nov 10 '23 17:11 nmittler

One thing to note is the behavior currently is equal to without DNS proxy. So it's not like this is fixing a regression, purely a performance benefit , and the intent of the proxy is not performance.

On Fri, Nov 10, 2023, 9:42 AM Nathan Mittler @.***> wrote:

@zhlsunshine https://github.com/zhlsunshine

Hi @nmittler https://github.com/nmittler, yes, I haven't. But multiple requests with the same hostname given in concurrency and whose resolved IP addresses are not in the DNS resolved cache, then this situation leads to the same number of times of DNS resolving operations for the same hostname, it's really waste, so this optimization should be valid even we do not test the performance for it.

I understand what you're trying to optimize for. What I'm suggesting is that it may not be a huge issue in practice. I'd rather not add this type of complexity until we actually have a customer that is being impacted.

— Reply to this email directly, view it on GitHub https://github.com/istio/ztunnel/pull/672#issuecomment-1806157400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEYGXM3PZ3N2OV4FSFUXPDYDZRQLAVCNFSM6AAAAAA4EL76HWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBWGE2TONBQGA . You are receiving this because you were mentioned.Message ID: @.***>

howardjohn avatar Nov 10 '23 19:11 howardjohn

Hi @howardjohn, besides the performance change, we also did some code refactor for the DNS resolver to make it more clear, does it make sense?

zhlsunshine avatar Nov 24 '23 03:11 zhlsunshine

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

istio-testing avatar Dec 19 '23 22:12 istio-testing

https://github.com/istio/ztunnel/pull/1098

howardjohn avatar May 28 '24 23:05 howardjohn