ztunnel
optimize the function load_balance_for_hostname to avoid starting multiple background on-demand DNS tasks
😊 Welcome @zhlsunshine! This is either your first contribution to the Istio ztunnel repo, or it's been a while since you've been here.
You can learn more about the Istio working groups, code of conduct, and contributing guidelines by referring to Contributing to Istio.
Thanks for contributing!
Courtesy of your friendly welcome wagon.
Hi @howardjohn & @hzxuzhonghu, do you have any comments on this PR? Thanks!
Will take a look
I do not fully understand the idea.
I think we just need a CachedResolvedDNSMap, which stores a hash map of hostname to SingleFlight. The SingleFlight only needs to store the Notify.
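A minimal sketch of the structure suggested here (the names are illustrative assumptions, not the actual ztunnel types): a map from hostname to a single-flight entry, where the entry only needs to carry the Notify used to wake requests waiting on the same hostname.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Notify;

// One entry per in-flight DNS resolution; later requests for the same
// hostname wait on `notify` instead of starting their own resolution.
struct SingleFlight {
    notify: Arc<Notify>,
}

// hostname -> in-flight on-demand DNS resolution
type CachedResolvedDnsMap = HashMap<String, SingleFlight>;
```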
Hi @hzxuzhonghu, let me explain these two AtomicBools in more detail:
Although state is Arc<RwLock<ProxyState>>, each read or write of ProxyState is only atomic within its own code block, for example reading CachedResolvedDNSMap via get_cached_resolve_dns_for_hostname and writing it via set_cached_resolve_dns_for_hostname. We therefore need these two AtomicBools to ensure that only the first arriving request writes to the hash map, while the others wait for it to complete.
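A hedged sketch of the two-AtomicBool gate described above (field and function names are assumptions, not the actual PR code): the first request flips the running flag and performs the resolution; later requests wait on the Notify until the completed flag is set.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::Notify;

struct SingleFlight {
    is_resolved_dns_running: AtomicBool,
    is_resolved_dns_completed: AtomicBool,
    notify: Notify,
}

impl SingleFlight {
    async fn resolve_or_wait(&self, hostname: &str) {
        // Register interest before checking the flags so a notify_waiters()
        // that fires in between is not missed.
        let notified = self.notify.notified();
        if self
            .is_resolved_dns_running
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            // First arrival: perform the on-demand DNS resolution and cache it.
            // ... resolve `hostname`, write the result into the shared state ...
            let _ = hostname;
            self.is_resolved_dns_completed.store(true, Ordering::Release);
            self.notify.notify_waiters();
        } else if !self.is_resolved_dns_completed.load(Ordering::Acquire) {
            // Later arrivals wait for the first request to finish.
            notified.await;
        }
    }
}
```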
Hi @howardjohn & @hzxuzhonghu & @nmittler, is there anything I can do to get this approved? Thanks!
In addition, we can add a test for it.
Hi @hzxuzhonghu, sure, I added a concurrent-requests test named build_concurrency_requests_unknown_dest for this change!
Hi @hzxuzhonghu & @howardjohn, I have removed the two AtomicBool fields is_resolved_dns_running and is_resolved_dns_completed from SingleFlight and added a calling_index field to record the request index, so the DNS resolving task is triggered only when calling_index is 1 (the first request to arrive). A sketch of the idea is shown below. Could you please help review it? Thanks!
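A minimal sketch of the calling_index variant described above (names are illustrative assumptions). It assumes callers check the resolved-DNS cache first, so only requests racing on a not-yet-resolved hostname ever reach this point; the request that observes index 1 starts the resolution and the rest wait.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::Notify;

struct SingleFlight {
    calling_index: AtomicUsize,
    notify: Notify,
}

impl SingleFlight {
    async fn resolve_or_wait(&self, hostname: &str) {
        // Create the Notified future first so a notification sent between the
        // counter check and the await is not lost.
        let notified = self.notify.notified();
        // fetch_add returns the previous value, so the first caller sees 0
        // (i.e. it is "calling_index 1").
        if self.calling_index.fetch_add(1, Ordering::AcqRel) == 0 {
            // ... perform the on-demand DNS resolution for `hostname` and
            //     store the addresses in the shared cache ...
            let _ = hostname;
            self.notify.notify_waiters();
        } else {
            // Assumes no request reaches this point after the resolution has
            // already completed (the cache check above handles that case).
            notified.await;
        }
    }
}
```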
Hi @hzxuzhonghu, new commit answering your question: when will we clean up the map?
I added initial_query and dns_refresh_rate for this map, similar to the by_hostname hash map, but with a default refresh rate of 1 hour. Please let me know if you have a better suggestion. Thanks! ^_^
Need to think about this; tbh I do not think this is a good way.
Hi @hzxuzhonghu, there is no standard for the refresh rate, am I right? I also noticed that the expiry time for the by_hostname hash map is 60 seconds by default; should I follow that?
Hi @hzxuzhonghu, as we discussed, the IP addresses are stored in by_hostname once the call to resolve_on_demand_dns completes, so there is no need for a refresh rate for the elements in sf_by_hostname.
Hi @hzxuzhonghu, I have changed the notification mechanism from tokio::sync::Notify to a Mutex and Condvar, which support multiple waiters and a single notifier. The basic logic is unchanged.
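A minimal sketch of the std Mutex + Condvar pattern described above (names are illustrative): a single notifier flips a flag and wakes all waiters. Note that unlike the async Notify, std's Condvar blocks the calling OS thread, which is the concern raised in the next review comment.

```rust
use std::sync::{Condvar, Mutex};

// Many waiters block until a single notifier marks the resolution as done.
struct DnsGate {
    done: Mutex<bool>,
    cond: Condvar,
}

impl DnsGate {
    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while !*done {
            done = self.cond.wait(done).unwrap();
        }
    }

    fn notify_all_done(&self) {
        *self.done.lock().unwrap() = true;
        self.cond.notify_all();
    }
}
```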
Hi @hzxuzhonghu any update? Thanks!
Hi @nmittler, @howardjohn and @hzxuzhonghu, PTAL at these new commits, thanks!
Even though you switched to tokio structures, I still don't think this approach will work. You're still performing the wait within the ProxyState, which is itself locked by a mutex (see the call site). This will still cause thread blocking because we're using a regular sync Mutex (which I don't think we should change). Instead, perhaps find_destination should return a future? Then the caller can unlock the ProxyState mutex before waiting on the result.
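A hedged sketch of the shape suggested here (types and names are illustrative, not the real ztunnel API): find_destination hands back a future instead of waiting itself, so the caller can drop the sync ProxyState mutex before awaiting the DNS result.

```rust
use std::future::Future;
use std::net::IpAddr;
use std::pin::Pin;
use std::sync::Mutex;

struct ProxyState;

impl ProxyState {
    fn find_destination(
        &self,
        hostname: &str,
    ) -> Pin<Box<dyn Future<Output = Option<IpAddr>> + Send>> {
        let hostname = hostname.to_string();
        // The returned future owns everything it needs and does not borrow
        // ProxyState, so it can outlive the lock guard.
        Box::pin(async move {
            // ... resolve `hostname` on demand or wait for the in-flight task ...
            let _ = hostname;
            None::<IpAddr>
        })
    }
}

async fn lookup(state: &Mutex<ProxyState>, hostname: &str) -> Option<IpAddr> {
    // Hold the lock only long enough to obtain the future, then await it
    // with the mutex released so other tasks are not blocked.
    let fut = {
        let guard = state.lock().unwrap();
        guard.find_destination(hostname)
    };
    fut.await
}
```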
Hi @nmittler, I changed wait_for_notifying to an async function and call it with await. I believe wait_for_notifying will yield to the executor once it blocks, so the whole thread will not be blocked. Is that okay?
Hi @howardjohn, @hzxuzhonghu and @nmittler, this PR has changed based on @nmittler's suggestion: a new struct named DnsResolver now holds everything related to DNS resolving, simplifying ProxyState, and DnsResolver is part of DemandProxyState. Could you please help review it again? Thanks in advance!
Hi @nmittler, as we discussed, it has become much simpler.
@zhlsunshine I created https://github.com/istio/ztunnel/pull/722, which re-works this a bit based on some things we discussed. PTAL.
@zhlsunshine could we step back for a moment and discuss the motivation for this PR? Are you seeing a real issue with DNS performance, specifically due to concurrent requests for the same host? I just want to make sure we're solving a real problem here.
Hi @nmittler, frankly, I've recently been focusing on some validations of ambient mesh in a CSP environment. What I am sure about is that the original code has this problem: there may be multiple DNS resolutions for the same hostname under concurrent requests.
BTW, I have no more comments on your rework PR#722. And I insist that you should be a co-author of this PR, thanks to your help and comments. ^_^
@zhlsunshine just to be clear, you haven't actually seen performance problems in a deployed system? My concern is fundamentally about the complexity of the solution. I'd like to see that the problem is real before we optimize for it.
@howardjohn @hzxuzhonghu any thoughts?
I agree with Nate
Hi @nmittler, no, I haven't. But when multiple concurrent requests arrive for the same hostname and its resolved IP addresses are not yet in the DNS cache, each request triggers its own DNS resolution for that hostname. That is really wasteful, so this optimization should be valid even without performance testing.
@zhlsunshine
I understand what you're trying to optimize for. What I'm suggesting is that it may not be a huge issue in practice. I'd rather not add this type of complexity until we actually have a customer that is being impacted.
One thing to note is that the current behavior is the same as not having the DNS proxy at all. So it's not like this is fixing a regression; it's purely a performance benefit, and the intent of the proxy is not performance.
Hi @howardjohn, besides the performance change, we also did some code refactoring of the DNS resolver to make it clearer; does that make sense?
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
https://github.com/istio/ztunnel/pull/1098