Cleanup: Remove legacy dlink_list from proxy_match_cache
Auth::User uses a dlink_list for proxy_match_cache; replace it with a std::list<acl_proxy_auth_match_cache> to make the code cleaner and more readable.
Removing this dlink container should resolve Coverity issue #1461172.
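A minimal sketch of the shape of the change (the stand-in types and member names below are illustrative, not actual Squid code):

```cpp
#include <list>

// Minimal stand-ins for Squid's invasive list types (illustrative only).
struct dlink_node { void *data = nullptr; dlink_node *prev = nullptr, *next = nullptr; };
struct dlink_list { dlink_node *head = nullptr; dlink_node *tail = nullptr; };

// Before: each cache entry embeds its own link node and is threaded onto
// Auth::User's dlink_list by hand.
struct acl_proxy_auth_match_cache_old {
    dlink_node link;           // invasive: the entry carries its own list node
    int matchrv = -1;          // cached ACL match result
    void *acl_data = nullptr;  // identifies which ACL this result is for
};

// After: std::list owns the entries; no manual link management, and the
// element type no longer needs to know it lives in a list.
struct acl_proxy_auth_match_cache {
    int matchrv = -1;
    void *acl_data = nullptr;
};

struct User {
    std::list<acl_proxy_auth_match_cache> proxyMatchCache;
};
```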
I wouldn't consider this a problem, especially if we see this as a trampoline to move to unordered_map.
On Sat, Apr 4, 2020 at 4:14 PM Alex Rousskov [email protected] wrote:
Please note that std::list is usually a lot more expensive than dlink (i.e., an invasive list) as far as performance is concerned because std::list does one extra memory allocation for every list element. In performance-sensitive cases, we should not replace dlink with std::list.
Do you consider proxy_match_cache to be performance sensitive?
-- Francesco
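For illustration, a self-contained sketch of the allocation difference Alex describes: with an invasive list the element already exists and linking it in is pure pointer surgery, while std::list allocates a fresh node for every element (all names here are hypothetical):

```cpp
#include <list>

struct Entry {
    Entry *prev = nullptr;  // invasive links live inside the element itself
    Entry *next = nullptr;
    int value = 0;
};

// Invasive list: the element is already allocated (often as part of a larger
// object), so linking it in requires zero extra allocations.
void linkFront(Entry *&head, Entry &e) {
    e.next = head;
    if (head)
        head->prev = &e;
    head = &e;
}

int main() {
    Entry a, b;          // storage already exists (stack, pool, parent object)
    Entry *head = nullptr;
    linkFront(head, a);  // no allocation
    linkFront(head, b);  // no allocation

    // std::list: every push allocates a list node on the heap and copies
    // the element into it -- one allocation per element.
    std::list<int> l;
    l.push_back(1);      // heap allocation
    l.push_back(2);      // heap allocation
}
```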
I wouldn't consider this a problem
Noted, but why would you not? I am interested in the (missing) analysis rather than its (already implied by the PR code) conclusion.
trampoline to move to unordered_map
If std::unordered_map is the correct final destination, then why not replace the current invasive list with std::unordered_map instead of replacing a fast dlist with slow std::list and then replacing slow std::list with std::unordered_map?
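For reference, one hypothetical shape such an unordered_map-based cache could take (the key choice and all names below are assumptions, not actual Squid code):

```cpp
#include <unordered_map>

// Hypothetical cache shape: keyed by ACL identity (here, its raw pointer),
// mapping to the cached match result. Lookup is O(1) on average instead of a
// linear list walk, but each inserted element still costs a node allocation.
struct ProxyMatchCache {
    std::unordered_map<const void *, int> results;  // acl_data -> matchrv

    // Returns a pointer to the cached result, or nullptr on a cache miss.
    const int *find(const void *aclData) const {
        const auto it = results.find(aclData);
        return it == results.end() ? nullptr : &it->second;
    }

    void remember(const void *aclData, int matchrv) {
        results[aclData] = matchrv;  // insert or overwrite
    }
};
```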
On Sat, Apr 4, 2020 at 4:34 PM Alex Rousskov [email protected] wrote:
I wouldn't consider this a problem
Noted, but why would you not? I am interested in the (missing) analysis rather than its (already implied by the PR code) conclusion.
Fair point. I believe that the extra cost is dwarfed by the cost of walking the cache and the associated levels of indirection. On the other hand, letting the compiler have type information throughout might enable extra optimizations that are not available when going through void*.
trampoline to move to unordered_map
If std::unordered_map is the correct final destination, then why not replace the current invasive list with std::unordered_map instead of replacing a fast dlist with slow std::list and then replacing slow std::list with std::unordered_map?
I have refrained from using an unordered_map so far to keep the scope of the change smaller; I'd rather go through two iterations, and I'd also be happy to get other people's ideas about moving to unordered_map.
-- Francesco
I believe that the extra cost is dwarfed by the cost of walking the cache and the associated levels of indirection.
My measurements and general knowledge partially contradict that belief: A memory allocation may cost a few dozen microseconds even when the allocator can avoid a hugely expensive system call. On the other hand, dereferencing a pointer usually costs a lot less than one microsecond!
Just like we are fighting to avoid unnecessary memory copies, we should be fighting to avoid unnecessary memory allocations. In fact, I would not be surprised if, in modern Squid, an average memory allocation is more expensive than an average memory copy!
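A rough, illustrative micro-benchmark of that comparison (results vary widely by allocator, compiler, and platform; this is a sketch, not a definitive measurement):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <memory>

int main() {
    constexpr int N = 1000000;
    constexpr std::size_t Size = 64;
    using Clock = std::chrono::steady_clock;
    long sink = 0;  // keeps both loops observable to the optimizer

    auto t0 = Clock::now();
    for (int i = 0; i < N; ++i) {
        std::unique_ptr<char[]> p(new char[Size]);  // allocate...
        p[0] = char(i);
        sink += p[0];
    }                                               // ...and free
    auto t1 = Clock::now();

    char src[Size] = {};
    char dst[Size] = {};
    for (int i = 0; i < N; ++i) {
        src[0] = char(i);             // vary the source a little
        std::memcpy(dst, src, Size);  // copy the same amount of data
        sink += dst[0];
    }
    auto t2 = Clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("alloc+free: %.1f ms, copy: %.1f ms (sink=%ld)\n",
                ms(t1 - t0).count(), ms(t2 - t1).count(), sink);
}
```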
As for "walking" the cache, you are right about its relatively huge cost, especially if the current code is using linear search. However, that is not a valid argument for ignoring the cost of an unrelated-to-the-search change, especially if there are plans to optimize that search. Here is why I am concerned about this approach:
With the "its small relative to X" approach:
- Starting with response time
X+y
, where bigX
is the search time, and smally
is the overhead of the surrounding code, including memory allocations and pointer dereferences. - Lots of expensive allocations are added (because they are much cheaper than linear search
X
). Response time increases toX + Y
. - The linear search is too slow so we add a search index. Response time decreases to
x + Y
.
With the "avoid adding expensive operations to soon-to-be-optimized code" approach:
- Starting with the same response time X + y.
- The linear search is too slow, so we add a search index. Response time decreases to x + y.
We know that X is much larger than y. Your argument is valid only if x is also much larger than Y. Since x (i.e. the optimized search time), in many hashing cases, involves little more than a few pointer dereferences, its value is likely to be comparable to y and might even become smaller than Y!
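To make this concrete with purely illustrative numbers: suppose the linear search costs X = 100 microseconds and the original surrounding overhead costs y = 1 microsecond, for 101 total. Adding per-element allocations raises the overhead to Y = 20 microseconds (120 total). Once indexing shrinks the search to x = 2 microseconds, the first approach ends at x + Y = 22 while the second ends at x + y = 3: the allocation overhead that looked negligible next to X now dominates the response time.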
There are cases where making things temporarily worse is a good idea. I am not convinced that this is one of them.
On the other hand, letting the compiler have type information throughout might enable extra optimizations that are not available when going through void*.
Yes, it might, even though I cannot think of an example where such optimizations would apply to this particular context. If such optimizations are indeed applicable, then this becomes one more valid argument for implementing a C++-friendly invasive list (but there are other, much stronger arguments for doing that already).
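As a sketch of what such a C++-friendly invasive list might look like (a hypothetical minimal design, not an existing Squid API):

```cpp
// A minimal type-safe invasive list: elements embed a typed Node, so
// insertion allocates nothing and traversal never goes through void*.
template <typename T>
class InvasiveList {
public:
    struct Node {
        T *next = nullptr;
    };

    void pushFront(T &e) {  // no allocation: e already exists
        e.node.next = head_;
        head_ = &e;
    }

    template <typename Fn>
    void forEach(Fn fn) {
        for (T *e = head_; e; e = e->node.next)
            fn(*e);         // fn sees a fully typed T&
    }

private:
    T *head_ = nullptr;
};

// The element type opts in by embedding a Node.
struct CacheEntry {
    InvasiveList<CacheEntry>::Node node;
    int matchrv = -1;
};

int main() {
    InvasiveList<CacheEntry> cache;
    CacheEntry a, b;     // storage owned elsewhere; the list never allocates
    cache.pushFront(a);
    cache.pushFront(b);
    cache.forEach([](CacheEntry &e) { e.matchrv = 1; });
}
```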
If std::unordered_map is the correct final destination, then why not replace the current invasive list with std::unordered_map instead of replacing a fast dlist with slow std::list and then replacing slow std::list with std::unordered_map?
I have refrained from using an unordered_map so far to keep the scope of the change smaller; I'd rather go through two iterations, and I'd also be happy to get other people's ideas about moving to unordered_map.
This "small step" part of the argument does not convince me because it only works if the step is necessary and moves us in the right direction. If the final destination is unordered_map, then it is not clear to me how the intermediate step at std::list is a necessary small step -- the amount of old code changes should be about the same regardless of the STL container type in use. The difference between STL containers is largely hidden inside them -- the code using the container remains about the same regardless of the specific container type, especially in the context of a container search optimization.
And even if you are right that std::list helps with future std::unordered_map migration, we should still determine that the final destination is indeed std::unordered_map! Just like std::list, std::unordered_map allocates memory for each element. Also, AFAICT, Amos has (unknown to me) other plans for similar caching structures. In summary, it is not clear that std::unordered_map is where we are going in this context.
On Sat, Apr 4, 2020 at 6:30 PM Alex Rousskov [email protected] wrote:
I believe that the extra cost is dwarfed by the cost of walking the cache and the associated levels of indirection.
My measurements and general knowledge partially contradict that belief: A memory allocation may cost a few dozen microseconds even when the allocator can avoid a hugely expensive system call. On the other hand, dereferencing a pointer usually costs a lot less than one microsecond!
Just like we are fighting to avoid unnecessary memory copies, we should be fighting to avoid unnecessary memory allocations. In fact, I would not be surprised if, in modern Squid, an average memory allocation is more expensive than an average memory copy!
Could be. Lots of dereferencing there.
As for "walking" the cache, you are right about its relatively huge cost, especially if the current code is using linear search. However, that is not a valid argument for ignoring the cost of an unrelated-to-the-search change, especially if there are plans to optimize that search. Here is why I am concerned about this approach:
Kind of, sort of. About a year ago I attended a talk by Alexandrescu where he showed how, on modern CPUs, linear search can be faster than binary search provided that the data being searched is well packed and that the vector being searched is "small enough". This is counterintuitive, but he showed very interesting numbers to support it. This is a different case altogether, of course. How about moving to a std::vector as an intermediate step? Would reallocations be more or less expensive? If we had telemetry we could reserve() appropriately; we don't, so we may need to guesstimate or make the sizing adaptive, but this would increase code complexity.
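A sketch of that std::vector intermediate step (the reserve() size is a guesstimate, and all names are illustrative, not actual Squid code):

```cpp
#include <vector>

struct MatchCacheEntry {
    const void *aclData = nullptr;  // identifies the ACL
    int matchrv = -1;               // cached match result
};

// Entries stored contiguously; the linear search is cache-friendly and, for
// small N, can beat node-based containers despite being O(N).
struct ProxyMatchCache {
    std::vector<MatchCacheEntry> entries;

    // Without telemetry we can only guesstimate a typical cache size.
    ProxyMatchCache() { entries.reserve(16); }

    const MatchCacheEntry *find(const void *aclData) const {
        for (const auto &e : entries)  // contiguous scan, no pointer chasing
            if (e.aclData == aclData)
                return &e;
        return nullptr;
    }

    void remember(const void *aclData, int matchrv) {
        entries.push_back({aclData, matchrv});  // may realloc at capacity
    }
};
```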
With the "its small relative to X" approach:
- Starting with response time X+y, where big X is the search time, and small y is the overhead of the surrounding code, including memory allocations and pointer dereferences.
- Lots of expensive allocations are added (because they are much cheaper than linear search X). Response time increases to X + Y.
- The linear search is too slow so we add a search index. Response time decreases to x + Y.
With the "avoid adding expensive operations to soon-to-be-optimized code" approach:
- Starting with the same response time X+y
- The linear search is too slow so we add a search index. Response time decreases to x + y.
We know that X is much larger than y. Your argument is valid only if x is also much larger than Y. Since x (i.e. the optimized search time) is, in many hashing cases, involves little more than a few pointer dereferences, its value is likely to be comparable to y and might even become smaller than Y!
There are cases where making things temporary worse is a good idea. I am not convinced that this is one of them.
On the other hand, letting the compiler have type information throughout might enable extra optimizations that are not available when going through void*.
Yes, it might, even though I cannot think of an example where such optimizations would apply to this particular context. If such optimizations are indeed applicable, then this becomes one more valid argument for implementing a C++-friendly invasive list (but there are other, much stronger arguments for doing that already).
If std::unordered_map is the correct final destination, then why not replace the current invasive list with std::unordered_map instead of replacing a fast dlist with slow std::list and then replacing slow std::list with std::unordered_map?
I have refrained from using an unordered_map so far to keep the scope of the change smaller; I'd rather go through two iterations, and I'd also be happy to get other people's ideas about moving to unordered_map.
This "small step" part of the argument does not convince me because it only works if the step is necessary and moves us in the right direction. If the final destination is unordered_map, then it is not clear to me how the intermediate step at std::list is a necessary small step -- the amount of old code changes should be about the same regardless of the STL container type in use. The difference between STL containers is largely hidden inside them -- the code using the container remains about the same regardless of the specific container type, especially in the context of a container search optimization.
And even if you are right that std::list helps with future std::unordered_map migration, we should still determine that the final destination is indeed std::unordered_map! Just like std::list, std::unordered_map allocates memory for each element. Also, AFAICT, Amos has (unknown to me) other plans for similar caching structures. In summary, it is not clear that std::unordered_map is where we are going in this context.
Fair point; we should see what Amos has in store here. It could even be that unordered_map has too much overhead. It would be nice to have some kind of reference workload -- not so much to measure performance as to get real-world probes of what these numbers look like in real life.
on modern CPUs linear search can be faster than binary search provided that the data being searched is well packed and that the vector being searched is "small enough"
Yes, this is probably a well-known fact by now, and, yes, we should keep it in mind when optimizing storage. I am guessing that proxy_match_cache would be too large (in interesting/relevant cases) to make linear search viable, but I have not researched this particular code. If you want to be responsible for finding the best new container for this code, then please research it so that we can make an informed decision.
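For anyone researching it, a self-contained micro-benchmark sketch comparing a packed-vector linear scan against a node-based container lookup at small N (illustrative only; results are machine-dependent):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <set>
#include <vector>

// For a small, densely packed working set, the O(N) contiguous scan often
// beats the O(log N) tree lookup because it avoids pointer chasing.
int main() {
    constexpr int N = 32;  // "small enough" working set
    constexpr int Lookups = 1000000;

    std::vector<int> v;
    std::set<int> s;
    for (int i = 0; i < N; ++i) {
        v.push_back(i * 7);
        s.insert(i * 7);
    }

    using Clock = std::chrono::steady_clock;
    long hits = 0;

    auto t0 = Clock::now();
    for (int i = 0; i < Lookups; ++i)
        hits += std::find(v.begin(), v.end(), (i % N) * 7) != v.end();
    auto t1 = Clock::now();

    for (int i = 0; i < Lookups; ++i)
        hits += s.count((i % N) * 7);
    auto t2 = Clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("hits=%ld vector: %.1f ms, set: %.1f ms\n",
                hits, ms(t1 - t0).count(), ms(t2 - t1).count());
}
```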
BTW, the performance regression discussed in the following comment is essentially the same as the performance concern I was talking about in my comment for this PR AFAICT: https://github.com/squid-cache/squid/pull/589#discussion_r403623773
FTR: updated the description to reference the Coverity issue highlighting this dlink container as a resource leak issue (minor).
In regards to the future of this data structure / container. It is one of the state caches I want to be using the ClpMap for when it gets merged.
Is merging this now going to make that work easier, or do you advise to abandon this PR?
On Tue, May 12, 2020 at 6:26 AM Amos Jeffries [email protected] wrote:
In regards to the future of this data structure / container. It is one of the state caches I want to be using the ClpMap for when it gets merged.
-- Francesco
I don't think this change makes much difference either way, except to the Coverity results. Since Alex and I are just debating the internals and final optimizations, I think that one is soon to be merged and its API should not be changing now.
If you want to go ahead with this PR, I think you should either get it ready to merge this week-ish or rebase on top of my v6-lrumap branch and be the one to do the long-term conversion for this cache.
IMO, this PR is useful even if the underlying cache type changes in some future PR: The primary contribution of this PR is in facilitating future maintenance/enhancement/support of the ACL cache by providing proper documentation, establishing proper code boundaries, and exposing performance-related decisions/concerns.
For example, it could be much easier to correctly replace std::list with X than to replace dlink_list with X because we (hopefully) would not need to fix performance regressions in the process -- those things would already be addressed in this PR (after spending a lot of time on getting to the correct outcome!).
If X existed today and this PR did not exist, we would go for replacing dlink_list with X. However, the actual situation is the opposite. I recommend merging this (cleaned up) PR unless the situation changes.
Amos: Since Alex and I are just debating the internals and final optimizations I think that one is soon to be merged and its API should be not changing now.
FWIW, I disagree with the above summary (to put it mildly), but I hope that this disagreement is largely irrelevant to this PR.
@kinkie, just FYI: Something went wrong with your PR branch -- it got many commits unrelated to this PR. This PR is (still) marked as being on your side. When you are done, please request re-review using GitHub's "Reviewers" box.
AFAICT, this PR requires @kinkie action to accommodate-or-reject my suggestions in https://github.com/squid-cache/squid/pull/590#pullrequestreview-437817565 and @yadij action to update his earlier blocking https://github.com/squid-cache/squid/pull/590#pullrequestreview-387677703.
Alex: AFAICT, this PR requires @kinkie action to accommodate-or-reject my suggestions in #590 (review) and @yadij action to update his earlier blocking #590 (review).
Amos: changing my review status. I am awaiting the resolution of the situation between @kinkie and @rousskov before I re-review the actual code resulting from that.
@kinkie, it sounds like Amos wants you to accommodate-or-reject my suggestions before he will re-review, so the ball is fully on your side now. If you want me to address those suggestions instead, please let me know.
Abandoning; will attempt again with a more restricted scope. Bitrot makes this hard to work with.