nsqd: discover topic/channel paused state on new topic discovery
Fix #1199
@jehiah @mreiferson @ploxiln This PR is ready for review. PTAL.
Here is the design:
- Supports sending topic/channel paused state through the REGISTER command (see the companion go-nsq PR).
- nsqd sends these states to nsqlookupd when a topic/channel is created, deleted, paused, or unpaused.
- nsqlookupd stores these states, so nsqd can discover topic/channel paused state on new topic discovery through the HTTP API that nsqlookupd provides.

See the code for more details; a rough sketch of the discovery flow is shown below. :)
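To make the flow concrete, here is a minimal, hypothetical sketch of the nsqd side of discovery. The endpoint shape and the `paused` field are illustrative assumptions for this thread, not the PR's actual response schema (see the diff for that):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// channelState mirrors a hypothetical response entry; the "paused"
// field is an assumption for illustration, not the PR's actual schema.
type channelState struct {
	Name   string `json:"name"`
	Paused bool   `json:"paused"`
}

// discoverChannelStates asks nsqlookupd's HTTP API for a topic's
// channels and their paused state.
func discoverChannelStates(lookupdHTTPAddr, topic string) ([]channelState, error) {
	endpoint := fmt.Sprintf("http://%s/channels?topic=%s",
		lookupdHTTPAddr, url.QueryEscape(topic))
	resp, err := http.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var body struct {
		Channels []channelState `json:"channels"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return body.Channels, nil
}

func main() {
	states, err := discoverChannelStates("127.0.0.1:4161", "example_topic")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, ch := range states {
		fmt.Printf("channel %q paused=%v\n", ch.Name, ch.Paused)
	}
}
```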
Thanks for taking a first pass at this @thekingofworld! 🙏
@mreiferson This PR is ready for review. PTAL. /cc @ploxiln @jehiah
/cc @mreiferson @ploxiln @jehiah can any of you help take a look at this? 🙏
@mreiferson Ping.
/cc @mreiferson Thank you for the suggestions above, they're great. I have addressed all of them and the code is clearer now; please take another look. Thanks again for your patience.
Continuing to improve! Thanks for your patience 🙏
My big high-level concern now is that these new paths in `nsqlookupd` are going to be significantly more expensive. In the `/channels` case, we're doing multiple passes over data in `RegistrationDB` to filter and look up paused state in producers. Considering #1208, we should think about better data structures to use to store this data, to make these newer, more complex queries performant.

cc @andyxning
Agree.
> In the `/channels` case, we're doing multiple passes over data in `RegistrationDB` to filter and look up paused state in producers.
Given that, it seems like handling the `*` case for `FindRegistrations` is not enough. I have to think about how to solve this whole problem. @mreiferson @andyxning Any ideas?
For example, we could add two additional maps to handle the `*` case for `FindRegistrations`, and also an additional map to handle the state lookup for topic/channel, couldn't we? A sketch of the idea follows.
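Something along these lines; the types and field names here are illustrative, not nsq's actual code:

```go
// Illustrative sketch only, not nsq's actual code: secondary indexes
// that answer the "*" case without scanning every registration.
package main

import (
	"fmt"
	"sync"
)

type Registration struct {
	Category string
	Key      string // topic
	SubKey   string // channel
}

type RegistrationDB struct {
	sync.RWMutex
	registrationMap map[Registration]struct{}

	// the two additional maps for the "*" case
	byKey    map[string][]Registration // topic -> registrations
	bySubKey map[string][]Registration // channel -> registrations
}

func NewDB() *RegistrationDB {
	return &RegistrationDB{
		registrationMap: make(map[Registration]struct{}),
		byKey:           make(map[string][]Registration),
		bySubKey:        make(map[string][]Registration),
	}
}

// FindByKey serves the FindRegistrations(category, key, "*") pattern in
// time proportional to the matches instead of the whole map.
func (r *RegistrationDB) FindByKey(key string) []Registration {
	r.RLock()
	defer r.RUnlock()
	return r.byKey[key]
}

func main() {
	db := NewDB()
	k := Registration{"channel", "orders", "archive"}
	db.registrationMap[k] = struct{}{}
	db.byKey[k.Key] = append(db.byKey[k.Key], k)
	db.bySubKey[k.SubKey] = append(db.bySubKey[k.SubKey], k)
	fmt.Println(db.FindByKey("orders"))
}
```

The trade-off is extra memory, plus the need to keep the indexes consistent with the primary map on every add/remove.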
@mreiferson I added some benchmarks for the `/channels` case; the execution results are as follows:

It still seems fast enough, so I doubt whether the write lock starvation situation really exists?
@mreiferson Ping.
@mreiferson @jehiah @ploxiln Can anyone help take a look at this? It has been waiting for more than 10 days...
Hi folks, let's keep going. [Doge] If there are any problems, please let me know and I will try my best to solve them. Let's make nsq more powerful. /cc @mreiferson @jehiah @ploxiln
hi @thekingofworld, sorry for the delay, I'll try to find some time this week to take a closer look. We do know that performance issues exist at larger scale (https://github.com/nsqio/nsq/pull/1208), and this PR will make some queries perform worse.
Well, I read the related discussion on PR #1208. I'm wondering if there is more detailed data indicating how slow these queries are, such as how long a query takes, and with how many topics, channels, and producers... After reproducing this slow situation, we can verify how much the new optimization scheme improves efficiency. /cc @mreiferson
By the way, benchmarks show that these queries are still fast.
I think that the situation that #1208 is attempting to optimize, which the existing benchmark setups here do not emulate, is when there are many channels per topic, and they are constantly being added and removed (as ephemeral channels tend to do when consumers connect and disconnect). Each add and remove does a "write" lock/unlock of the RWMutex, and this blocks all the "read" locks of the lookups. The current benchmarks set up just two channels per topic, do lookups without concurrent updates, and only time the lookup part.
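A standalone illustration of that effect (not nsq code; the numbers are arbitrary): a steady stream of write locks on a `sync.RWMutex` delays concurrent readers, which is what ephemeral-channel churn does to `RegistrationDB` lookups.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	stop := make(chan struct{})

	// Writer: simulates channels being added/removed, each mutation
	// briefly holding the write lock.
	go func() {
		for {
			select {
			case <-stop:
				return
			default:
				mu.Lock()
				time.Sleep(100 * time.Microsecond) // registration work
				mu.Unlock()
			}
		}
	}()

	// Reader: simulates /channels lookups under the read lock.
	start := time.Now()
	for i := 0; i < 1000; i++ {
		mu.RLock()
		// lookup work would happen here
		mu.RUnlock()
	}
	close(stop)
	fmt.Printf("1000 lookups with concurrent writes took %v\n", time.Since(start))
}
```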
@ploxiln yeah, the situation you described is right.
But I think the key is how `FindRegistrations` will behave in the `*` case when the db is very large. So I added some benchmarks for this; the results are shown below, with a sketch of the benchmark setup after them:
- When topic=64, channel=512, producer=512, each operation takes 11ms.
- When topic=512, channel=512, producer=512, memory exceeds 10G, which I think is abnormal.
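Roughly, the benchmark setup looks like this. It's a sketch rather than the exact benchmark code, and it assumes it sits next to registration_db.go in the nsqlookupd package, using `NewRegistrationDB`, `AddRegistration`, and `FindRegistrations` from there:

```go
// Benchmark sketch, assuming nsqlookupd's RegistrationDB API; run with
// `go test -bench=. -benchmem` inside the nsqlookupd package.
package nsqlookupd

import (
	"fmt"
	"testing"
)

func BenchmarkWildcardFindRegistrations(b *testing.B) {
	db := NewRegistrationDB()
	topics, channels := 64, 512
	for t := 0; t < topics; t++ {
		topic := fmt.Sprintf("topic-%03d", t)
		for c := 0; c < channels; c++ {
			db.AddRegistration(Registration{"channel", topic, fmt.Sprintf("ch-%03d", c)})
		}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// the "*" subkey is the expensive case: it scans registrations
		db.FindRegistrations("channel", fmt.Sprintf("topic-%03d", i%topics), "*")
	}
}
```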
It seems like we can add additional maps to improve performance for the `*` case.
/cc @ploxiln @mreiferson
@mreiferson @ploxiln Let's continue with this.
Our previous discussion stopped at the performance of the `/channels` case. Recently I added an additional map to improve the performance of that query, which essentially optimizes the function `FindRegistrations` when it is called with `*`.
PTAL.
Comparison before and after optimization:
Performance increased by 40%.
Cons:
- This adds some memory pressure, since additional memory is allocated for the new map.
@mreiferson @ploxiln Ping.
Thanks for continuing to push this forward @thekingofworld.
The approach here will now be much more costly on the write (registration) side, because each write will now have to walk the whole structure and produce the cached "high performance" map.
Instead, we should concurrently update both the standard and new "high performance" maps during each mutation.
The discussion in https://github.com/nsqio/nsq/pull/1208#issuecomment-549509152 (from "if sync.Map everywhere is too much overhead" down) and https://github.com/nsqio/nsq/pull/1208#issuecomment-643821646 has the most relevant context for what needs to happen here.
> Instead, we should concurrently update both the standard and new "high performance" maps during each mutation.
Agree, update the high-speed map during each mutation instead of walking the whole structure to produce the cached "high performance" map, along the lines of the sketch below.
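Continuing the illustrative `RegistrationDB` sketch from earlier in the thread (still not the PR's actual diff), the mutation side would look like this:

```go
// Each mutation updates the primary map and the secondary indexes
// together, under the same write lock, so reads never rebuild anything.
func (r *RegistrationDB) AddRegistration(k Registration) {
	r.Lock()
	defer r.Unlock()
	if _, ok := r.registrationMap[k]; ok {
		return
	}
	r.registrationMap[k] = struct{}{}
	r.byKey[k.Key] = append(r.byKey[k.Key], k)
	r.bySubKey[k.SubKey] = append(r.bySubKey[k.SubKey], k)
}

func (r *RegistrationDB) RemoveRegistration(k Registration) {
	r.Lock()
	defer r.Unlock()
	if _, ok := r.registrationMap[k]; !ok {
		return
	}
	delete(r.registrationMap, k)
	r.byKey[k.Key] = removeOne(r.byKey[k.Key], k)
	r.bySubKey[k.SubKey] = removeOne(r.bySubKey[k.SubKey], k)
}

// removeOne deletes the first occurrence of k from s.
func removeOne(s []Registration, k Registration) []Registration {
	for i := range s {
		if s[i] == k {
			return append(s[:i], s[i+1:]...)
		}
	}
	return s
}
```

This keeps each write proportional to the single registration being changed, rather than rebuilding the cached map by walking the whole structure.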
@mreiferson done, PTAL.
@mreiferson ping. Let's finish it...