nsq topic: allow unbuffered memoryMsgChan (partial revert of b3b29b7eff41)

In b3b29b7eff41 / https://github.com/nsqio/nsq/pull/1159 unbuffered memoryMsgChan were replaced with nil chan for both Topic and Channel (this only happens if mem-queue-size=0). This inadvertently made deferred publish never work (with mem-queue-size=0), because some metadata is lost when messages go through the "backend" disk queue, but previously messages could go immediately through the topic's unbuffered chan with their metadata intact, to then be added to each channel's deferred pqueue. This partial revert brings this part back.

Granted, mem-queue-size=0 never worked very well, in many respects. For this functionality, at low message publish rates, it seems there was rarely a problem. But at high message publish rates it was always likely that some messages would hit the topic disk queue and lose the deferred time metadata. This potential problem remains.

In fact, the motivation for fully disabling the memoryMsgQueue for mem-queue-size=0 was to avoid excessively shuffling message order, which would happen if some messages jump instantly through the memory-queue while others take the longer way through the disk-queue. That will be likely again. But the users who noticed this probably did not use publish-deferred functionality!

Additional fixes would be appropriate, but this is a quick-fix to restore previous useful behavior for mem-queue-size=0. fixes #1375

Sep 15 '21 06:09 ploxiln

I pushed up another commit, a hacky idea to handle the case of memQueueSize=0 (or memQueueSize=1 which it seems like some people might pick with the same idea/motivation), by waiting up to a second to push through the memoryMsgQueue, but only if there's a good chance the message would be imminently consumed and copied to the channels. Comments and ideas welcome :)

Sep 19 '21 19:09 ploxiln

OK, I came up with another strategy, which I think is more efficient, fixes some long-standing potential for undesired behavior with very small MemQueueSize, and is pretty nifty. But it ran into a deadlock during tests, which I fixed easily enough (in the follow-up commit) ... but which is worrying:

test calls GetTopic() -> NewTopic() -> nsqd.Notify()
NewTopic() spawns a goroutine for topic.messagePump()
nsqd.Notify() spawns a goroutine for PersistMetadata()
test calls topic.PutMessage() -> topic.RLock() -> success
- topic.put() waits for either memoryMsgChan or inactiveChan (new in this PR)
PersistMetadata() goroutine waits for topic.Lock() (behind RLock() just taken by PutMessage())
topic.messagePump() goroutine waits for topic.RLock() to get channel list (behind pending Lock() for PersistMetadata())

fixed by making PersistMetadata() also use an RLock() too ... just a bit weird that PutMessage() can get in front of messagePump() initialization and PersistMetadata(). But it seems that if anything else could sometimes call topic.Lock() between topic.PutMessage() and messagePump processing channelUpdateChan, it could deadlock again. So that's the downside/fault of this latest strategy, which I think is otherwise pretty darn cool.

Sep 20 '21 02:09 ploxiln

Went back to the simpler idea, with an additional "started" flag.

Sep 27 '21 04:09 ploxiln

@ploxiln catching up on this...

Doesn't this effectively reduce the likelihood of messages being written to disk when mem-queue-size < 32, which is the opposite of what most users intend?

Seems like users are trying to accomplish a few things with mem-queue-size = 0:

Round trip all messages to disk
Preserve a more strict ordering (retries aside)
Still be able to use deferred pub

This fix breaks (1), somewhat handles (2), and fixes (3).

Note, it looks like we also broke ephemeral channels according to https://github.com/metal-stack/metal-roles/pull/74. This makes sense intuitively, if you don't want any messages stored in memory ephemeral channels shouldn't work? Not sure we did that intentionally though.

Nov 28 '21 18:11 mreiferson

With mem-queue-size = 0 (and this change), the messages should still hit disk, but pass more directly to the channels diskqueues if possible (and if not possible still go directly to the topic diskqueue, like nsq-1.2.1). This change should fix deferred publish, while keeping mem-queue-size = 0 similarly useful as it was before (similarly not-100%-guarantee of message persistence ... but less I suppose, because the disk write error will happen in Topic.messagePump() -> Channel.PutMessage(), instead of {request} -> Topic.PutMessage(), thus will not return error to caller, but will still SetHealth()).

The change to nil channel was somewhat motivated by a desire for message order to be more generally preserved ... and with this PR, order will also be generally preserved, because the channels will quickly drain messages from (waiting 1 second to be sent on) the topic memchan, and then write them directly to the channel diskqueue (since the channel memchan is being left nil if mem-queue-size = 0).

Nov 28 '21 22:11 ploxiln

... so yes, agreed with what you wrote, but I think this handles (2) just fine, and only somewhat breaks (1) (mainly the feedback).

Nov 28 '21 22:11 ploxiln

sorry I'm late on this... I think I follow what's happening now, but I'm still not sold. In the event consumers are slow, this creates undesirable back-pressure when you'd rather those messages get written to disk.

Why don't we just revert #1159 and take a step back with what (if anything) we want to achieve with --mem-queue-size=0?

Alternatively, I think the fix in 786c1f2 is fine — creating an ephemeral channel should arguably never write to disk. For DPUB, perhaps there's a similar highly-targeted fix (like keeping a non-nil memory message chan and only using it when a DPUB occurs?

Jan 17 '22 17:01 mreiferson

Why don't we just revert #1159

Yeah, I'm amenable to this idea. I was trying to please the no-mem-queue use-case here but agree it might not be worth all this.

Jan 17 '22 18:01 ploxiln

Why don't we just revert #1159

Yeah, I'm amenable to this idea. I was trying to please the no-mem-queue use-case here but agree it might not be worth all this.

@mreiferson Sorry for bothering but is there any progress? Our team has some resources to help fix this problem, that is, you can simply point out the potential problem location. We want to help.

Here are some of my ideas that may not be consistent with your original design：

For temporary channels, I find that the virtual back-end queue is currently useless. When memorymsgchan = 0, whether the message should also be sent to the backend queue, but the backend queue can have timeliness?
Is it possible to allow some messages to exist in the backend queue, but this message is time effective? Even if you change memorymsgchan to not nil, I also find that when the number of messages I send in batches exceeds memqueuesize, only some messages are consumed and the rest are lost

Jan 21 '22 08:01 herry-go

Updated to only change behavior (compared to latest nsq-1.2.1 release) for deferred messages or paused topics. (But it's still kinda a lot of lines.)

Aug 01 '22 02:08 ploxiln

Want it back

Aug 01 '22 06:08 Gradelia21

much reduced, squashed, re-wrote commit message

Aug 02 '22 04:08 ploxiln

Thanks for the reviews @mreiferson :)

Aug 02 '22 04:08 ploxiln