
nsqd: REQ without altering attempts

Open tj opened this issue 11 years ago • 20 comments

We have some cases where we have to wait around for distributed locks, so I just keep requeueing the messages to let messages of other types (ones that won't collide with those locks) flow through. The problem is that we also need a pretty low maxAttempts.

tj avatar Jun 20 '14 22:06 tj

@visionmedia are you proposing that the REQ command would get a parameter to not increment attempts?

A few thoughts...

One problem is that (according to nsqd) you really have attempted the message whatever N times it's been sent to a consumer. It's hard to argue that it's not accurate...

And implementation-wise they're disconnected: attempts is incremented on send, not on REQ, so it would be tricky to keep that state around.

What do you think, @jehiah?

mreiferson avatar Jun 21 '14 15:06 mreiferson

Yup, and I agree it's weird, but the ability to give the message back to NSQ side-effect free is definitely something we'd use a lot. Since the client handles discards anyway, we could have yet another client-side layer in redis that helps keep track of what is/isn't a real attempt, but all these layers are getting a little crazy, haha. The other problem is that this is at a large scale (5+ million messages in flight at any given time), so it eventually gets non-trivial to introduce tooling for the weird little edge cases.

I definitely feel like a lot of these are pretty specific to us and might warrant a fork, but I like bringing them up in case someone else has had similar issues.

tj avatar Jun 21 '14 17:06 tj

Another valid use-case:

When we put Redshift in maintenance mode or resize a cluster, we need to requeue those messages with a delay, but this also shouldn't count towards their number of attempts; otherwise we'll lose very large copies containing potentially millions of messages. Under normal circumstances one or two attempts is just fine, so they're definitely separate cases IMO.

tj avatar Jun 22 '14 02:06 tj

pause the channel while it's in maintenance mode :smile: - don't have the consumers pound it into the ground while performing an operational procedure on the cluster, right?

mreiferson avatar Jun 22 '14 05:06 mreiferson

it's a shared topic/channel ATM :(

tj avatar Jun 22 '14 05:06 tj

@jehiah care to weigh in with your thoughts here?

mreiferson avatar Jun 29 '14 13:06 mreiferson

FWIW I'm rewriting the entire thing in Go over the weekend, haha, changing how we handle things now that I understand the edge cases better. My first case isn't relevant anymore, but the second use case (clusters being under maintenance, etc.) is still relevant.

tj avatar Jun 29 '14 17:06 tj

The case you are talking about is where you consume a channel whose messages fan out to N independent clusters, you are putting one of those clusters in maintenance, and you want a way to avoid burning your possible attempts against the cluster in maintenance while handling attempts normally for the other clusters. Correct?

The combination of consumer backoff and per-message retry/backoff is entirely meant to deal with this state: individual messages get retried at increasing delays that can outlast your maintenance window, and you process more slowly, burning fewer retries even when messages are ready to be retried. If this is a special maintenance state, it sounds like you might be able to 'finish' these messages when they hit a cluster in maintenance and push them to a second topic/channel where you apply a different (higher) max retry attempts and probably a different strategy for requeueing and backoff.
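A rough sketch of the routing suggested above (the topic name and struct fields are made up, and this is a pure decision function rather than real go-nsq handler code): maintenance-bound messages get FINed and republished to a slow-retry topic instead of burning attempts on the primary channel.

```go
package main

import "fmt"

// Hypothetical disposition for a message, per the suggestion above: when
// the target cluster is in maintenance, FIN on the primary channel and
// republish to a slow-retry topic that carries a higher attempt budget.
type disposition struct {
	finish       bool
	publishTopic string // non-empty: republish here after FIN
}

func route(clusterInMaintenance bool) disposition {
	if clusterInMaintenance {
		return disposition{finish: true, publishTopic: "etl_maintenance_retry"}
	}
	// Normal path: process the message; ordinary REQ/backoff applies on failure.
	return disposition{}
}

func main() {
	fmt.Println(route(true).publishTopic)
}
```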

I think it's hard for nsq to give good primitives for more fine-grained control in this situation without the ability to tag messages with additional metadata that gets passed through. We've actively avoided that metadata because it's often more properly associated w/ the consumer (i.e. which cluster a message maps to) rather than the producer.

jehiah avatar Jun 29 '14 18:06 jehiah

cluster == redshift cluster in this case; they have mandatory weekly scheduled downtime. If the backoff logic were tailored to user logic, that might work OK: if cluster A is under maintenance, it backs off and B trickles through fine. The second queue thing could work; it's more stuff to manage, but it would work, I guess.

tj avatar Jun 29 '14 20:06 tj

I realize this point might be moot since you're moving to go-nsq, which does all of this for you, but implementing backoff (both slowing down the rate of consumption and deferring requeues) would be useful in nsq.js for this exact reason (per @jehiah's points).

mreiferson avatar Jun 30 '14 17:06 mreiferson

Another nice thing you could use this for is analyzing what's in the queue without having any real effect on it. I guess you could FIN and PUB, but that seems a little weird.

tj avatar Jul 11 '14 16:07 tj

Hmm, I keep coming across more and more use cases for this. Even if I pushed these messages to another nsqd or topic, I'd need to process them per-client as well, and we have too many clients to have separate topics, so I have pretty much no choice but to REQ with a reasonable delay. It can take anywhere from 3 hours to 3 days to ETL this data, though, so I can't rely on a large REQ delay being good enough.

Since a lot of nsq relies on pushing logic to the client, I think it's reasonable to have this behaviour. Whether or not the client makes an actual attempt to process the data is up to the client, I'd think.

tj avatar Jul 16 '14 17:07 tj

I need to think about this more.

I still have implementation concerns and "does this belong in the core" concerns, so I need to sift through those feelings and come up with a reasonable rebuttal or blessing.

Anyone lurking who watches the repo and has any feelings on this: now would be the time to weigh in :+1: or :-1:

mreiferson avatar Jul 16 '14 23:07 mreiferson

Possible hack: finish the message and re-publish to the same topic. Since messages are broadcast to all subscribing channels, you would unfortunately have to include something in the payload so a channel can ignore re-published messages unless it was the one that re-published them. A possible downside is that the message counts shown in nsqadmin could be inflated. The other possible issue is if you care about ordering.

Another solution: making some very broad assumptions about the problem you're trying to solve regarding acquiring locks, one possibility would be a first topic/channel pair with a fairly high retry-attempt limit. The handler on that channel would, per message, attempt to acquire the lock and REQ if the lock is unavailable. Once the lock is acquired, it would publish the message to another topic/channel that processes it while holding the lock, this time with the small number of REQ attempts.
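The two-stage idea above could be sketched like this (topic names are made up): stage one only tries to acquire the lock, so its REQs are cheap and its max-attempts can be high; stage two does the real work under the small attempt budget.

```go
package main

import "fmt"

// Hypothetical stage-one handler logic for the two-topic scheme suggested
// above: on the lock topic, either hand the message off to the processing
// topic (lock held) or requeue (lock busy). The processing topic's channel
// would then run with the small max-attempts.
func lockStage(lockAcquired bool) (publishTo string, requeue bool) {
	if lockAcquired {
		return "work_with_lock", false // hand off to the low-attempts stage
	}
	return "", true // REQ on the high-attempts lock topic
}

func main() {
	topic, _ := lockStage(true)
	fmt.Println(topic)
}
```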

dudleycarr avatar Jul 17 '14 00:07 dudleycarr

yea I was thinking about FIN/PUB, I guess the downsides I can think of would be:

  • it's not atomic
  • publishing would introduce another localized NSQD (unless we could check the origin but that's a little weird)
  • skews the metrics

Reducing the REQ attempts would definitely help, but I guess for me there's a conflict in the idea of what makes an attempt. Does receiving a message count as an attempt, or does actually processing it make it one? It also forces you to raise max_attempts to allow for these cases, versus the "real" max_attempts that you'd want.

I might be able to rework things with a non-nsq solution, but I thinkkkk this is still a legit thing; whether it belongs in core is tricky though.

tj avatar Jul 17 '14 15:07 tj

@visionmedia I haven't forgotten about this, I just wanted to get the stable release out the door to pave the way for focusing on new things...

mreiferson avatar Jul 25 '14 21:07 mreiferson

no worries! It's nothing too urgent on our end

tj avatar Jul 25 '14 21:07 tj

How about a NoAttempt method on messages and extending the protocol internally to include NOA or something similar?

On consumer shutdown, all in-flight messages could be re-marked as not attempted, and nsqd wouldn't penalize them with an attempt.

This would also be nice for bitly/go-nsq#96.
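To make the proposal concrete (this is purely hypothetical; NOA is NOT part of the nsq protocol spec), such a command might mirror the existing requeue wire format, `REQ <message_id> <timeout>\n`, just without the attempts increment on the nsqd side:

```go
package main

import "fmt"

// Purely hypothetical wire command, modeled on nsq's documented REQ frame
// ("REQ <message_id> <timeout>\n", timeout in milliseconds). A NOA verb
// does not exist; this only illustrates what twmb's suggestion could look
// like on the wire.
func noAttemptRequeue(messageID string, timeoutMs int) string {
	return fmt.Sprintf("NOA %s %d\n", messageID, timeoutMs)
}

func main() {
	fmt.Print(noAttemptRequeue("0123456789abcdef", 5000))
}
```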

twmb avatar Mar 23 '15 04:03 twmb

@twmb I don't think the specific implementation was ever a question (and your suggestion makes a lot of sense).

I think the question has always been does it fundamentally make sense to allow this?

mreiferson avatar Mar 23 '15 22:03 mreiferson

Lurking and chiming in here. At least as stated, :-1:. MaxAttempts is a client-side implementation; you're free to set it to 0 (infinite) and have your own logic for when to FIN a message. I don't think giving nsqd the ability to lie about the Attempts number is a good solution. If we knew a client could make this number inaccurate, we would have less confidence in our logs, where we store the Attempts number for both successes and failures.

@tj I realize this issue is > 1 year old and things may have changed. I'm not sure what having a shared topic/channel means in your context: do you have different message types coming through the same channel, with some internal routing to a handler inside your code? I personally would change to a single-purpose topic/channel; it's extremely useful to have operational control over a well-defined type of message.

If you're already using some custom routing, why not have custom MaxAttempts handling as well? That way nsqd doesn't change behavior others may depend on, and you can handle messages as you wish.

Regarding NoAttempt: if you're not changing the semantics of Attempts, I imagine a property containing "the number of attempts which definitely failed" (whatever that means in your context) versus the number of overall attempts. Unfortunately, adding a field would break the protocol spec. Also, I don't see any end to the metadata you might want to store about a message.

A general solution: if you don't want the OOB way of counting Attempts, set MaxAttempts = 0 and store the topic/channel/message-id combination in a data store to track any arbitrary data you need about the message.
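A sketch of that approach (in-memory here; a real deployment would use an external data store such as redis): with MaxAttempts = 0 nsqd never discards, and the client counts only "real" attempts itself, keyed by topic/channel/message ID.

```go
package main

import "fmt"

// Hypothetical external attempt tracker for the suggestion above. The
// consumer runs with MaxAttempts = 0 so nsqd never discards; the client
// decides when a message's custom budget is spent.
type attemptStore map[string]int

// recordRealAttempt is called only when the handler genuinely tried to
// process the message (not when it was parked waiting on a lock or on a
// cluster in maintenance); it reports whether the custom budget is spent.
func (s attemptStore) recordRealAttempt(topic, channel, msgID string, max int) bool {
	k := topic + "/" + channel + "/" + msgID
	s[k]++
	return s[k] >= max
}

func main() {
	s := attemptStore{}
	fmt.Println(s.recordRealAttempt("etl", "loader", "m1", 2)) // first real attempt
	fmt.Println(s.recordRealAttempt("etl", "loader", "m1", 2)) // budget spent
}
```

Messages parked on a lock would be REQed without calling recordRealAttempt, which is exactly the "REQ without altering attempts" behavior this issue asks for, just implemented client-side.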

judwhite avatar Sep 28 '15 18:09 judwhite