
Calling `.nack()` disables exponential backoff and creates log jams

Open · WesCossick opened this issue 3 years ago

Environment details

  • OS: Debian
  • Node.js version: 16.13.1
  • npm version:
  • @google-cloud/pubsub version: 2.18.1

Steps to reproduce

  1. Create a subscription with an exponential backoff (i.e., using retryPolicy.minimumBackoff and retryPolicy.maximumBackoff)
  2. Receive a message
  3. Call .nack()
  4. See that the message is delivered again nearly immediately
  5. Call .nack() again
  6. See that the message is delivered again nearly immediately
  7. Call .nack() again
  8. See that the message is delivered again nearly immediately
  9. Call .nack() again
  10. See that the message is always delivered again nearly immediately and never respects the minimumBackoff or exponential backoff
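
Roughly, the reproduction looks like this (a minimal sketch, not code from the original report; the topic/subscription names, backoff values, and ack deadline are placeholder choices and assume default application credentials):

```ts
import {PubSub} from '@google-cloud/pubsub';

const pubsub = new PubSub();

async function main() {
  // Step 1: subscription with an exponential backoff retry policy.
  const [topic] = await pubsub.createTopic('repro-topic');
  const [subscription] = await topic.createSubscription('repro-sub', {
    ackDeadlineSeconds: 10,
    retryPolicy: {
      minimumBackoff: {seconds: 10},
      maximumBackoff: {seconds: 600},
    },
  });

  // Steps 2-10: every delivery is nacked. If the backoff were honored,
  // you'd expect at least ~10s between deliveries; instead, redelivery
  // is nearly immediate.
  subscription.on('message', message => {
    console.log(`delivery of ${message.id} at ${new Date().toISOString()}`);
    message.nack();
  });

  await topic.publish(Buffer.from('test'));
}

main().catch(console.error);
```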

Other notes

The immediate problem this creates is that if the subscription's maxMessages is set to 1, other messages get log-jammed behind the one that's being .nack()'d.

~~This can be partially alleviated by using the optional delay with .nack(), like .nack(60), but this still wouldn't honor the exponential backoff unless the engineer built their own backoff mechanism to pass .nack() an ever-increasing value each time. It'd be a bit disappointing to have to do that when that's a feature already baked into Google Cloud Pub/Sub.~~ EDIT: This feature has been removed, leaving no way to work around this.

Not calling .nack() creates its own problems, especially in light of the client's automatic extension of ack deadlines. See: https://stackoverflow.com/a/59128326.

A similar issue can be found here, but it's slightly different: https://github.com/googleapis/nodejs-pubsub/issues/119.

WesCossick · Feb 01 '22 20:02

Thinking about this more, simply increasing maxMessages wouldn't solve the issue either: if you increased it to 5 and there were 5 messages being .nack()'d as soon as they were received, you'd still create a log jam. So the underlying issue really has nothing to do with maxMessages.

WesCossick · Feb 01 '22 20:02

Just stumbled across documentation that discusses using .nack() with exponential backoff:

> If messages are in a batch, Pub/Sub starts the exponential backoff when one of the following occurs:
>
>   • The subscriber sends a negative acknowledgement for every message in the batch.

WesCossick · Feb 01 '22 20:02

The only solution to this that I've found, and it's not ideal, is to:

  1. Ensure the retryPolicy.minimumBackoff is a few seconds greater than the ackDeadline
  2. Set the flowControl.maxExtensionMinutes to 0
  3. Never .nack() a message
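
A rough sketch of this arrangement (my own illustration, not from the library docs; the topic/subscription names, the 60-second deadline, and the handleMessage helper are placeholders, and it assumes the flowControl.maxExtensionMinutes subscriber option described above):

```ts
import {PubSub} from '@google-cloud/pubsub';

const pubsub = new PubSub();

async function main() {
  const ackDeadlineSeconds = 60;

  // 1. minimumBackoff a few seconds greater than the ack deadline.
  //    (Assumes the topic already exists.)
  await pubsub.topic('work-topic').createSubscription('work-sub', {
    ackDeadlineSeconds,
    retryPolicy: {
      minimumBackoff: {seconds: ackDeadlineSeconds + 5},
      maximumBackoff: {seconds: 600},
    },
  });

  // 2. Disable the client's automatic ack-deadline extension.
  const subscription = pubsub.subscription('work-sub', {
    flowControl: {maxExtensionMinutes: 0},
  });

  // 3. Never nack: ack on success, otherwise let the ack deadline expire
  //    so that redelivery goes through the retry policy's backoff.
  subscription.on('message', async message => {
    try {
      await handleMessage(message.data); // placeholder for real processing
      message.ack();
    } catch {
      // intentionally no message.nack()
    }
  });
}

async function handleMessage(data: Buffer) {
  // ...
}

main().catch(console.error);
```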

There are several downsides to this approach, though:

  1. Since the minimumBackoff has a maximum value of 600 seconds, this solution won't work if the ackDeadline is larger than that
  2. You can't have a minimumBackoff less than ackDeadline, even if that's desired
  3. If you want to use ack deadline extensions, you can't, since you have to disable that feature
  4. Since you can't call .nack() anymore, subscribers that can't process a message get knocked out-of-commission while they wait for the ackDeadline to expire, even when they know they're ready to move on to other messages

WesCossick · Feb 01 '22 21:02

I think we had a discussion about this before and that had a "works as expected" sort of result, but I'll try to get some attention on this again. @kamalaboulhosn ?

feywind · Apr 03 '23 19:04

If the exponential backoff settings are not being respected, then a support case should be entered so that more investigation can be done on the server side.

kamalaboulhosn · Apr 04 '23 15:04

@kamalaboulhosn was a support case entered? I do have this issue. I want to nack a message so it gets retried later, but I also want it to respect the exponential backoff. Right now it will just instantly retry it, which is not the desired behaviour.

vmorgado · May 03 '23 10:05

@vmorgado if you are having this issue, you should enter your own support case.

kamalaboulhosn · May 03 '23 15:05

Follow-up on this issue should be via a support case.

kamalaboulhosn · Jun 20 '23 17:06

@kamalaboulhosn Can you clarify why a bug with this @google-cloud/pubsub package needs to be resolved by Google Cloud Platform's support? I wasn't aware they had the ability to fix bugs in this package. I will note that there are plenty of other open issues for bugs in this package.

WesCossick · Jun 20 '23 18:06

@WesCossick Adherence to the retry settings is a server-side behavior, not a client-side behavior, which is why this needs to be investigated via a support case.

kamalaboulhosn · Jun 20 '23 18:06

@kamalaboulhosn I'd be a little surprised if this were a problem with the server side of Google Pub/Sub, since how nack and exponential backoff should work together is clearly documented, and the Node.js client is the only client I could find with a report of nack and exponential backoff not working in accordance with Google's own documentation. If it were a problem on Google's server end, I would expect other client libraries to have received similar reports.

In any case, submitting a technical support request on our end requires paying Google a fee, and paying Google to help Google debug its own software isn't something we're willing to do. If you're confident it's not a client issue, then I believe the correct move would be to open an issue in the Google Pub/Sub issue tracker.

WesCossick · Jun 20 '23 19:06

@WesCossick I can assure you that backing off is a behavior in the server, not the client. Even when the nack is sent by the client, the backoff that occurs before the next delivery happens on the server, not the client. I can't say with 100% certainty that the issue is in the server, but in order to debug further, we need to ask for user-specific information and look up information on the backend, and doing that requires a support case. It might be the combination of properties on the subscription, it may have to do with delivery details such as throughput or how messages are generally being processed by the user code, or it could be several other things. But in order to say anything for certain, we need more information that cannot be requested without a support case. Creating an issue in the public issue tracker will not be sufficient, as we need to ask for non-public information.

kamalaboulhosn · Jun 20 '23 19:06

@kamalaboulhosn The backoff itself is a server-side responsibility, true. But this client has a number of nuances and complexities to how it operates, and there's no issue with how exponential backoff works as long as you don't call .nack() in this client. So that leads me to suspect it's a client issue. Here are a few nuances that might impact how nack + exponential backoff work together:

  • This client doesn't actually send a "nack" request; instead, it sets the ack deadline to 0 and then removes the message from its own inventory.
  • This client quietly performs automatic message extensions by default.
  • This client used to support a "delayed" nack, and had been built around that concept.
  • This client manages batching for you and has a concept of leasing messages.
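
For reference, the first point above corresponds to something like the following at the API level (a rough sketch using the auto-generated v1 SubscriberClient; nackByHand, the subscription path, and the ack IDs are hypothetical names, not anything from this client's code):

```ts
import {v1} from '@google-cloud/pubsub';

const subscriberClient = new v1.SubscriberClient();

// There is no dedicated "nack" RPC; a nack is a ModifyAckDeadline request
// that sets the deadline to 0 seconds, making the message immediately
// eligible for redelivery.
async function nackByHand(subscriptionPath: string, ackIds: string[]) {
  await subscriberClient.modifyAckDeadline({
    subscription: subscriptionPath, // e.g. 'projects/my-project/subscriptions/my-sub'
    ackIds,
    ackDeadlineSeconds: 0,
  });
}
```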

Seeing as you work for Google, if you're able to open a technical support request on our behalf, I'd be happy to provide more information, including non-public information. But our company isn't willing to pay Google to help Google debug its own services. We operate at a large enough scale that working around this bug isn't particularly problematic for our company. But it could be a significant performance bottleneck for other companies, so I would expect Google would be interested in resolving the problem.

WesCossick · Jun 20 '23 20:06

> This client doesn't actually send a "nack" request; instead, it sets the ack deadline to 0 and then removes the message from its own inventory.

Maybe that's because there is no "nack" API? I think "nack" just means zeroing the ack deadline.

reith · Jun 20 '23 21:06

@WesCossick Looking at the Pub/Sub API, there is no special "nack" other than zeroing the ack deadline. So the issue might truly be on the server side.

pksunkara · Jul 22 '23 21:07

@pksunkara Personally, I would conclude the opposite. Since the server side doesn't actually offer a nack, calling .nack() is purely a client-side thing; so if calling .nack() causes problems, I would conclude that the source of those problems lies with the client.

WesCossick · Jul 24 '23 15:07

The server side does offer nack, by defining it as modAck(0).

pksunkara · Jul 24 '23 15:07