
fix: support DiscardNew policy for Jetstream streams

Open QuentinFAIDIDE opened this issue 10 months ago • 6 comments

Might fix #1551 and #1554. We would need to ensure that there are no adverse consequences in how we handle the new write error that would occur in the surge scenario.
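For context, this is roughly what the target stream configuration looks like with the nats.go client: WorkQueue retention plus a DiscardNew discard policy, so that a publish into a full stream is rejected instead of silently evicting older messages. This is only an illustrative sketch; the stream name, subjects, and limits below are made up and are not the actual numaflow buffer settings.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// WorkQueue retention with DiscardNew: when the stream hits MaxMsgs,
	// new publishes are rejected rather than older messages being dropped.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "example-isb-buffer", // hypothetical name, not numaflow's buffer naming scheme
		Subjects:  []string{"example-isb-buffer.*"},
		Retention: nats.WorkQueuePolicy,
		Discard:   nats.DiscardNew,
		MaxMsgs:   30000,
		Storage:   nats.FileStorage,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```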

QuentinFAIDIDE avatar Apr 01 '24 16:04 QuentinFAIDIDE


Something I currently experience with the "surge pipeline situation" with DiscardNew that may or may not be what @yhl25 is referring to:

  • After letting the sink fail for a moment, I let it work again and it now has a decent Ack rate.
  • The buffer before the sink stays at 30k, and the CPU usage for the message-variator-udf is high while the logs keep repeating:
2024/04/03 18:51:32 | ERROR | {...,"msg":"Retrying failed messages","pipeline":"super-odd-8","vertex":"msg-variator","protocol":"uds-grpc-map-udf","errors":{"nats: maximum messages exceeded":31}...}

The {"nats: maximum messages exceeded":31} seems to be the new jetstream side "buffer full" error thrown because we now use DiscardNew. The number of these errors tends to slowly decrease and then go up again, indicating it's writing some, and then some other datum arrive from the source .

  • The buffer the source is writing to shows the same down-and-up pattern, but with a bufferFull! error, which is the normal error we usually get when a buffer is full.

Overall, the number of messages stays nearly stable (it decreases at a very slow rate on the source buffer) because the huge pile of retries tends to immediately refill any drained capacity. So either this is what gave the impression of undelivered messages staying in the pipe, or I still haven't reproduced it. I'm going to let it sit for some time and confirm that the retries empty out with no losses. I'll keep you updated.
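For illustration, here is a rough sketch of how a writer could treat that rejection as retryable rather than fatal. The error matching and the helper below are hypothetical; numaflow's actual buffer writer handles retries through its own ISB write path.

```go
package main

import (
	"errors"
	"log"
	"strings"
	"time"

	"github.com/nats-io/nats.go"
)

// publishWithRetry retries a publish while the stream is full, i.e. while
// Jetstream rejects the write with the "maximum messages exceeded" error
// produced by DiscardNew. Hypothetical helper, for illustration only.
func publishWithRetry(js nats.JetStreamContext, subject string, data []byte) error {
	backoff := 50 * time.Millisecond
	for {
		_, err := js.Publish(subject, data)
		if err == nil {
			return nil
		}

		var apiErr *nats.APIError
		full := errors.As(err, &apiErr) &&
			strings.Contains(apiErr.Description, "maximum messages exceeded")
		if !full {
			return err // some other failure; surface it to the caller
		}

		// Buffer is full: back off and retry until the reader drains it.
		log.Printf("buffer full, retrying publish: %v", err)
		time.Sleep(backoff)
		if backoff < time.Second {
			backoff *= 2
		}
	}
}
```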

QuentinFAIDIDE avatar Apr 03 '24 19:04 QuentinFAIDIDE

I added the doc changes to remove the retention policy parameter and default to WorkQueue as discussed. I was not able to reproduce the "stuck messages" issue yet. My guess is that one of the following is true:

  1. The new issue is due to the new "buffer over capacity" error from Jetstream, which is the only behaviour change since we activated the WorkQueue/DiscardNew setting. The issue would then be a subset of the already-fixed lost-datum issue, because the full-buffer management (a pre-write usage check, sketched just after this list) is supposed to prevent that error from ever being returned in the first place.
  2. The new issue lies in Jetstream itself (feels unlikely, but I may be wrong).
  3. The new issue is due to some other Jetstream behaviour change with WorkQueue/DiscardNew that I am not aware of. Is anything else likely to have changed besides the new error?
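As a sketch of what that pre-write check could look like (the helper and threshold below are hypothetical; numaflow's real buffer-usage accounting lives in its ISB writer):

```go
package bufferguard

import "github.com/nats-io/nats.go"

// isBufferFull checks whether a stream has reached a usage threshold, so the
// writer can stop publishing before Jetstream itself returns the
// "maximum messages exceeded" error. Hypothetical sketch, not numaflow code.
func isBufferFull(js nats.JetStreamContext, stream string, maxMsgs int64, usageLimit float64) (bool, error) {
	info, err := js.StreamInfo(stream)
	if err != nil {
		return false, err
	}
	// Compare the current message count against a fraction of the configured limit.
	return float64(info.State.Msgs) >= usageLimit*float64(maxMsgs), nil
}
```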

What do you think we should do? I've tried to reproduce the issue a few times with no luck; I'm going to retry, but let me know your thoughts.

QuentinFAIDIDE avatar Apr 05 '24 13:04 QuentinFAIDIDE

As per https://github.com/nats-io/nats-server/issues/5148#issuecomment-2007087670, the issue seems to have been resolved in 2.10.12.

vigith avatar Apr 05 '24 16:04 vigith

So what's the plan? Do we change the new "compatible" Jetstream ConfigMap to specify only the new version with the fix, or do we wait for someone to try to reproduce this error enough times to convince us that it's fixed?

QuentinFAIDIDE avatar Apr 10 '24 06:04 QuentinFAIDIDE

So what's the plan? Do we change the new "compatible" Jetstream ConfigMap to specify only the new version with the fix, or do we wait for someone to try to reproduce this error enough times to convince us that it's fixed?

We can make this change a configurable option, with the default set to what is currently being used, since that is battle-tested. Eventually, this could become the default, but before we do that, we need to make sure it works as expected after a decent amount of time running in production.
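A minimal sketch of what such an option could look like, with the function and flag names made up for illustration (the real change would go through numaflow's ISB stream configuration):

```go
package isbconfig

import "github.com/nats-io/nats.go"

// buildStreamConfig keeps the current, battle-tested Limits/DiscardOld
// behaviour as the default and lets callers opt in to WorkQueue/DiscardNew.
// Hypothetical sketch; names and parameters are illustrative.
func buildStreamConfig(name string, subjects []string, maxMsgs int64, useWorkQueue bool) *nats.StreamConfig {
	cfg := &nats.StreamConfig{
		Name:      name,
		Subjects:  subjects,
		MaxMsgs:   maxMsgs,
		Retention: nats.LimitsPolicy, // current default behaviour
		Discard:   nats.DiscardOld,
	}
	if useWorkQueue {
		cfg.Retention = nats.WorkQueuePolicy
		cfg.Discard = nats.DiscardNew
	}
	return cfg
}
```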

vigith avatar Apr 10 '24 16:04 vigith

https://github.com/nats-io/nats-server/pull/5270 seems to fix the problem. The 2.10.14 Jetstream release looks very promising for WorkQueue.

vigith avatar Apr 12 '24 02:04 vigith

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 54.06%. Comparing base (e7c32c1) to head (b42786a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   54.31%   54.06%   -0.26%     
==========================================
  Files         288      288              
  Lines       28301    28297       -4     
==========================================
- Hits        15371    15298      -73     
- Misses      11994    12063      +69     
  Partials      936      936              


codecov[bot] avatar Jul 31 '24 19:07 codecov[bot]

Closed by https://github.com/numaproj/numaflow/pull/1884.

whynowy avatar Aug 06 '24 15:08 whynowy