
Redis consumer does not reconnect after failure to ack message

Open mbowersox opened this issue 4 years ago • 2 comments

It appears that the Redis consumer does not reconnect after an ack failure. We believe the ack failure occurred because the stream was dropped from the Redis instance after exceeding its maximum cache size. In-flight messages could not be acked because the stream key no longer existed. The producer for this stream then created the key again, but the consumer never recovered.

mbowersox avatar Nov 12 '20 16:11 mbowersox

couple of observations:

  • the stream was lost because it exceeded its MAXLEN setting
  • the CC application reported errors acking messages
  • the CC application did not report any errors reading from the non-existent stream
  • the stream was recreated at some point, but the CC application still did not receive any new messages until it was restarted

things to investigate:

  • what's the behavior of the redis client when we read from a non-existent stream? does it produce an error or just return "0 messages"? is there a difference between the stream not existing to begin with vs. the stream disappearing mid-way?
  • why did the CC application not receive new messages after the stream was recreated? does it have to do with the consumer group? does it need to be recreated as well / rejoined by the application?
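
A possible starting point for the first question: as far as I know, Redis answers XREADGROUP with an error whose text starts with NOGROUP when either the key or the consumer group is missing, while a plain XREAD on a missing key just returns no data. The sketch below classifies a read error on that basis. `classifyReadError` is a hypothetical helper, not part of Cookie Cutter, and it assumes the client library exposes the server's error text in `message`.

```typescript
// Hypothetical helper, not part of Cookie Cutter or any Redis client library.
// Assumption: the client surfaces the server's error text in `Error.message`.
type ReadFailure = "missing-stream-or-group" | "other";

function classifyReadError(err: unknown): ReadFailure {
  const message = err instanceof Error ? err.message : String(err);
  // Redis prefixes this error with NOGROUP whether the key or the group is gone.
  return message.startsWith("NOGROUP") ? "missing-stream-or-group" : "other";
}
```

Since the same error covers both the missing key and the missing group, it does not tell us which of the two disappeared, and it would not explain why no read errors were reported here.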

options to fix this issue:

  • detect when streams don't exist and throw an error from the input source which will terminate the CC application (brute force fix)
  • gracefully handle rejoining consumer groups for re-created streams (in case that turns out to be a problem)
  • streams are already created as part of the initialization logic. when the application detects that a stream is gone it could just re-create it
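
A rough sketch of the last two options combined, assuming a read against the missing stream surfaces a NOGROUP error. `SendCommand` is a hypothetical stand-in for whatever the Redis client uses to send a raw command, and `ensureGroup` / `readWithRecovery` are illustrative names, not existing Cookie Cutter functions. XGROUP CREATE with MKSTREAM creates the stream if the key is missing, and BUSYGROUP is the error Redis returns when the group already exists.

```typescript
// Hypothetical abstraction: sends one raw Redis command and resolves with its reply.
type SendCommand = (args: string[]) => Promise<unknown>;

function errorText(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

function groupCreateArgs(stream: string, group: string, startId: string): string[] {
  // MKSTREAM makes Redis create an empty stream when the key does not exist.
  return ["XGROUP", "CREATE", stream, group, startId, "MKSTREAM"];
}

async function ensureGroup(
  send: SendCommand, stream: string, group: string, startId: string,
): Promise<void> {
  try {
    await send(groupCreateArgs(stream, group, startId));
  } catch (err) {
    // BUSYGROUP means the group already exists, which is fine here.
    if (!errorText(err).startsWith("BUSYGROUP")) throw err;
  }
}

async function readWithRecovery(
  send: SendCommand, stream: string, group: string, consumer: string, count: number,
): Promise<unknown> {
  const readArgs = [
    "XREADGROUP", "GROUP", group, consumer,
    "COUNT", String(count), "STREAMS", stream, ">",
  ];
  try {
    return await send(readArgs);
  } catch (err) {
    if (!errorText(err).startsWith("NOGROUP")) throw err;
    // Assumption: start at "0" so entries the producer already wrote
    // to the re-created stream are delivered instead of skipped.
    await ensureGroup(send, stream, group, "0");
    return send(readArgs);
  }
}
```

Messages that were pending in the old group are gone with the old key, so this only restores delivery of new entries.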

sklose avatar Nov 12 '20 17:11 sklose

In this case I think if the app had crashed when that first ack failed everything would have self-healed. Propagating that exception might be a decent enough short term fix.
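
Something like the following is what I have in mind, as a sketch only. `SendCommand` and `ackOrThrow` are hypothetical names, not the current Cookie Cutter code. It relies on XACK replying with the number of entries it actually acknowledged, so a reply lower than the number of IDs sent is treated as fatal and left to the caller to propagate.

```typescript
// Hypothetical abstraction: sends one raw Redis command and resolves with its reply.
type SendCommand = (args: string[]) => Promise<unknown>;

function assertAllAcked(stream: string, requested: number, reply: unknown): void {
  // XACK replies with the count of entries that were acknowledged.
  const acknowledged = typeof reply === "number" ? reply : Number(reply);
  if (acknowledged !== requested) {
    throw new Error(
      `XACK on '${stream}' acknowledged ${acknowledged} of ${requested} messages`,
    );
  }
}

async function ackOrThrow(
  send: SendCommand, stream: string, group: string, ids: string[],
): Promise<void> {
  // A rejected promise from the client propagates to the caller unchanged.
  const reply = await send(["XACK", stream, group, ...ids]);
  assertAllAcked(stream, ids.length, reply);
}
```

A lower count can also mean an entry was already acknowledged or was never pending, so this check is deliberately strict.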

chrnola avatar Nov 12 '20 18:11 chrnola