KV Watch iterator hangs on stale bucket connections.

Open segfaultdoc opened this issue 2 years ago • 2 comments

Make sure that these boxes are checked before submitting your issue -- thank you!

  • [x] Included below version and environment information
  • [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

NATS version (grep 'name = "nats"' Cargo.lock -A 1)

0.23.1

rustc version (rustc --version - we support Rust 1.41 and up)

1.66.0-nightly

OS/Container environment:

Mac OS

Steps or code to reproduce the issue:

When one process/thread subscribes to watch events and another process deletes the bucket, the original watch subscriber hangs on calls to `next`. In my example you'll see "received entry" printed by the watch thread until the bucket is deleted. After that it is no longer printed, which indicates the thread is hanging on the call to `next`. I even re-create the bucket after deleting it to check whether any events come through, but still no luck.

We're experiencing this same issue in our prod cluster where we run NATS 2.7.4 and this client: https://github.com/segfaultdoc/nats.rs/tree/seg-v0.18.2

  1. In one terminal, start the docker container defined in this compose file: https://github.com/segfaultdoc/nats_blocking/blob/seg/kv-bucket-stale-conns/docker-compose.yaml
  2. In another terminal, run this binary: https://github.com/segfaultdoc/nats_blocking/blob/seg/kv-bucket-stale-conns/src/main.rs with `RUST_LOG=info cargo run -- --bucket-config-path bucket.yaml --nats-url localhost:4222`

Expected result:

`next` should return `None`, since the bucket was deleted and the connection is stale.

Actual result:

`next` hangs indefinitely.

segfaultdoc, Jan 04 '23 20:01

After debugging a bit, I'm seeing that the subscription does not get removed from the client's internal State::ReadState::Subscriptions map. However, the server stops sending messages for the subscription id.
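
The mechanism described above can be modelled with the standard library alone. This is only a sketch and not the nats.rs implementation: it assumes the watch iterator is fed by a channel whose sending half lives in the client's subscription map. `Subscriptions`, `Next`, and `next_with_timeout` are hypothetical names invented for the illustration. As long as the stale entry stays in the map, the sender stays alive and a blocking receive never returns. Once the entry is removed, the receiver observes a disconnect, which is the point at which `next` could return `None`.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

/// Hypothetical stand-in for the client's internal subscription map:
/// subscription id -> sending half of the channel feeding the iterator.
struct Subscriptions {
    senders: HashMap<u64, Sender<String>>,
}

impl Subscriptions {
    fn new() -> Self {
        Subscriptions { senders: HashMap::new() }
    }

    /// Register a subscription and hand the receiving half to the watcher.
    fn subscribe(&mut self, sid: u64) -> Receiver<String> {
        let (tx, rx) = channel();
        self.senders.insert(sid, tx);
        rx
    }

    /// Simulate the server delivering a message for `sid`.
    /// Returns false if the subscription is unknown or the receiver is gone.
    fn deliver(&self, sid: u64, msg: &str) -> bool {
        match self.senders.get(&sid) {
            Some(tx) => tx.send(msg.to_string()).is_ok(),
            None => false,
        }
    }

    /// Dropping the map entry drops the sender, which disconnects the receiver.
    fn remove(&mut self, sid: u64) {
        self.senders.remove(&sid);
    }
}

/// Outcome of one bounded wait on the watcher side.
#[derive(Debug, PartialEq)]
enum Next {
    Entry(String),
    /// Sender dropped: the iterator can end (the `None` case).
    Closed,
    /// Sender still alive but silent: a plain blocking `recv` would hang here.
    TimedOut,
}

/// Wait for the next entry, but give up after `timeout` instead of blocking forever.
fn next_with_timeout(rx: &Receiver<String>, timeout: Duration) -> Next {
    match rx.recv_timeout(timeout) {
        Ok(entry) => Next::Entry(entry),
        Err(RecvTimeoutError::Disconnected) => Next::Closed,
        Err(RecvTimeoutError::Timeout) => Next::TimedOut,
    }
}
```

In this model, a watcher that calls `next_with_timeout` gets `TimedOut` while the stale entry is still in the map and `Closed` right after `remove` is called. Whether the real client should remove the entry when the bucket is deleted or expose a timeout to callers is a design question this sketch does not answer.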

segfaultdoc, Jan 04 '23 22:01

NOTE: The same issue occurs in the context of a super cluster. If cluster A loses its connection to cluster B (for example, because the load balancer in B is brought down), then all calls to watch in A hang.

segfaultdoc, Jan 06 '23 15:01