nats-server icon indicating copy to clipboard operation
nats-server copied to clipboard

Duplicates message received while consuming with leafnode connection

Open asambres-form3 opened this issue 2 years ago • 7 comments

Defect

Messages are delivered twice to the same client.

The NATS client receives each message 2 times. Both messages are delivered almost at the same time (<10ms delta) and are identical (Same headers, sequence id, body, delivery count)

  • We experience this issue on all our environments
  • We are able to reproduce the issue in a local kind cluster
  • Unfortunately we are not able to make up a deterministic procedure to create the issue
  • We need to restart servers multiple times in a sequence we could not understand for the problem to appear

Versions of nats-server and affected client libraries used:

nats-server v2.8.4
nats.go v1.16.0

OS/Container environment:

Running nats in Kubernetes cluster with official helm chart.

Steps or code to reproduce the issue:

Setup diagram: jetstream-issue

Given:

  • 2 nats clusters (A and B) composed of 2 nats-server instances each with jetstream running on all instances
  • Cluster A establishes leaf node connections to the cluster B
  • A go service (Application 1, Application 2) using the client lib version [email protected] is connected to the cluster B and consumes messages from a stream that lives in the cluster A (thus via a leafnode connection).

When:

  • an unexpected sequence of events happen, nats server restarts, network partitions...

Then:

  • the cluster enters incoherent state and doesn't recover (all nodes running, no network partition)
  • The application instance receives each messages 2 times
  • The duplicated messages are received in the same millisecond on the same instance
  • Restarting the application does not resolve the issue
  • The issue does not self resolve with time
  • Messages are systematically and constantly received 2 times.
  • Restarting the nats-server that sends the duplicates (the one which the client is connected to) is the ONLY way to stop that problem

Logs from a NATS server that sends duplicates (duplicates start at sequence 104): nats-0-trace.log

Expected result:

Each message should be received once from the same consumer or subscription.

Actual result:

The go client subscribing to the remote consumer receives duplicated message.

asambres-form3 avatar Jun 14 '22 15:06 asambres-form3

This situation seems to have quite a few moving pieces, some we know and some are probably not known to us. Recreating this and trying to reproduce could take several hours and we might not see anything in our setup. So, I would suggest we jump on a Zoom call and triage. Are you able to reproduce predictably?

derekcollison avatar Jun 14 '22 17:06 derekcollison

@derekcollison Yes we can reproduce semi predictably. If that works for you we can prepare the cluster to be in that situation before we jump on a call. When would you have some availability?

mlornac avatar Jun 15 '22 07:06 mlornac

If extremely important on your side I could do early PT tomorrow, otherwise early PT Monday as I am traveling to customers starting mid-day tomorrow.

Shoot me an email, [email protected]

derekcollison avatar Jun 15 '22 14:06 derekcollison

Sounds like ensuring the cluster names are unique might have resolved? ok to close?

We are looking at ways the system can help out when we detect this situation.

derekcollison avatar Jun 30 '22 18:06 derekcollison

We have the same problem with a leafnode and a source copy on the same account, for each message, two appears on the origin stream and the destination stream. Mirror on the other direction works properly.

some insights on leaf cluster:

asdasdasdasdsad

[#42] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 5 / stream seq: 5
asdasdasdasdsad

[#43] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 6 / stream seq: 6
asdasdasdasdsad

[#44] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 5 / stream seq: 5
asdasdasdasdsad

[#45] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 6 / stream seq: 6
asdasdasdasdsad

some insight on lead cluster:

[#18] Received on "supervisor.processing-stats.europe-1"
asdasdasdasdsad

[#19] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 5 / stream seq: 5
asdasdasdasdsad

[#20] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 6 / stream seq: 6
asdasdasdasdsad

I promise I only sent one message:

nats --context=global_europe_1 pub supervisor.processing-stats.europe-1
23:50:38 Reading payload from STDIN
asdasdasdasdsad
23:50:41 Published 16 bytes to "supervisor.processing-stats.europe-1"

Main Stream config:

{
  "name": "supervisor-europe-1",
  "subjects": [
    "supervisor.processing-stats.europe-1",
    "supervisor.processor-events.europe-1",
    "supervisor.search-accounting.europe-1"
  ],
  "retention": "limits",
  "max_consumers": -1,
  "max_msgs_per_subject": -1,
  "max_msgs": 1000000,
  "max_age": 31536000000000000,
  "max_msg_size": 1048576,
  "storage": "file",
  "discard": "old",
  "num_replicas": 3,
  "duplicate_window": 600000000000
}

Source stream config:

{
  "name": "agreggate-supervisor-europe-1",
  "retention": "limits",
  "max_consumers": -1,
  "max_msgs_per_subject": 0,
  "max_msgs": -1,
  "max_bytes": -1,
  "max_age": 0,
  "max_msg_size": -1,
  "storage": "file",
  "discard": "old",
  "num_replicas": 3,
  "duplicate_window": 120000000000,
  "sources": [
    {
      "name": "supervisor-europe-1",
      "external": {
        "api": "$JS.europe-1-global-leaf.API",
        "deliver": ""
      }
    }
  ],
  "sealed": false,
  "deny_delete": false,
  "deny_purge": false,
  "allow_rollup_hdrs": false
}

bloodbare avatar Aug 12 '22 21:08 bloodbare

@bloodbare Would you be able to send the configuration file of all server involved and the configuration of the JS consumer you use (if any). I see the subscription output, but not sure how you started the sub (where it connects to, what command line params you use, etc..). In short anything that you could provide to me so I can reproduce the setup and see the issue. Thanks!

kozlovic avatar Sep 02 '22 14:09 kozlovic

I saw something similar. Our setup is mqtt-> jetstream on leafnode->sourcing jetstream and When I do plain nats subsribe to "msg.originating.in.mqtt" I got duplicate messages. I did not checked jetstreams messages detaily so I'm not sure if they exist there too.

tvojacek avatar Sep 02 '22 17:09 tvojacek

I also found the same problem in use, and the configuration is basically similar to yours. image When I am linking cluster A pub message, cluster B sub message, if not linked to the leader of cluster B, then there will be duplicate messages

chenchunping avatar Oct 31 '22 11:10 chenchunping

What is the stated cluster name for A_1 and A_2? and likewise B_1 and B-2?

Also if involving JetStream they need to have unique and distinct domains.

Possible to share the configs for all servers as @kozlovic mentioned?

derekcollison avatar Oct 31 '22 21:10 derekcollison

What is the stated cluster name for A_1 and A_2? and likewise B_1 and B-2?

Also if involving JetStream they need to have unique and distinct domains.

Possible to share the configs for all servers as @kozlovic mentioned?

When sub directly links the cluster, there is a certain chance of receiving duplicate data (if the link of this client is load balanced to the normal one node will send and receive normally, otherwise it will be duplicated) and so will jetstream In the case I create the link inside the loop and pub/sub, and close the link at the end of each loop, the test code is like this.

for i := 0; i < 50; i++ {
	subject := "iot.b.topic"
	producer := &natsServer{
		user:          "usera",
		pass:          "usera",
		natsServerurl: "nats://127.0.0.1:14223",
	}
	consumer := &natsServer{
		user:          "userb",
		pass:          "userb",
		natsServerurl: "nats://127.0.0.1:24222,nats://127.0.0.1:24223",
	}
	// pub ends the message and exits the sub listener
	pubEndSingle := make(chan bool, 1)
        // sub fully exit, you can exit the main program
	subEndSingle := make(chan bool, 1)
        // sub listener starts and notifies pub that it is ready to run
	subStarSingle := make(chan bool, 1)
	var messageCount = 10
	go natsSub(subject, *consumer, messageCount, pubEndSingle, subEndSingle, subStarSingle)
	<-subStarSingle
	natsPub(subject, *producer, messageCount)
	pubEndSingle <- true
	<-subEndSingle
}

Result:

2022/11/01 10:27:34 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:34.704'
client:15 receive message hello ->2022-11-01 10:27:34.704
client:15 receive message hello ->2022-11-01 10:27:34.704
2022/11/01 10:27:36 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:36.805'
client:15 receive message hello ->2022-11-01 10:27:36.805
client:15 receive message hello ->2022-11-01 10:27:36.805
>>>>>>>>>>>>>  close conn,clientId 15,err:<nil>: pub 10 recv 20, repect:true

2022/11/01 10:27:38 client id :18 Listening on [iot.b.topic]
2022/11/01 10:27:38 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:38.989'
client:18 receive message hello ->2022-11-01 10:27:38.989
2022/11/01 10:27:41 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:41.090'
client:18 receive message hello ->2022-11-01 10:27:41.090
2022/11/01 10:27:43 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:43.192'
client:18 receive message hello ->2022-11-01 10:27:43.192

Here is my configuration: 

A_1.conf

http_port : 18222
cluster : {
  name : A
  port : 15333
  routes : [nats-router://127.0.0.1:15333,nats-router://127.0.0.1:15334]
}
leafnodes : {
  port : 14333
  remotes : [{
    account : A
    urls : [nats-leaf://_AConnUser_:[email protected]:24333,nats-leaf://_AConnUser_:[email protected]:24334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/A_1"
  domain: A
}
port : 14222
server_name : A_1

accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
      {user: usera, password: usera, permissions: {
        publish:{
          allow:["iot.b.topic"]
        },
        subscribe:{
          allow:["iot.a.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"B", subject:"iot.a.topic"}}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
    ]
    exports:[
      {stream: "iot.a.topic", accounts: ["A"]}
    ]
  }
}

=========================================================

A_2.conf

http_port : 18223
cluster : {
  name : A
  port : 15334
  routes : [nats-router://127.0.0.1:15333,nats-router://127.0.0.1:15334]
}
leafnodes : {
  port : 14334
  remotes : [{
    account : A
    urls : [nats-leaf://_AConnUser_:[email protected]:24333,nats-leaf://_AConnUser_:[email protected]:24334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/A_2"
  domain: A
}
port : 14223
server_name : A_2

accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
      {user: usera, password: usera, permissions: {
        publish:{
          allow:["iot.b.topic"]
        },
        subscribe:{
          allow:["iot.a.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"B", subject:"iot.a.topic"}}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
    ]
    exports:[
      {stream: "iot.a.topic", accounts: ["A"]}
    ]
  }
}

=========================================================

B_1.conf

http_port : 28222
cluster : {
  name : B
  port : 25333
  routes : [nats-router://127.0.0.1:25333,nats-router://127.0.0.1:25334]
}
leafnodes : {
  port : 24333
  remotes : [{
    account : B
    urls : [nats-leaf://_BConnUser_:[email protected]:14333,nats-leaf://_BConnUser_:[email protected]:14334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/B_1"
  domain: B
}
port : 24222
server_name : B_1

accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
    ]
    exports:[
      {stream: "iot.b.topic", accounts: ["B"]}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
      {user: userb, password: userb, permissions: {
        publish:{
          allow:["iot.a.topic"]
        },
        subscribe:{
          allow:["iot.b.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"A", subject:"iot.b.topic"}}
    ]
  }
}

=========================================================

B_2.conf

http_port : 28223
cluster : {
  name : B
  port : 25334
  routes : [nats-router://127.0.0.1:25333,nats-router://127.0.0.1:25334]
}
leafnodes : {
  port : 24334
  remotes : [{
    account : B
    urls : [nats-leaf://_BConnUser_:[email protected]:14333,nats-leaf://_BConnUser_:[email protected]:14334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/B_2"
  domain: B
}
port : 24223
server_name : B_2

accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
    ]
    exports:[
      {stream: "iot.b.topic", accounts: ["B"]}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
      {user: userb, password: userb, permissions: {
        publish:{
          allow:["iot.a.topic"]
        },
        subscribe:{
          allow:["iot.b.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"A", subject:"iot.b.topic"}}
    ]
  }
}

Thank you sincerely for your help!

chenchunping avatar Nov 01 '22 02:11 chenchunping

What is the stated cluster name for A_1 and A_2? and likewise B_1 and B-2?

Also if involving JetStream they need to have unique and distinct domains.

Possible to share the configs for all servers as @kozlovic mentioned?

Is it not possible to create a bi-directional leafconn using an unavailable account? but I tested and found that the bi-directional linking of leafconn is allowed when I start the standalone

If I remove the import and export of the iot.a.topic in cluster A, the pub/sub of iot.b.topic is normal and there are no duplicate messages

chenchunping avatar Nov 02 '22 02:11 chenchunping

@chenchunping Thank you for providing the configuration files and step to reproduce. I am going to set that up and see if I can reproduce and make sense of what is going on. It is for sure a pretty complex setup (leafs to each other and accounts imports/exports). I will update this ticket as soon as I can.

kozlovic avatar Nov 02 '22 14:11 kozlovic

@chenchunping Just an update that I have made some progress and will resume work tomorrow.

kozlovic avatar Nov 03 '22 00:11 kozlovic

Just an update that I have made some progress and will resume work tomorrow

Thank you for all your help

chenchunping avatar Nov 03 '22 15:11 chenchunping

Note that PR #3604 addresses the issue reported by @chenchunping (with the provided configuration files and steps to reproduce). If the original poster and others that posted before that still experience the duplicate issue after upgrading to v2.9.6 (to be released with a fix for the duplicate messages), please try to provide all configuration files and steps to reproduce similar to @chenchunping. Thanks!

kozlovic avatar Nov 03 '22 19:11 kozlovic

Note that PR #3604 addresses the issue reported by @chenchunping (with the provided configuration files and steps to reproduce). If the original poster and others that posted before that still experience the duplicate issue after upgrading to v2.9.6 (to be released with a fix for the duplicate messages), please try to provide all configuration files and steps to reproduce similar to @chenchunping. Thanks!

I've tested this issue after upgrading nats 2.9.6 and it doesn't occur anymore, thank you very much for fixing this issue so quickly. We rely heavily on nats in our business and I'd like to know why this is causing duplicate messages, can you email([email protected]) or github me if you can?

chenchunping avatar Nov 04 '22 00:11 chenchunping

Note that PR #3604 addresses the issue reported by @chenchunping (with the provided configuration files and steps to reproduce). If the original poster and others that posted before that still experience the duplicate issue after upgrading to v2.9.6 (to be released with a fix for the duplicate messages), please try to provide all configuration files and steps to reproduce similar to @chenchunping. Thanks!

Thank you both @kozlovic @derekcollison

chenchunping avatar Nov 04 '22 00:11 chenchunping

@chenchunping As I have described in the PR it had to do with the fact that accounts import/export rules cause server to create a "shadow" subscription on the import subject on behalf of the account. This subscription was incorrectly propagated across the leafnodes which cause the message to go back and forth due to the subject being the same, which caused a duplicate. We are glad that the issue is now resolved.

kozlovic avatar Nov 04 '22 16:11 kozlovic

@kozlovic @derekcollison when I saw the recent changelog of v2.9.6 I was looking forward to seeing if #2694 has been resolved as well. In fact, behavior has changed but unfortunately it has not been fixed completely. I updated https://github.com/nats-io/nats-server/issues/2694#issuecomment-1311366988.

By the way, great event and great presentations yesterday at RethinkConn!

maxarndt avatar Nov 11 '22 08:11 maxarndt

Thanks, will take a look.

derekcollison avatar Nov 11 '22 16:11 derekcollison

@maxarndt if their are imports and exports on both clusters separated by a leaf node that are applicable to the same message you will get duplicates.

The hub side of the system does not understand anything about what is on the leafnode cluster, by design since it separates operational and security domains.

derekcollison avatar Dec 07 '22 02:12 derekcollison

@asambres-form3 after the network break, does nats-server-1 from cluster A re-establish a solicited leafnode connection or does it stay disconnected?

Also where is the publishing application connected? Cluster-A or Cluster-B?

derekcollison avatar Dec 07 '22 02:12 derekcollison

After the network break, nats-server-1 is able to re-establish the connection. The publishing application is connected to the Cluster-A

asambres-form3 avatar Dec 07 '22 09:12 asambres-form3

And nats-server-1 reconnects via the LB to nats-server-0 in Cluster-B?

derekcollison avatar Dec 07 '22 12:12 derekcollison

@asambres-form3 I now have a complete test that shows the issue. Will circle back when I know more.

derekcollison avatar Dec 07 '22 15:12 derekcollison

Actually I was incorrect, the test passes ok, was a typo in the test.

Will keep digging.

derekcollison avatar Dec 07 '22 16:12 derekcollison

@asambres-form3 anything I could be missing? The test is very detailed and I believe emulates your test setup AFAIK.

derekcollison avatar Dec 07 '22 16:12 derekcollison

https://github.com/nats-io/nats-server/pull/3694

derekcollison avatar Dec 07 '22 20:12 derekcollison