nats-server
Duplicate messages received while consuming with a leafnode connection
Defect
Messages are delivered twice to the same client.
The NATS client receives each message 2 times. Both copies are delivered almost at the same time (<10 ms apart) and are identical (same headers, sequence ID, body, and delivery count).
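A minimal sketch of the kind of durable consumer we use to surface the duplicates; this is illustrative rather than our actual application code, and the URL, subject, stream and durable names are made up. Duplicates show up as two callbacks carrying identical metadata:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Illustrative URL; the client connects to cluster B.
	nc, err := nats.Connect("nats://cluster-b.example:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable push subscription with manual acks; in normal operation each
	// message should be delivered (and acked) exactly once.
	_, err = js.Subscribe("orders.>", func(m *nats.Msg) {
		meta, err := m.Metadata()
		if err != nil {
			log.Println("metadata:", err)
			return
		}
		log.Printf("stream seq=%d consumer seq=%d delivered=%d",
			meta.Sequence.Stream, meta.Sequence.Consumer, meta.NumDelivered)
		m.Ack()
	}, nats.Durable("orders-worker"), nats.ManualAck())
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive
}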
- We experience this issue in all our environments
- We are able to reproduce the issue in a local kind cluster
- Unfortunately we have not found a deterministic procedure to trigger the issue
- We need to restart servers multiple times, in a sequence we have not been able to pin down, before the problem appears
Versions of nats-server and affected client libraries used:
nats-server v2.8.4
nats.go v1.16.0
OS/Container environment:
Running NATS in a Kubernetes cluster with the official Helm chart.
Steps or code to reproduce the issue:
Setup diagram:
Given:
- 2 NATS clusters (A and B), each composed of 2 nats-server instances, with JetStream running on all instances
- Cluster A establishes leafnode connections to cluster B
- A Go service (Application 1, Application 2) using the client library nats.go v1.16.0 is connected to cluster B and consumes messages from a stream that lives in cluster A (thus via a leafnode connection); a sketch of such a consumer follows after this list
When:
- an unexpected sequence of events happens (nats-server restarts, network partitions, ...)
Then:
- the cluster enters an incoherent state and doesn't recover (all nodes running, no network partition)
- The application instance receives each message 2 times
- The duplicated messages are received in the same millisecond on the same instance
- Restarting the application does not resolve the issue
- The issue does not self-resolve with time
- Messages are systematically and constantly received 2 times.
- Restarting the nats-server that sends the duplicates (the one the client is connected to) is the ONLY way to stop the problem
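As referenced in the setup above, here is a minimal sketch of how such a consumer could reach the stream in cluster A while being connected to cluster B. This is not our actual application code: it assumes the stream lives in a JetStream domain named "A", and the URL, subject, stream and durable names are illustrative.

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// The application connects to cluster B...
	nc, err := nats.Connect("nats://cluster-b.example:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// ...but binds its JetStream context to domain "A", so stream and
	// consumer API calls are routed over the leafnode connection to cluster A.
	js, err := nc.JetStream(nats.Domain("A"))
	if err != nil {
		log.Fatal(err)
	}

	sub, err := js.PullSubscribe("orders.>", "orders-worker", nats.BindStream("ORDERS"))
	if err != nil {
		log.Fatal(err)
	}
	for {
		msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
		if err != nil {
			continue // timeouts are expected while the stream is idle
		}
		for _, m := range msgs {
			m.Ack()
		}
	}
}

The point of the sketch is only that all consumer traffic crosses the leafnode connection between the two clusters.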
Logs from a NATS server that sends duplicates (duplicates start at sequence 104): nats-0-trace.log
Expected result:
Each message should be received once from the same consumer or subscription.
Actual result:
The Go client subscribing to the remote consumer receives duplicated messages.
This situation seems to have quite a few moving pieces, some we know and some are probably not known to us. Recreating this and trying to reproduce could take several hours and we might not see anything in our setup. So, I would suggest we jump on a Zoom call and triage. Are you able to reproduce predictably?
@derekcollison Yes, we can reproduce semi-predictably. If that works for you we can prepare the cluster to be in that situation before we jump on a call. When would you have some availability?
If extremely important on your side I could do early PT tomorrow, otherwise early PT Monday as I am traveling to customers starting mid-day tomorrow.
Shoot me an email, [email protected]
Sounds like ensuring the cluster names are unique might have resolved this? OK to close?
We are looking at ways the system can help out when we detect this situation.
We have the same problem with a leafnode and a source copy on the same account: for each message, two copies appear on both the origin stream and the destination stream. A mirror in the other direction works properly.
Some insight from the leaf cluster:
asdasdasdasdsad
[#42] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 5 / stream seq: 5
asdasdasdasdsad
[#43] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 6 / stream seq: 6
asdasdasdasdsad
[#44] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 5 / stream seq: 5
asdasdasdasdsad
[#45] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 6 / stream seq: 6
asdasdasdasdsad
Some insight from the main (hub) cluster:
[#18] Received on "supervisor.processing-stats.europe-1"
asdasdasdasdsad
[#19] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 5 / stream seq: 5
asdasdasdasdsad
[#20] Received JetStream message: consumer: supervisor-europe-1 > cNF6m841 / subject: supervisor.processing-stats.europe-1 / delivered: 1 / consumer seq: 6 / stream seq: 6
asdasdasdasdsad
I promise I only sent one message:
nats --context=global_europe_1 pub supervisor.processing-stats.europe-1
23:50:38 Reading payload from STDIN
asdasdasdasdsad
23:50:41 Published 16 bytes to "supervisor.processing-stats.europe-1"
Main Stream config:
{
  "name": "supervisor-europe-1",
  "subjects": [
    "supervisor.processing-stats.europe-1",
    "supervisor.processor-events.europe-1",
    "supervisor.search-accounting.europe-1"
  ],
  "retention": "limits",
  "max_consumers": -1,
  "max_msgs_per_subject": -1,
  "max_msgs": 1000000,
  "max_age": 31536000000000000,
  "max_msg_size": 1048576,
  "storage": "file",
  "discard": "old",
  "num_replicas": 3,
  "duplicate_window": 600000000000
}
Source stream config:
{
  "name": "agreggate-supervisor-europe-1",
  "retention": "limits",
  "max_consumers": -1,
  "max_msgs_per_subject": 0,
  "max_msgs": -1,
  "max_bytes": -1,
  "max_age": 0,
  "max_msg_size": -1,
  "storage": "file",
  "discard": "old",
  "num_replicas": 3,
  "duplicate_window": 120000000000,
  "sources": [
    {
      "name": "supervisor-europe-1",
      "external": {
        "api": "$JS.europe-1-global-leaf.API",
        "deliver": ""
      }
    }
  ],
  "sealed": false,
  "deny_delete": false,
  "deny_purge": false,
  "allow_rollup_hdrs": false
}
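For reference, roughly the same source relationship expressed with nats.go's JetStreamManager; this is a sketch derived from the JSON above (the connection URL is illustrative and most limits are omitted), not the code actually used to create the streams:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Source the leaf-side stream through its external JetStream API prefix,
	// matching the "external.api" value in the JSON above.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "agreggate-supervisor-europe-1",
		Storage:  nats.FileStorage,
		Replicas: 3,
		Sources: []*nats.StreamSource{{
			Name: "supervisor-europe-1",
			External: &nats.ExternalStream{
				APIPrefix: "$JS.europe-1-global-leaf.API",
			},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}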
@bloodbare Would you be able to send the configuration files of all servers involved and the configuration of the JS consumer you use (if any)? I see the subscription output, but I am not sure how you started the sub (where it connects to, what command-line params you used, etc.). In short, anything you could provide so I can reproduce the setup and see the issue. Thanks!
I saw something similar. Our setup is mqtt -> JetStream on a leafnode -> sourcing JetStream, and when I do a plain NATS subscribe to "msg.originating.in.mqtt" I get duplicate messages. I have not checked the JetStream messages in detail, so I'm not sure whether the duplicates exist there too.
I also ran into the same problem, and my configuration is basically similar to yours.
When I publish a message on cluster A and subscribe on cluster B, and the subscriber is not connected to the leader of cluster B, duplicate messages appear.
What is the stated cluster name for A_1 and A_2? And likewise B_1 and B_2?
Also, if JetStream is involved, they need to have unique and distinct domains.
Possible to share the configs for all servers as @kozlovic mentioned?
When the subscriber connects directly to the cluster, there is a certain chance of receiving duplicate data (if the client's connection happens to be load-balanced to one particular node it sends and receives normally, otherwise messages are duplicated), and the same happens with JetStream. In this case I create the connection inside a loop, pub/sub, and close the connection at the end of each iteration; the test code is like this:
for i := 0; i < 50; i++ {
	subject := "iot.b.topic"
	producer := &natsServer{
		user:          "usera",
		pass:          "usera",
		natsServerurl: "nats://127.0.0.1:14223",
	}
	consumer := &natsServer{
		user:          "userb",
		pass:          "userb",
		natsServerurl: "nats://127.0.0.1:24222,nats://127.0.0.1:24223",
	}
	// signals the sub listener that publishing is done and it can exit
	pubEndSingle := make(chan bool, 1)
	// signals that the sub has fully exited, so the main program can exit
	subEndSingle := make(chan bool, 1)
	// signals that the sub listener has started and pub may begin
	subStarSingle := make(chan bool, 1)
	var messageCount = 10
	go natsSub(subject, *consumer, messageCount, pubEndSingle, subEndSingle, subStarSingle)
	<-subStarSingle
	natsPub(subject, *producer, messageCount)
	pubEndSingle <- true
	<-subEndSingle
}
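The natsServer type and the natsSub / natsPub helpers are not shown above; here is a hedged reconstruction of what they might look like. The field and function names come from the loop above, everything else is an assumption:

package main

import (
	"fmt"
	"log"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

// natsServer holds the credentials and URL(s) of the server(s) to connect to.
type natsServer struct {
	user          string
	pass          string
	natsServerurl string
}

func connect(s natsServer) *nats.Conn {
	nc, err := nats.Connect(s.natsServerurl, nats.UserInfo(s.user, s.pass))
	if err != nil {
		log.Fatal(err)
	}
	return nc
}

// natsSub listens on subject until the publisher signals completion, then
// reports whether more than messageCount copies were received.
func natsSub(subject string, s natsServer, messageCount int, pubEnd, subEnd, subStarted chan bool) {
	nc := connect(s)
	defer nc.Close()

	var received int32
	sub, err := nc.Subscribe(subject, func(m *nats.Msg) {
		n := atomic.AddInt32(&received, 1)
		fmt.Printf("receive message #%d: %s\n", n, string(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	subStarted <- true      // tell the publisher we are listening
	<-pubEnd                // wait for the publisher to finish
	time.Sleep(time.Second) // grace period for in-flight messages

	recv := int(atomic.LoadInt32(&received))
	fmt.Printf("pub %d recv %d, duplicated: %v\n", messageCount, recv, recv > messageCount)
	subEnd <- true
}

// natsPub publishes messageCount messages on subject.
func natsPub(subject string, s natsServer, messageCount int) {
	nc := connect(s)
	defer nc.Close()
	for i := 0; i < messageCount; i++ {
		payload := fmt.Sprintf("hello ->%s", time.Now().Format("2006-01-02 15:04:05.000"))
		if err := nc.Publish(subject, []byte(payload)); err != nil {
			log.Fatal(err)
		}
		log.Printf("Published [%s] : '%s'", subject, payload)
	}
	nc.Flush() // make sure everything has left the client before returning
}

The loop shown earlier would sit inside main().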
Result:
2022/11/01 10:27:34 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:34.704'
client:15 receive message hello ->2022-11-01 10:27:34.704
client:15 receive message hello ->2022-11-01 10:27:34.704
2022/11/01 10:27:36 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:36.805'
client:15 receive message hello ->2022-11-01 10:27:36.805
client:15 receive message hello ->2022-11-01 10:27:36.805
>>>>>>>>>>>>> close conn,clientId 15,err:<nil>: pub 10 recv 20, repect:true
2022/11/01 10:27:38 client id :18 Listening on [iot.b.topic]
2022/11/01 10:27:38 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:38.989'
client:18 receive message hello ->2022-11-01 10:27:38.989
2022/11/01 10:27:41 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:41.090'
client:18 receive message hello ->2022-11-01 10:27:41.090
2022/11/01 10:27:43 Published [iot.b.topic] : 'hello ->2022-11-01 10:27:43.192'
client:18 receive message hello ->2022-11-01 10:27:43.192
Here is my configuration: 
A_1.conf
http_port : 18222
cluster : {
  name : A
  port : 15333
  routes : [nats-router://127.0.0.1:15333,nats-router://127.0.0.1:15334]
}
leafnodes : {
  port : 14333
  remotes : [{
    account : A
    urls : [nats-leaf://_AConnUser_:[email protected]:24333,nats-leaf://_AConnUser_:[email protected]:24334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/A_1"
  domain: A
}
port : 14222
server_name : A_1
accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
      {user: usera, password: usera, permissions: {
        publish:{
          allow:["iot.b.topic"]
        },
        subscribe:{
          allow:["iot.a.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"B", subject:"iot.a.topic"}}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
    ]
    exports:[
      {stream: "iot.a.topic", accounts: ["A"]}
    ]
  }
}
=========================================================
A_2.conf
http_port : 18223
cluster : {
  name : A
  port : 15334
  routes : [nats-router://127.0.0.1:15333,nats-router://127.0.0.1:15334]
}
leafnodes : {
  port : 14334
  remotes : [{
    account : A
    urls : [nats-leaf://_AConnUser_:[email protected]:24333,nats-leaf://_AConnUser_:[email protected]:24334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/A_2"
  domain: A
}
port : 14223
server_name : A_2
accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
      {user: usera, password: usera, permissions: {
        publish:{
          allow:["iot.b.topic"]
        },
        subscribe:{
          allow:["iot.a.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"B", subject:"iot.a.topic"}}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
    ]
    exports:[
      {stream: "iot.a.topic", accounts: ["A"]}
    ]
  }
}
=========================================================
B_1.conf
http_port : 28222
cluster : {
  name : B
  port : 25333
  routes : [nats-router://127.0.0.1:25333,nats-router://127.0.0.1:25334]
}
leafnodes : {
  port : 24333
  remotes : [{
    account : B
    urls : [nats-leaf://_BConnUser_:[email protected]:14333,nats-leaf://_BConnUser_:[email protected]:14334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/B_1"
  domain: B
}
port : 24222
server_name : B_1
accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
    ]
    exports:[
      {stream: "iot.b.topic", accounts: ["B"]}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
      {user: userb, password: userb, permissions: {
        publish:{
          allow:["iot.a.topic"]
        },
        subscribe:{
          allow:["iot.b.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"A", subject:"iot.b.topic"}}
    ]
  }
}
=========================================================
B_2.conf
http_port : 28223
cluster : {
  name : B
  port : 25334
  routes : [nats-router://127.0.0.1:25333,nats-router://127.0.0.1:25334]
}
leafnodes : {
  port : 24334
  remotes : [{
    account : B
    urls : [nats-leaf://_BConnUser_:[email protected]:14333,nats-leaf://_BConnUser_:[email protected]:14334]
  }]
}
jetstream {
  store_dir: "/Users/ccp/Documents/git/data/storage/jetstream/B_2"
  domain: B
}
port : 24223
server_name : B_2
accounts:{
  A:{
    jetstream: enabled
    users:[
      {user: _AConnUser_, password: _AConnUser_},
    ]
    exports:[
      {stream: "iot.b.topic", accounts: ["B"]}
    ]
  },
  B:{
    jetstream: enabled
    users:[
      {user: _BConnUser_, password: _BConnUser_},
      {user: userb, password: userb, permissions: {
        publish:{
          allow:["iot.a.topic"]
        },
        subscribe:{
          allow:["iot.b.topic"]
        }
      }}
    ]
    imports:[
      {stream:{account:"A", subject:"iot.b.topic"}}
    ]
  }
}
Thank you sincerely for your help!
Is it not possible to create a bi-directional leafnode connection using an unavailable account? I tested and found that bi-directional leafnode linking is allowed when I start a standalone server.
If I remove the import and export of iot.a.topic in cluster A, pub/sub on iot.b.topic is normal and there are no duplicate messages.
@chenchunping Thank you for providing the configuration files and steps to reproduce. I am going to set that up and see if I can reproduce and make sense of what is going on. It is for sure a pretty complex setup (leafnodes to each other and account imports/exports). I will update this ticket as soon as I can.
@chenchunping Just an update that I have made some progress and will resume work tomorrow.
Thank you for all your help
Note that PR #3604 addresses the issue reported by @chenchunping (with the provided configuration files and steps to reproduce). If the original poster and others that posted before that still experience the duplicate issue after upgrading to v2.9.6 (to be released with a fix for the duplicate messages), please try to provide all configuration files and steps to reproduce similar to @chenchunping. Thanks!
I've tested after upgrading to nats-server 2.9.6 and the issue doesn't occur anymore; thank you very much for fixing it so quickly. We rely heavily on NATS in our business and I'd like to know what was causing the duplicate messages; can you reach me by email ([email protected]) or on GitHub if you can?
Thank you both @kozlovic @derekcollison
@chenchunping As I described in the PR, it had to do with the fact that account import/export rules cause the server to create a "shadow" subscription on the import subject on behalf of the account. This subscription was incorrectly propagated across the leafnodes, which caused the message to go back and forth (the subject being the same) and resulted in a duplicate. We are glad that the issue is now resolved.
@kozlovic @derekcollison When I saw the recent changelog for v2.9.6 I was looking forward to seeing whether #2694 had been resolved as well. The behavior has changed, but unfortunately it has not been fixed completely. I updated https://github.com/nats-io/nats-server/issues/2694#issuecomment-1311366988.
By the way, great event and great presentations yesterday at RethinkConn!
Thanks, will take a look.
@maxarndt if there are imports and exports on both clusters separated by a leafnode that are applicable to the same message, you will get duplicates.
The hub side of the system does not understand anything about what is on the leafnode cluster, by design, since it separates operational and security domains.
@asambres-form3 after the network break, does nats-server-1 from cluster A re-establish a solicited leafnode connection or does it stay disconnected?
Also where is the publishing application connected? Cluster-A or Cluster-B?
After the network break, nats-server-1 is able to re-establish the connection. The publishing application is connected to Cluster-A.
And nats-server-1 reconnects via the LB to nats-server-0 in Cluster-B?
@asambres-form3 I now have a complete test that shows the issue. Will circle back when I know more.
Actually, I was incorrect; the test passes OK, there was a typo in the test.
Will keep digging.
@asambres-form3 anything I could be missing? The test is very detailed and I believe emulates your test setup AFAIK.
https://github.com/nats-io/nats-server/pull/3694