go-libp2p-pubsub

Messages get lost when using gossipsub

Open MBakhshi96 opened this issue 6 years ago • 41 comments

Pubsub is supposed to be reliable, but I lose messages when using gossipsub. When I run around 10 nodes and all of them broadcast messages to a pubsub topic, not all of the messages are delivered: receiving nodes miss some of them. I have also tried floodsub, but the problem persists.

MBakhshi96 avatar Aug 22 '19 16:08 MBakhshi96

@MBakhshi96 Can you be more precise about what you mean by "losing messages"? Are your nodes actually connected to each other and have they completed their initial handshakes?

aschmahmann avatar Aug 22 '19 16:08 aschmahmann

@aschmahmann I mean that sent messages were not received by all of the other nodes. The nodes are connected. I also tried a fully connected configuration, but that did not help. I wait for 2 seconds after connecting the nodes to each other and then subscribe them to a topic.

MBakhshi96 avatar Aug 22 '19 16:08 MBakhshi96

@MBakhshi96 we are not aware of any issues that could cause this. Can you post a test case showing the issue in a github repo? It needs to be reproduced in order to help you. Thanks.

raulk avatar Aug 22 '19 16:08 raulk

@raulk I've added a test case showing the problem here. The example works as follows:

  • First, every node broadcasts a message with its id in round 1.
  • Then each node acknowledges every round 1 message it receives and adds its own id to it. Acknowledgements are broadcast to all of the nodes.
  • Every node receives acknowledgements and prints them.

We start with n = 10 nodes. If everything works well, every node must receive n*n + n messages, and then the execution terminates. But in this example the execution never stops. You can check the number of acks for every message in the output, and you'll see that not all of the acks are received by the nodes.

MBakhshi96 avatar Aug 23 '19 09:08 MBakhshi96
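
For reference, the expected count in the test description above can be worked out as follows. This is only a sketch of the arithmetic; the function name is illustrative, and it assumes each node also receives its own publications, which is what the n*n + n figure implies.

```go
package main

import "fmt"

// expectedPerNode returns how many messages each node should receive in the
// two-round test: n round-1 broadcasts plus n acknowledgements for each of
// them. Assumption: a node also receives the messages it publishes itself.
func expectedPerNode(n int) int {
	round1 := n   // one broadcast per node
	acks := n * n // every node acknowledges every round-1 message
	return round1 + acks
}

func main() {
	fmt.Println(expectedPerNode(10)) // prints 110
}
```

With n = 10, a node that has seen fewer than 110 messages is still waiting, which is why the test never terminates once a single message is dropped.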

Are there any logs about dropped messages?

vyzo avatar Aug 23 '19 09:08 vyzo

@vyzo Where can I find logs for this execution? There is no log in the output, but that may be because of the logging level used in the pubsub code.

MBakhshi96 avatar Aug 23 '19 09:08 MBakhshi96

export IPFS_LOGGING=info

vyzo avatar Aug 23 '19 10:08 vyzo

also, what is your topology?

vyzo avatar Aug 23 '19 10:08 vyzo

@vyzo My topology is a simple ring, but I've also tested it with a fully connected topology. The logs state that messages couldn't be delivered:

INFO pubsub: Can't deliver message to subscription for topic TOPIC; subscriber too slow pubsub.go:522

I don't know what causes this problem and why these messages don't get retransmitted.

MBakhshi96 avatar Aug 23 '19 10:08 MBakhshi96

this log tells you that the pubsub subsystem is dropping messages at subscription delivery; you are simply not consuming the messages fast enough.

vyzo avatar Aug 23 '19 10:08 vyzo

note that there is no retransmission whatsoever in pubsub; also note that the messages are propagated normally, they are just dropped at delivery.

vyzo avatar Aug 23 '19 10:08 vyzo
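
The behaviour described in the two comments above (messages propagate normally but are dropped at delivery when the subscriber falls behind) can be illustrated with a standard-library-only sketch. This is not the library's code: `deliver` and `simulate` are hypothetical names, and the buffer size of 4 is arbitrary. It only shows the general pattern of a non-blocking send into a bounded channel.

```go
package main

import "fmt"

// deliver attempts a non-blocking send. It reports false when the
// subscriber's buffer is full, in which case the message is dropped.
func deliver(ch chan<- string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false
	}
}

// simulate pushes total messages into a buffer of the given capacity while
// nobody is reading, and returns how many were delivered and how many dropped.
func simulate(capacity, total int) (delivered, dropped int) {
	ch := make(chan string, capacity)
	for i := 0; i < total; i++ {
		if deliver(ch, fmt.Sprintf("msg-%d", i)) {
			delivered++
		} else {
			dropped++
		}
	}
	return delivered, dropped
}

func main() {
	delivered, dropped := simulate(4, 10)
	fmt.Printf("delivered=%d dropped=%d\n", delivered, dropped) // delivered=4 dropped=6
}
```

Because the sender never blocks and nothing is retransmitted, any burst larger than the free buffer space is lost for that subscriber, even if the average message rate is low.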

@vyzo What do you mean by not consuming fast? I'm receiving messages inside a for loop, which simply waits for a message and then prints it in the output. How can I consume it faster? How can I prevent this situation? I mean how can I get notified that the receiver can't handle more messages and therefore stop overwhelming the receiver?

MBakhshi96 avatar Aug 23 '19 11:08 MBakhshi96

Are you running the receiver in separate goroutines?

vyzo avatar Aug 23 '19 11:08 vyzo

@vyzo Yeah. You can take a look at the code I provided in a previous comment for reproducing the problem; you can use the code here.

MBakhshi96 avatar Aug 23 '19 11:08 MBakhshi96

what is your message rate? it may be that your computer is too slow.

vyzo avatar Aug 23 '19 11:08 vyzo

@vyzo Actually, I don't know my message rate. In the provided example, every node publishes only 1+10 messages, but I don't know how long it takes to publish these messages. Also, even if my PC is too slow, which it is not, I think it's not good to lose messages. There must be a way to ensure reliable message delivery.

MBakhshi96 avatar Aug 23 '19 12:08 MBakhshi96

there might be something else at play, are you receiving any messages? Maybe your receiver goroutines are not running at all.

vyzo avatar Aug 23 '19 12:08 vyzo

Also, re: drop messages: there has to be a throttle somewhere, we can't buffer an infinite number of messages.

vyzo avatar Aug 23 '19 12:08 vyzo

@vyzo Most of the messages get delivered; I only lose a few. How can I increase the buffer capacity? I know that it's not possible to keep all of the messages, but the number in this case is not really huge. Also, it might be a good idea to notify publishers when recipients can't keep up with them.

MBakhshi96 avatar Aug 23 '19 12:08 MBakhshi96

there is currently no way to specify the subscription buffer size.

vyzo avatar Aug 23 '19 12:08 vyzo

@vyzo So what do you propose? How can I work around this problem, since I need a reliable broadcast scheme?

MBakhshi96 avatar Aug 23 '19 12:08 MBakhshi96

You can make a PR to make the buffer capacity configurable perhaps, but this is not the solution long term. How many nodes are you running on a single computer?

vyzo avatar Aug 23 '19 13:08 vyzo
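
As a rough idea of what a configurable buffer capacity could look like, here is a hypothetical sketch using Go's functional-options pattern. None of these names come from the library as discussed in this thread: `subscription`, `subOpt`, `withBufferSize`, and `newSubscription` are toy stand-ins, and the default of 32 is the channel capacity reported later in this thread.

```go
package main

import "fmt"

// subscription is a toy stand-in, not the library's Subscription type.
type subscription struct {
	ch chan []byte
}

// subOpt is a hypothetical functional option for a subscription.
type subOpt func(*subscription)

// withBufferSize is a hypothetical option that replaces the default buffer
// with one of capacity n.
func withBufferSize(n int) subOpt {
	return func(s *subscription) { s.ch = make(chan []byte, n) }
}

// newSubscription applies the options on top of a default capacity of 32.
func newSubscription(opts ...subOpt) *subscription {
	s := &subscription{ch: make(chan []byte, 32)}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	def := newSubscription()
	big := newSubscription(withBufferSize(256))
	fmt.Println(cap(def.ch), cap(big.ch)) // prints 32 256
}
```

A larger buffer only absorbs bigger bursts; a subscriber that is slower than the sustained message rate will still overflow it eventually, which matches the remark that this is not a long-term solution.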

@vyzo Between 10 and 20.

MBakhshi96 avatar Aug 23 '19 13:08 MBakhshi96

that's weird, it's not a lot of nodes.

vyzo avatar Aug 23 '19 13:08 vyzo

is there any delay between message transmission, or are you sending as fast as you can?

vyzo avatar Aug 23 '19 13:08 vyzo

@vyzo There is no delay between reception and transmission.

MBakhshi96 avatar Aug 23 '19 13:08 MBakhshi96

can you add a small delay before transmitting consecutive messages?

vyzo avatar Aug 23 '19 13:08 vyzo

@vyzo I tried to add 100 milliseconds of delay before publishing to pubsub, but the problem still persists and it has got even worse!

MBakhshi96 avatar Aug 23 '19 14:08 MBakhshi96

are you blocking the receive loop with that delay? that could explain getting worse.

vyzo avatar Aug 23 '19 14:08 vyzo
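
One way to keep slow work such as a sleep or a publish delay out of the receive loop is to have the loop only read and enqueue, and let a separate goroutine do the handling. The sketch below is a standard-library simulation under stated assumptions: `process` is a hypothetical helper, the `in` channel stands in for the subscription, and the queue is assumed to be large enough for the expected burst. In a real program the receive loop would read from the subscription instead of a plain channel.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// process drains in as fast as it can and hands each message to a worker
// goroutine through a larger queue, so slow handling (simulated with a sleep
// of workMs milliseconds) does not stall the receive loop while the queue
// has free space.
func process(in <-chan int, queueSize int, workMs int) []int {
	queue := make(chan int, queueSize)
	var handled []int
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // worker: slow per-message handling
		defer wg.Done()
		for m := range queue {
			time.Sleep(time.Duration(workMs) * time.Millisecond)
			handled = append(handled, m)
		}
	}()
	for m := range in { // receive loop: only reads and enqueues
		queue <- m // blocks only if the queue itself fills up
	}
	close(queue)
	wg.Wait() // handled is safe to read once the worker has exited
	return handled
}

func main() {
	in := make(chan int, 4) // small buffer, like a subscription channel
	go func() {
		for i := 0; i < 20; i++ {
			in <- i
		}
		close(in)
	}()
	fmt.Println(len(process(in, 64, 1))) // prints 20
}
```

The trade-off is that the backlog moves from the small delivery buffer into the application's own queue, where its size and overflow policy are under the application's control.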

@vyzo I was just inspecting the pubsub.go code and discovered here that the capacity of the channel is only 32! Also, when the channel reaches its capacity, the code simply discards the message!

MBakhshi96 avatar Aug 23 '19 14:08 MBakhshi96