go-libp2p-pubsub
Messages get lost when using gossipsub
Pubsub is supposed to be reliable, but I lose messages when using gossipsub. With around 10 nodes, all of which broadcast messages to a single pubsub topic, some messages are never delivered to some of the receiving nodes. I have also tried floodsub, but the problem persists.
@MBakhshi96 Can you be more precise about what you mean by "losing messages"? Are your nodes actually connected to each other and have they completed their initial handshakes?
@aschmahmann I mean that sent messages are not received by all of the other nodes. The nodes are connected; I also tried a fully connected configuration, but that did not help. I wait for 2 seconds after connecting the nodes to each other and then subscribe them to a topic.
@MBakhshi96 we are not aware of any issues that could cause this. Can you post a test case showing the issue in a github repo? It needs to be reproduced in order to help you. Thanks.
@raulk I've added a test case showing the problem here. The example works as follows:
- First, every node broadcasts a message containing its id in round 1.
- Then each node acknowledges every round 1 message it receives and adds its own id to the acknowledgement. Acknowledgements are broadcast to all of the nodes.
- Every node receives the acknowledgements and prints them.
We start with n = 10 nodes. If everything works, every node must receive n*n + n messages, after which execution terminates. In this example, however, execution never stops. If you check the number of acks for each message in the output, you will see that not every ack is received by every node.
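To make the expected count concrete, here is a minimal sketch of the two rounds over a lossless in-memory broadcast. It uses no pubsub code; the `simulate` helper is hypothetical, and it assumes every node also receives its own messages, which is what the n*n + n figure implies.

```go
package main

import "fmt"

// simulate runs the two-round protocol over a lossless in-memory broadcast
// and returns how many messages each node received.
// Assumption: a broadcast is delivered to every node, the sender included.
func simulate(n int) []int {
	received := make([]int, n)

	// broadcast delivers one message to every node.
	broadcast := func() {
		for i := range received {
			received[i]++
		}
	}

	// Round 1: every node broadcasts one message carrying its id.
	for sender := 0; sender < n; sender++ {
		broadcast()
	}

	// Round 2: every node acknowledges each of the n round 1 messages,
	// and every acknowledgement is itself broadcast to all nodes.
	for acker := 0; acker < n; acker++ {
		for msg := 0; msg < n; msg++ {
			broadcast()
		}
	}
	return received
}

func main() {
	n := 10
	for id, got := range simulate(n) {
		fmt.Printf("node %d received %d messages (expected %d)\n", id, got, n*n+n)
	}
}
```

With n = 10 each node should see 110 messages, so any node that reports fewer has lost something.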
Are there any logs about dropped messages?
@vyzo Where can I find logs for this execution? There is no log in the output, but that may be because of the logging level used in the pubsub code.
export IPFS_LOGGING=info
also, what is your topology?
@vyzo My topology is a simple ring, but I've also tested it with a fully connected topology. The logs state that messages could not be delivered:
INFO pubsub: Can't deliver message to subscription for topic TOPIC; subscriber too slow pubsub.go:522
I don't know what causes this problem and why these messages don't get retransmitted.
this log tells you that the pubsub subsystem is dropping messages at subscription delivery; you are simply not consuming the messages fast enough.
note that there is no retransmission whatsoever in pubsub; also note that the messages are propagated normally, they are just dropped at delivery.
@vyzo What do you mean by not consuming fast? I'm receiving messages inside a for loop, which simply waits for a message and then prints it in the output. How can I consume it faster? How can I prevent this situation? I mean how can I get notified that the receiver can't handle more messages and therefore stop overwhelming the receiver?
Are you running the receiver in separate goroutines?
@vyzo Yes. You can take a look at the code I provided in previous comments for reproducing the problem. You can use the code here.
what is your message rate? it may be that your computer is too slow.
@vyzo Actually, I don't know my message rate. In the provided example, every node publishes only 1+10 messages, but I don't know how long it takes to publish them. Also, even if my PC were too slow, which it is not, I don't think it is acceptable to lose messages. There must be a way to ensure reliable message delivery.
there might be something else at play, are you receiving any messages? Maybe your receiver goroutines are not running at all.
Also, re: drop messages: there has to be a throttle somewhere, we can't buffer an infinite number of messages.
@vyzo Most of the messages get delivered; I only lose a few. How can I increase the buffer capacity? I know that it's not possible to keep all of the messages, but the number in this case is not really huge. Also, it might be a good idea to notify publishers when recipients can't keep up with them.
there is currently no way to specify the subscription buffer size.
@vyzo So what is your proposition? How can I circumvent this problem, since I need a reliable broadcast scheme?
You can make a PR to make the buffer capacity configurable perhaps, but this is not the long-term solution. How many nodes are you running on the single computer?
@vyzo Between 10 and 20.
that's weird, it's not a lot of nodes.
is there any delay between message transmission, or are you sending as fast as you can?
@vyzo There is no delay between reception and transmission.
can you add a small delay before transmitting consecutive messages?
@vyzo I tried to add 100 milliseconds of delay before publishing to pubsub, but the problem still persists and it has got even worse!
are you blocking the receive loop with that delay? that could explain getting worse.
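One way to keep the receive loop from blocking is to hand each message to a separate queue as soon as it arrives and do the slow work (printing, delays, publishing acks) on the other side. The following is a minimal sketch using only plain Go channels. The `unboundedQueue` helper is hypothetical and not part of go-libp2p-pubsub; in a real program the `in` channel would presumably be fed by the goroutine that reads from the subscription. Note that the queue grows without limit, so it trades dropped messages for memory.

```go
package main

import "fmt"

// unboundedQueue forwards values from in to the returned channel, buffering
// them in a slice so that whoever writes to in never waits on a slow consumer.
// Hypothetical helper for illustration; memory use is unbounded.
func unboundedQueue(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		var pending []string
		for in != nil || len(pending) > 0 {
			// Only enable the send case when there is something to send;
			// a nil channel in a select case is never ready.
			var send chan<- string
			var next string
			if len(pending) > 0 {
				send = out
				next = pending[0]
			}
			select {
			case msg, ok := <-in:
				if !ok {
					in = nil // input closed: drain what is left, then exit
					continue
				}
				pending = append(pending, msg)
			case send <- next:
				pending = pending[1:]
			}
		}
	}()
	return out
}

func main() {
	in := make(chan string)
	out := unboundedQueue(in)

	// Fast producer: stands in for the loop that receives from the subscription.
	go func() {
		for i := 0; i < 100; i++ {
			in <- fmt.Sprintf("msg-%d", i)
		}
		close(in)
	}()

	// Consumer: may be as slow as it likes without stalling the producer.
	count := 0
	for range out {
		count++
	}
	fmt.Println("processed", count, "messages")
}
```

Any delay added for rate limiting would then go in the consumer, not in the loop that receives from the subscription.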
@vyzo I was just inspecting the pubsub.go code and discovered here that the capacity of the channel is only 32! Also, when the channel reaches its capacity, the code simply discards the message!
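The behaviour described above can be modelled with a buffered channel and a non-blocking send. This is a simplified sketch, not the library's code: `deliver` and `burst` are hypothetical names, and the only details taken from the thread are the capacity of 32 and the fact that a message is discarded when the channel is full.

```go
package main

import "fmt"

// deliver attempts a non-blocking send and reports whether the message
// was accepted. When the buffer is full the message is simply dropped.
func deliver(ch chan<- int, msg int) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false
	}
}

// burst sends n messages into a buffer of the given capacity while nobody
// is reading, and returns how many were delivered and how many were dropped.
func burst(capacity, n int) (delivered, dropped int) {
	ch := make(chan int, capacity)
	for i := 0; i < n; i++ {
		if deliver(ch, i) {
			delivered++
		} else {
			dropped++
		}
	}
	return delivered, dropped
}

func main() {
	// 110 is the per-node message count of the test case (n*n + n for n = 10).
	delivered, dropped := burst(32, 110)
	fmt.Printf("delivered=%d dropped=%d\n", delivered, dropped)
	// prints "delivered=32 dropped=78"
}
```

In practice the reader drains the channel concurrently, so far fewer messages are lost, but any burst that outruns the reader by more than 32 messages will lose some.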