On-demand pubsub
Currently, pubsub starts up as soon as it is initialized. Unfortunately, this means:
- Running three goroutines per peer (at least?).
- Processing subscriptions from peers.
This is true even if pubsub is enabled but not in use.
In order to enable pubsub by default in go-ipfs, we need to find some way for pubsub to not take up a bunch of resources when it's not in use.
The MVP solution is on-demand startup. We can start pubsub on demand and stop it after some idle timeout once the last subscription closes. This should be fairly simple to implement and will make it possible for us to turn pubsub on by default without significantly increasing resource usage for nodes that aren't even using pubsub.
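Roughly, the lifecycle I have in mind looks something like the sketch below (startRouter/stopRouter are placeholders for whatever actually brings pubsub up and down, and the one-minute timeout is arbitrary):

```go
package sketch

import (
	"sync"
	"time"
)

type onDemand struct {
	mu      sync.Mutex
	subs    int         // currently open subscriptions
	idle    *time.Timer // fires a while after the last subscription closes
	running bool
}

func (o *onDemand) subscribe() {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.idle != nil {
		o.idle.Stop() // cancel a pending shutdown
		o.idle = nil
	}
	o.subs++
	if !o.running {
		o.running = true
		// startRouter() -- placeholder: actually start pubsub here
	}
}

func (o *onDemand) unsubscribe() {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.subs--
	if o.subs > 0 {
		return
	}
	// Last subscription closed: stop after an idle timeout rather than
	// immediately, so a quick restart doesn't pay the startup cost again.
	o.idle = time.AfterFunc(time.Minute, func() {
		o.mu.Lock()
		defer o.mu.Unlock()
		if o.subs == 0 && o.running {
			o.running = false
			// stopRouter() -- placeholder: actually stop pubsub here
		}
	})
}
```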
The ideal solution is idle peer detection. We don't really need to keep a stream/goroutine open per peer and could instead close streams to peers to which we have not spoken in a while. At the moment this will make the peer think we're dead so we may need a protocol change to implement this correctly.
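For illustration, idle peer detection could be as simple as recording when we last exchanged a message with each peer and periodically closing quiet streams. This is only a sketch; peerID stands in for libp2p's peer.ID and closeStream is a hypothetical callback:

```go
package sketch

import (
	"sync"
	"time"
)

type peerID string // stand-in for libp2p's peer.ID

type idleTracker struct {
	mu       sync.Mutex
	lastSeen map[peerID]time.Time
}

// touch records that we just sent to or received from this peer.
func (t *idleTracker) touch(p peerID) {
	t.mu.Lock()
	t.lastSeen[p] = time.Now()
	t.mu.Unlock()
}

// reapIdle closes streams to peers we haven't spoken to within maxIdle.
func (t *idleTracker) reapIdle(maxIdle time.Duration, closeStream func(peerID)) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for p, seen := range t.lastSeen {
		if time.Since(seen) > maxIdle {
			closeStream(p) // with the current protocol, the peer may read this as us going away
			delete(t.lastSeen, p)
		}
	}
}
```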
@Stebalien What do you think of each node starting and stopping pubsub outside of libp2p? For example, there could be an `ipfs pubsub [start|stop]` command. It seems like pubsub is an application-specific feature, so the applications that need pubsub would start and stop it.
With this method, I could see a problem if more than one application uses pubsub at the same time. I think a good solution to that would be using a semaphore in the following way:
```go
package sketch

import "sync"

var (
	mu        sync.Mutex
	semaphore int // number of applications currently using pubsub
)

func start_pubsub() {
	mu.Lock()
	defer mu.Unlock()
	semaphore += 1
	if semaphore > 1 {
		return // pubsub is already running
	}
	// start pubsub...
}

func stop_pubsub() {
	mu.Lock()
	defer mu.Unlock()
	if semaphore <= 0 {
		return // pubsub isn't running...
	}
	semaphore -= 1
	if semaphore == 0 {
		// stop pubsub only once the last application is done with it
	}
}
```
I like this method over doing a timeout since it requires less average CPU time.
Unfortunately, I don't trust apps to properly manage a semaphore (e.g., they can crash and never reduce it).
The primary reason for a timeout is to reduce the cost of re-starting pubsub (e.g., application restart and/or configuration change). The connections held open by `ipfs pubsub subscribe` would effectively act as a semaphore/reference count.
Unfortunately, I don't trust apps to properly manage a semaphore (e.g., they can crash and never reduce it).
I meant for the semaphore to be managed by libp2p or go-ipfs, not by the application using the pubsub service. I do understand your concern, though; I forgot to account for the case where an app never properly runs stop_pubsub().
The primary reason for a timeout is to reduce the cost of re-starting pubsub (e.g., application restart and/or configuration change). The connections held open by `ipfs pubsub subscribe` would effectively act as a semaphore/reference count.
So, the timeout would be relatively short (~10 seconds)? Also, I didn't realize starting pubsub was so taxing. Would adding a "suspend" state be useful?
That's what I was thinking (or maybe a minute to be safe?).
When we start pubsub, we'd need to (rough sketch after this list):
- Register the protocol with libp2p. Libp2p would then need to tell our peers that we speak the pubsub protocol.
- Open streams (in both directions) to all peers that speak pubsub. Unfortunately, the current architecture requires leaving these streams open (which is why I'd like to be able to suspend it).
- Send/receive subscriptions from all connected peers.
This isn't terribly expensive, but it's network traffic.
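For concreteness, here's roughly what those steps look like against the go-libp2p host API. This is a sketch, not the real go-ipfs wiring: the import paths are the current go-libp2p ones, "/floodsub/1.0.0" is the floodsub protocol ID, and handleIncoming/sendSubscriptions are hypothetical helpers passed in by the caller:

```go
package sketch

import (
	"context"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/protocol"
)

const pubsubProto = protocol.ID("/floodsub/1.0.0")

func startPubsub(ctx context.Context, h host.Host,
	handleIncoming network.StreamHandler, sendSubscriptions func(network.Stream)) {
	// 1. Register the protocol; identify then tells our peers we speak it.
	h.SetStreamHandler(pubsubProto, handleIncoming)

	// 2. Open a stream to every connected peer that speaks pubsub, and
	// 3. send them our subscription list.
	for _, p := range h.Network().Peers() {
		s, err := h.NewStream(ctx, p, pubsubProto)
		if err != nil {
			continue // peer probably doesn't speak pubsub
		}
		go sendSubscriptions(s)
	}
}

func stopPubsub(h host.Host) {
	// Tearing down is the reverse: unregister the handler and let the open
	// streams be closed/reset.
	h.RemoveStreamHandler(pubsubProto)
}
```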
I could see always keeping pubsub started. To suspend after the idle timeout, we could send a request to all peers asking them not to send any more data on the open streams, and ignore any data received after the suspend state has started. To ignore incoming data, have a relatively small buffer continuously accept data from a blocking read syscall and do nothing with it.
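Something like this is what I mean by ignoring the data (just a sketch):

```go
package sketch

import "io"

// drain continuously reads from the stream into a small scratch buffer and
// throws the data away, so the peer can keep writing without us processing it.
func drain(r io.Reader) {
	buf := make([]byte, 512)
	for {
		if _, err := r.Read(buf); err != nil {
			return // stream closed or errored
		}
		// discard whatever was read
	}
}
```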
Unfortunately, that's not really going to help.
- A lot of the cost is keeping the streams open.
- If we keep the streams open but don't accept data, there isn't much of a point: when we resume, we'll need our peers' subscription lists, so we'll have to fetch them somehow.
@Stebalien wrote:
At the moment this will make the peer think we're dead so we may need a protocol change to implement this correctly.
Maybe just remove this expectation? Peers go online and offline all the time, so "dead" should maybe never be assumed, but instead checked on demand.
We could just keep track of peers that closed the connection, with an upper limit of something like 100 peers and a cache age of around 24 hours, after which entries disappear.
This would allow us to reconnect without wasting much traffic and just continue from the last state they had.
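As a rough sketch of the cache I have in mind (the 100-entry cap and 24-hour age are just the numbers from above):

```go
package sketch

import (
	"sync"
	"time"
)

const (
	maxEntries = 100
	maxAge     = 24 * time.Hour
)

// closedPeers remembers peers that closed the connection on us, so we can
// reconnect later and continue from the last state they had.
type closedPeers struct {
	mu      sync.Mutex
	entries map[string]time.Time // peer ID -> when they closed the connection
}

func (c *closedPeers) remember(p string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Expire entries older than maxAge.
	for id, t := range c.entries {
		if time.Since(t) > maxAge {
			delete(c.entries, id)
		}
	}
	// Crude size cap: drop an arbitrary entry when the cache is full.
	if len(c.entries) >= maxEntries {
		for id := range c.entries {
			delete(c.entries, id)
			break
		}
	}
	c.entries[p] = time.Now()
}

func (c *closedPeers) recentlyClosed(p string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.entries[p]
	return ok && time.Since(t) <= maxAge
}
```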