
Parallel selectors


Currently, selectors look up and explore nodes one at a time. Unfortunately, this means we often have to pause while going to disk.

Instead, we should be reading from disk in parallel. The simplest approach is likely to parallelize:

https://github.com/ipld/go-ipld-prime/blob/dc342a9917db92e76bd6d71b9a6d2a7d88c70fe2/traversal/walk.go#L143-L173

Or at least do a bit of lookahead to load a few path segments.
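
A minimal sketch of the parallel/lookahead idea, using hypothetical `Link`, `Block`, and `loadBlock` stand-ins rather than go-ipld-prime's real interfaces: the walk still visits children in their usual order, but the block reads behind them are issued concurrently and bounded by a semaphore.

```go
// Sketch only: Link, Block and loadBlock are hypothetical stand-ins,
// not go-ipld-prime's actual types.
package prefetch

import "sync"

type Link string  // hypothetical content address of a child block
type Block []byte // hypothetical raw block bytes

// Prefetch issues all child-block loads concurrently (bounded by maxParallel)
// and returns the results keyed by link, so the in-order traversal that
// follows finds each block already in memory instead of pausing on disk.
func Prefetch(links []Link, loadBlock func(Link) (Block, error), maxParallel int) map[Link]Block {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[Link]Block, len(links))
		sem     = make(chan struct{}, maxParallel) // bounds concurrent reads
	)
	for _, lnk := range links {
		wg.Add(1)
		go func(lnk Link) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a read slot
			defer func() { <-sem }() // release it when done
			blk, err := loadBlock(lnk)
			if err != nil {
				return // a real implementation would record the error per link
			}
			mu.Lock()
			results[lnk] = blk
			mu.Unlock()
		}(lnk)
	}
	wg.Wait()
	return results
}
```

Since only the I/O is parallelized and the traversal itself still advances in order, a prefetch along these lines should stay compatible with consumers that rely on a deterministic visit order.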

Stebalien · Mar 29 '21 23:03

A few things:

  1. I started on an implementation to enable a number of more flexible kinds of walk, including parallel walking, breadth-first traversal, and a traversal whose advancement you control at each step -- https://github.com/ipld/go-walker -- but I haven't finished it yet and am not sure when, or if, I will get time in my current work to finish it.
  2. We will definitely need a parallel walk if we use selector traversals backed by the Block Service in go-ipfs, because then I/O is potentially limited by the speed of Bitswap. So we should for sure do this before we deploy go-fetcher widely.
  3. We can definitely do work in parallel, but in the context of graphsync we need to be careful to maintain the traversal order in the response.
  4. Currently, disk performance in graphsync matters most as a penalty on the write side, for the peer that makes the graphsync request. One interesting move would be to queue up these writes in a separate thread -- they aren't queued currently (see the sketch after this list). That is a separate task we ought to explore independently of this one.
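
For point 4, a rough sketch of what an asynchronous write queue could look like, assuming a hypothetical `putBlock` persistence function (this is not graphsync's actual API): the traversal hands blocks to a buffered channel and a single background goroutine performs the slow datastore writes.

```go
// Sketch only: WriteQueue and putBlock are hypothetical names, not part of
// graphsync's real API.
package writequeue

import "sync"

type WriteQueue struct {
	ch chan []byte
	wg sync.WaitGroup
}

// NewWriteQueue starts the background writer. putBlock is whatever persists a
// block to disk (assumed here, e.g. a wrapper around a blockstore put).
func NewWriteQueue(putBlock func([]byte) error, depth int) *WriteQueue {
	q := &WriteQueue{ch: make(chan []byte, depth)}
	q.wg.Add(1)
	go func() {
		defer q.wg.Done()
		for blk := range q.ch {
			if err := putBlock(blk); err != nil {
				// a real implementation would surface the error to the requestor
				continue
			}
		}
	}()
	return q
}

// Enqueue only blocks once the buffer is full, so the traversal is decoupled
// from disk latency for up to `depth` outstanding writes.
func (q *WriteQueue) Enqueue(blk []byte) { q.ch <- blk }

// Close flushes the remaining writes and waits for the writer to exit.
func (q *WriteQueue) Close() { close(q.ch); q.wg.Wait() }
```

A bounded buffer keeps memory use in check: the traversal only stalls once `depth` writes are outstanding.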

hannahhoward · Mar 29 '21 23:03

Last I checked, disk read performance on the response side is not a huge contributor to slowness in Graphsync, because it's gated by backpressure from actually sending the messages on the network.

hannahhoward · Mar 29 '21 23:03

> Last I checked, disk read performance on the response side is not a huge contributor to slowness in Graphsync, because it's gated by backpressure from actually sending the messages on the network.

I thought @raulk's investigations showed that replacing the datastore with an in-memory datastore made things much faster, because we were no longer blocked on sequential random-access reads?

Stebalien · Mar 30 '21 01:03