js-waku

feat: SDK for redundant usage of filter/lightpush

Open fryorcraken opened this issue 1 year ago • 12 comments


Summary

Implement a scoring or other mechanism to enable js-waku nodes to:

  1. Rely on random internet peers with minimal degradation of the experience
  2. Subsequently, save peers in local storage and use them upon start-up

Implementing (2) without (1) would mean that, upon start-up, a node would connect to previously found peers instead of bootstrap (Waku fleet) peers. Such peers may not be reliable, which could fully degrade the experience. A js-waku node needs to determine whether it can avoid using bootstrap peers.

Also note:

  1. Bootstrap peers should still be used for the store service until we have a distributed service
  2. Peers passed as a static list should be considered bootstrap peers

Acceptance Criteria

  • [x] A js-waku node can use services (filter, light push) from several remote nodes at the same time: #1779
  • [ ] Some (scoring) mechanism to enable a local peer to determine whether a remote peer is reliable enough to be used for filter and light push services
  • [x] Save peers in local storage and use them upon next restart for filter and light push, if they are deemed reliable, to increase decentralization and reduce load of bootstrap fleet.

Notes

To ensure that API consumers do not receive duplicate messages when several nodes are used for filter, caching of message IDs (MUID) will be necessary.
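A minimal sketch of such a cache, assuming each message already carries a unique string ID. `SeenMessageCache` and `markIfNew` are hypothetical names for illustration, not part of the js-waku API, and the size limit is an arbitrary choice.

```typescript
// Hypothetical helper, not part of the js-waku API.
// Remembers recently seen message IDs so that a message delivered by
// several filter peers is handed to the application only once.
class SeenMessageCache {
  private readonly seen = new Set<string>();
  private readonly maxSize: number;

  constructor(maxSize: number = 1000) {
    this.maxSize = maxSize;
  }

  // Returns true the first time an ID is observed, false afterwards.
  markIfNew(messageId: string): boolean {
    if (this.seen.has(messageId)) {
      return false;
    }
    this.seen.add(messageId);
    if (this.seen.size > this.maxSize) {
      // A Set iterates in insertion order, so the first value is the oldest.
      const oldest = this.seen.values().next().value;
      if (oldest !== undefined) {
        this.seen.delete(oldest);
      }
    }
    return true;
  }
}
```

A filter callback could call `markIfNew(id)` and forward the message to the consumer only when it returns `true`. Bounding the set keeps memory use flat, at the cost of forgetting very old IDs.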

Tasks

  • [x] https://github.com/waku-org/js-waku/issues/1606

RAID (Risks, Assumptions, Issues and Dependencies)

  • Depends on @waku-org/research to help/deliver the scoring/other logic.

fryorcraken avatar Aug 08 '23 04:08 fryorcraken

An idea for the logic: https://github.com/waku-org/js-waku/issues/914#issuecomment-1668882251

fryorcraken avatar Aug 08 '23 04:08 fryorcraken

@danisharora099 to look for a way to gauge how reliable a peer is (scoring) using the existing nwaku API (possibly a libp2p protocol)

weboko avatar Aug 10 '23 10:08 weboko

@danisharora099 Shall we add a latency check as part of this milestone, where we select the peers with the lowest latency? Maybe we could even have logic that pings every new peer found via PX and, if a faster peer is found, starts using it (in addition to other peers).

Maybe latency can be part of some scoring mechanism? Not sure.

fryorcraken avatar Aug 15 '23 07:08 fryorcraken

Great initiative to look at some of these questions, especially as it relates to filter usage! Filter relies in many ways on the same building blocks as relay for its reliability, but in a modular, "pick your own tradeoffs" way:

  • redundancy (for relay in full message connections, for filter in subscriptions)
  • randomness (selecting random peers for connection/subscription, preferably with some peer cycling)
  • periodically checking that you received all messages against a cache (this doesn't really exist yet for filter, but you could imagine using occasional store queries to achieve something similar)

As such it will be helpful to provide a configurable "reliability" SDK on top of filter for projects without the scope to build these features from the ground up with filter.

  • A js-waku node can use services (filter, light push) from several remote nodes at the same time.

Indeed. For now I'd suggest just selecting random nodes in the network as filter/lightpush peers, with some redundancy factor built in.

  • Some (scoring) mechanism to enable a local peer to determine whether a remote peer is reliable enough to be used for filter and light push services

I wouldn't necessarily bring scoring into this. Relay/gossipsub, for example, simply chooses to eventually disconnect from peers that provide less value than others (peer scoring may be too long-lived and complex if there's simply a temporary connectivity issue). You could, for example, have n filter subscriptions and periodically review whether some peers have "missed" more messages than others, and cycle those.
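The periodic review could look roughly like the sketch below. It assumes the node records, per filter peer, the set of message IDs that peer pushed during the review window; `peersToCycle` and `maxMissed` are hypothetical names, not js-waku API.

```typescript
// Hypothetical review step; names and threshold are illustrative only.
// receivedByPeer maps a peer ID to the message IDs that peer pushed
// during the current review window.
function peersToCycle(
  receivedByPeer: Map<string, Set<string>>,
  maxMissed: number
): string[] {
  // The union of everything any peer delivered is the best local
  // approximation of "all messages" for the window.
  const allIds = new Set<string>();
  receivedByPeer.forEach((ids) => {
    ids.forEach((id) => allIds.add(id));
  });

  const toCycle: string[] = [];
  receivedByPeer.forEach((ids, peerId) => {
    // Each peer's set is a subset of the union, so the size
    // difference is the number of messages that peer missed.
    const missed = allIds.size - ids.size;
    if (missed > maxMissed) {
      toCycle.push(peerId);
    }
  });
  return toCycle;
}
```

Peers returned by this function would be unsubscribed and replaced with other peers. No long-lived score is kept, so a peer that had a temporary connectivity issue is not penalised beyond the current window.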

  • Save peers in local storage and use them upon next restart for filter and light push, if they are deemed reliable, to increase decentralization and reduce load of bootstrap fleet.

I wouldn't imagine that the DNS lookups, followed by initial peer exchange, should take very long. It's probably a good idea to cache some peers, but I would try to flush out that cache as soon as possible after startup and replace each of these subscriptions with a new one to a random node. This is to prevent a node from always using the same peers and so being vulnerable to bias.

Note that @siphiuel has been doing similar work on filter for status-go, so definitely worth getting his input here. :)

jm-clius avatar Aug 15 '23 17:08 jm-clius

@jm-clius agree with your overall idea, thanks for the comment!

re:

randomness (selecting random peers for connection/subscription, preferably with some peer cycling)

We decided to use the peer with the lowest ping for this, with the aim of getting the fastest responses to protocol requests, so I'm not sure how useful randomness is in the context of js-waku. Perhaps the strategy can be to increase the score of the node with the lowest ping for js-waku. cc @fryorcraken

danisharora099 avatar Oct 17 '23 09:10 danisharora099

I'd suggest following @jm-clius's recommendation here and not introducing scoring. I think prioritizing nodes with the lowest latency first makes sense. Then, if nodes are unreliable, we can disconnect and use another node.

fryorcraken avatar Oct 20 '23 04:10 fryorcraken

attributes that could contribute to defining "reliability":

  • remote peer should have relay enabled
  • latency
  • number of times a remote peer has dropped a connection with us
  • peers discovered through peer-exchange
    • this also includes deprioritizing local storage peers in favour of peer-exchange peers

rough implementation (needs improvement): whenever a protocol request is initiated:

  1. get all the peers connected
  2. check that they support relay (prioritize these peers, for the remaining "seats" use other peers)
  3. sort them by their latencies & reliability gauged by their # of disconnections
  4. use the top N peers to send the protocol request
  5. observe these N peers:
    • if any of them proves to be "unreliable", i.e. is unable to process (?) our request or sends a faulty response,
    • deprioritize it, and cycle in a new peer
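Steps 1 to 4 above could be sketched as follows. The `PeerInfo` shape and `selectPeers` are hypothetical, and the tie-break order (relay support, then fewest disconnections, then lowest latency) is an assumption, since the steps do not say how latency and disconnections should be weighed against each other.

```typescript
// Hypothetical peer record; js-waku does not expose this exact shape.
interface PeerInfo {
  id: string;
  supportsRelay: boolean;
  latencyMs: number;
  disconnections: number;
}

// Returns the top n peers for a protocol request.
// Assumed ordering: relay-capable peers first, then fewest
// disconnections, then lowest latency.
function selectPeers(peers: PeerInfo[], n: number): PeerInfo[] {
  return peers
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => {
      if (a.supportsRelay !== b.supportsRelay) {
        return a.supportsRelay ? -1 : 1;
      }
      if (a.disconnections !== b.disconnections) {
        return a.disconnections - b.disconnections;
      }
      return a.latencyMs - b.latencyMs;
    })
    .slice(0, n);
}
```

Peers without relay support are not excluded; they only fill the remaining "seats" when there are fewer than n relay-capable peers.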

cc @waku-org/research @fryorcraken

danisharora099 avatar Oct 20 '23 12:10 danisharora099

attributes that could contribute to defining "reliability":

* remote peer should have relay enabled

* latency

* number of times a remote peer has dropped a connection with us

* peers discovered through peer-exchange
  
  * this also includes deprioritizing local storage peers in favour of peer-exchange peers

IMO the most important criteria are missing from the list:

  • pushes at least as many messages as other peers on a filter subscription
  • does not return an error on filter requests such as ping
  • does not return an error on light push requests

fryorcraken avatar Oct 24 '23 05:10 fryorcraken

action plan:

  1. if cache does not exist on startup:
    • DNS lookup, Peer Exchange & connect to fastest peers
    • cache peers in local storage
    • periodically "review" peers for reliability & if a node is "unreliable", swap it out for a random node
      • update cache if necessary
  2. if cache exists on startup:
    • connect to the cached peers
    • once connections are established, flush out the cache & switch to the new "fastest peers"
    • periodically "review" peers for reliability & if a node is "unreliable", swap it out for a random node
      • update cache if necessary
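A rough sketch of the cache read/write part of this plan, assuming peers are stored as a JSON array of address strings. `KeyValueStore` mirrors the `getItem`/`setItem` methods of the browser's `localStorage`; the key name and all function names are hypothetical, and this is not how the linked PR implements it.

```typescript
// Subset of the Web Storage interface; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key, chosen for this example only.
const PEER_CACHE_KEY = "example:peer-cache";

// Reads cached peers; returns an empty list if the cache is missing or corrupt.
function loadCachedPeers(store: KeyValueStore): string[] {
  const raw = store.getItem(PEER_CACHE_KEY);
  if (raw === null) {
    return [];
  }
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) {
      return [];
    }
    return parsed.filter((p): p is string => typeof p === "string");
  } catch {
    return [];
  }
}

function saveCachedPeers(store: KeyValueStore, peers: string[]): void {
  store.setItem(PEER_CACHE_KEY, JSON.stringify(peers));
}

// Case 2 uses the cache when it has entries; case 1 falls back to
// discovery (DNS lookup and Peer Exchange, stubbed here as a callback).
function initialPeers(store: KeyValueStore, discover: () => string[]): string[] {
  const cached = loadCachedPeers(store);
  return cached.length > 0 ? cached : discover();
}
```

Treating a corrupt cache the same as a missing one means a bad entry in local storage can never block start-up; the node simply falls back to discovery.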

PRs:

  • [x] use multiple peers for lightpush & filter instead of just one (currently): #1779
  • [x] introducing caching/local storage as a discovery module & storing peers: https://github.com/waku-org/js-waku/pull/1811
  • [ ] connecting & using cached peers, cycling with fastest peers once established, updating cache

The scope of unreliability can be tackled as a followup PR

cc @jm-clius @waku-org/js-waku-developers please let me know if you have thoughts

danisharora099 avatar Jan 10 '24 09:01 danisharora099

2. if cache exists on startup:

* connect to the cached peers

* once connections are established, flush out the cache & use to the new "fastest peers"

What peers? do you mean you do DNS discovery and peer exchange?

* periodically "review" peers for reliability & if a node is "unreliable", swap it out for a random node
  
  * update cache if necessary

fryorcraken avatar Feb 06 '24 03:02 fryorcraken

What peers? do you mean you do DNS discovery and peer exchange?

"Cache exists on startup" means the nodes that we were previously able to connect to healthily and that are stored in our local storage. We connect to them, run PX on them, find new peers, and eventually remove the cached peers and add the newly found ones, so that we don't keep reusing the same peers to connect to.
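One possible shape for that replacement step is sketched below. `cycleCachedPeers` is a hypothetical name, and the policy of keeping old entries only when too few new peers were found is an assumption rather than something agreed in this thread.

```typescript
// Hypothetical cache-cycling step. Newly discovered peers take priority;
// previously cached peers are kept only to fill any remaining slots.
function cycleCachedPeers(
  cached: string[],
  discovered: string[],
  maxSize: number
): string[] {
  const next: string[] = [];
  const addIfRoom = (peerId: string): void => {
    if (next.length < maxSize && next.indexOf(peerId) === -1) {
      next.push(peerId);
    }
  };
  discovered.forEach((peerId) => addIfRoom(peerId));
  cached.forEach((peerId) => addIfRoom(peerId));
  return next;
}
```

Once enough peers have been discovered through PX, none of the old entries survive, which avoids reconnecting to the same peers on every start-up.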

danisharora099 avatar Feb 07 '24 06:02 danisharora099

remaining:

  • [ ] cycling of peers in local storage after startup, when new peers are discovered
  • [ ] disconnection from unreliable peers and connection to new ones for light protocols (to be tackled after #1886)

danisharora099 avatar Mar 06 '24 10:03 danisharora099