
RFC: an architecture towards peer-pad v1.0

Open pgte opened this issue 7 years ago • 1 comments

Current problems

Use of websocket-star

Currently, we use a websocket-based protocol to provide two things:

  • Discovery: to find new peers interested in the app
  • Relay: to be able to send messages to these peers, we relay them via this websocket-star server.

The problems are: a single server is required for peers to find each other, that server is a single point of failure, and it has to handle all the traffic between peers. This was always meant to be a temporary solution.

Use of a custom representation

In the past, when using operation-based CRDTs, we dabbled with representing them in IPLD. This didn't work for us, not because of IPLD itself, but because deep, narrow trees (as required by operation-based CRDTs) are hard to sync efficiently.

To circumvent this and other problems, we switched to delta-CRDTs (they naturally support snapshotting) and vector clocks (VCs make it easy to compute the differences and just send the differences in one compact delta). Now nodes are able to get in sync without having to replicate the entirety of a big log tree. Instead, they get a full-state snapshot from a neighbour node and get deltas as the new changes arrive.
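The delta computation described above can be sketched roughly as follows. This is a minimal illustration and not peer-star's actual implementation: it assumes a grow-only set as the replicated value, and the names `Replica`, `delta_for` and `apply_delta` are hypothetical. Each replica keeps a vector clock (replica id mapped to the highest sequence number seen from that replica) and uses the remote clock to select only the operations the remote is missing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

VectorClock = Dict[str, int]  # replica id -> highest sequence number seen
Op = Tuple[str, int, str]     # (origin replica id, sequence number, element)


@dataclass
class Replica:
    """Hypothetical replica of a grow-only set, for illustration only."""
    replica_id: str
    clock: VectorClock = field(default_factory=dict)
    log: List[Op] = field(default_factory=list)

    def add(self, element: str) -> None:
        # A local mutation advances this replica's own clock entry.
        seq = self.clock.get(self.replica_id, 0) + 1
        self.clock[self.replica_id] = seq
        self.log.append((self.replica_id, seq, element))

    def value(self) -> Set[str]:
        return {element for _, _, element in self.log}

    def delta_for(self, remote_clock: VectorClock) -> List[Op]:
        # Only operations the remote has not seen yet go into the delta.
        return [op for op in self.log if op[1] > remote_clock.get(op[0], 0)]

    def apply_delta(self, delta: List[Op]) -> None:
        for origin, seq, element in delta:
            # Accept only the next expected operation per origin, so that
            # re-applying the same delta has no effect.
            if seq == self.clock.get(origin, 0) + 1:
                self.log.append((origin, seq, element))
                self.clock[origin] = seq
```

With two replicas `a` and `b`, calling `b.apply_delta(a.delta_for(b.clock))` transfers only what `b` is missing; once both directions have been exchanged, the two clocks and values are equal.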

The main problem with not using an IPLD representation is persistence.

If you want to create a pad, add some text and then share it with a friend, you need to be online when that friend opens the pad; otherwise the content is not available in the network. Sure, peer-star now has a pinning service that tracks collaborations and persists them, but that is, again, a temporary solution. We should be able to leverage the IPFS infrastructure and pinning so that a) we don't need to maintain special-purpose services and b) the pinning market can grow autonomously with the rising tide of IPFS as a whole.

We need to represent a CRDT in IPLD to allow transparent pinning.

An architecture towards Peer-pad (and peer-star) v1.0

To address the concerns above, and given recent developments in js-ipfs and go-ipfs, we propose a new architecture for these parts of Peer-pad (through changing these aspects of peer-star):

Discovery

Instead of using the websocket-star (or wrtc-star) service, we should change to the rendezvous protocol for discovery.

Rendezvous spec proposal

This service, being implemented by go-ipfs, would allow a peer to connect to any given node in a set of well-known nodes, and discover other peers interested in the same app or collaboration as them.

Collaboration Membership

Peers in a collaboration need to form a list of the members belonging to that collaboration. So far we have used an application-specific pubsub channel to broadcast changes to membership as they occur.

Instead, we should use the rendezvous protocol to register membership.

Connectivity

The new connectivity bootstrap flow should be something like this:

  • [ ] Using the rendezvous protocol, discover circuit relay nodes to which we can connect.
  • [ ] Connect to a circuit relay node, obtaining a new multi-address
  • [ ] When necessary, register membership in a given collaboration using this new multi-address
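As a rough sketch of the three steps above, the flow can be modelled with an in-memory stand-in for a rendezvous node. Everything here is hypothetical and for illustration only: `RendezvousPoint`, `bootstrap`, the `relays` namespace and the peer ids are assumptions, not the rendezvous spec or any js-ipfs or go-ipfs API. The relayed address simply appends `/p2p-circuit/p2p/<peer id>` to the relay's address, following the general shape of circuit addresses.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set

RELAY_NAMESPACE = "relays"  # hypothetical namespace where relay nodes register


class RendezvousPoint:
    """In-memory stand-in for a rendezvous node: namespace -> addresses."""

    def __init__(self) -> None:
        self._registrations: DefaultDict[str, Set[str]] = defaultdict(set)

    def register(self, namespace: str, address: str) -> None:
        self._registrations[namespace].add(address)

    def discover(self, namespace: str) -> List[str]:
        # Sorted only to make the result deterministic in this sketch.
        return sorted(self._registrations.get(namespace, set()))


def bootstrap(rendezvous: RendezvousPoint, peer_id: str, collaboration: str) -> str:
    # Step 1: discover circuit relay nodes through the rendezvous point.
    relays = rendezvous.discover(RELAY_NAMESPACE)
    if not relays:
        raise RuntimeError("no circuit relay nodes registered")
    # Step 2: "connect" to a relay, which yields a relayed multi-address.
    relayed_address = f"{relays[0]}/p2p-circuit/p2p/{peer_id}"
    # Step 3: register membership in the collaboration under that address.
    rendezvous.register(collaboration, relayed_address)
    return relayed_address
```

After two peers call `bootstrap` with the same collaboration name, `discover` on that name returns both relayed addresses, which is the membership list each peer needs.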

Persistence

To achieve high persistence and leverage the existing and new IPFS infrastructure, we need to:

  • [ ] Represent and store snapshots and deltas in IPLD.
  • [ ] Internally, index these by vector clock (to enable fast diff and sync).
  • [ ] Create a tracking service that, for an app, finds and participates (as a listener) in all collaborations, pinning snapshots to an IPFS cluster.
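The first two items could look roughly like the sketch below. It is an assumption-laden illustration, not an IPLD implementation: a SHA-256 hex digest over canonical JSON stands in for a real content identifier, and `BlockStore`, `put` and `newer_than` are hypothetical names. The point is that snapshots and deltas become immutable, content-addressed blocks (so any pinning service can hold them), while a local index keyed by vector clock answers "which blocks does this peer still need?".

```python
import hashlib
import json
from typing import Dict, List, Tuple

VectorClock = Dict[str, int]
ClockKey = Tuple[Tuple[str, int], ...]


def clock_key(clock: VectorClock) -> ClockKey:
    # Canonical, hashable form of a vector clock, used as the index key.
    return tuple(sorted(clock.items()))


def dominates(a: VectorClock, b: VectorClock) -> bool:
    # True when clock a has seen everything that clock b has seen.
    return all(a.get(replica, 0) >= count for replica, count in b.items())


class BlockStore:
    """Hypothetical content-addressed store with a vector clock index."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}   # content id -> encoded block
        self.index: Dict[ClockKey, str] = {}  # vector clock -> content id

    def put(self, clock: VectorClock, kind: str, payload: List[str]) -> str:
        block = {"clock": clock, "kind": kind, "payload": payload}
        data = json.dumps(block, sort_keys=True).encode("utf-8")
        # Stand-in for a CID: identical content always yields the same id.
        content_id = hashlib.sha256(data).hexdigest()
        self.blocks[content_id] = data
        # This sketch keeps one entry per clock value.
        self.index[clock_key(clock)] = content_id
        return content_id

    def newer_than(self, remote_clock: VectorClock) -> List[str]:
        # Blocks whose clock is not already covered by the remote clock.
        return [cid for key, cid in self.index.items()
                if not dominates(remote_clock, dict(key))]
```

A tracking service in this model would only need the content ids returned by `put` in order to pin the corresponding blocks.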

Quality Assurance

We need to make sure the user experience is good. For this, we need:

  • [ ] Infrastructure monitoring and reliability (from the IPFS infra team)
  • [ ] Automated swarm tests that resemble real users as closely as possible

Automated swarm tests that resemble real users as closely as possible

Create a test suite that spawns multiple browsers running peer-pad, testing for multiple use cases.

These tests should measure:

  • [ ] User responsiveness: the app should not become unresponsive to user input while all users in a collaboration keep making mutations (event loop lag < 300 ms).
  • [ ] Correctness: peers should not go out of sync, i.e., once all peers stop editing, they should all converge to the same value, and no data should be lost.
  • [ ] Latency: sync time should be low (less than 5 seconds for 10-peer collaboration).
  • [ ] Discovery: make sure every peer in a collaboration discovers each other in a short amount of time (< 5 seconds).
  • [ ] Offline support: make sure offline peers are able to mutate state and that these mutations sync correctly to the other members once they reconnect. It should be seamless, not requiring user intervention.
  • [ ] Collaboration scalability: tests exist for all of the above with 5 to 50 peers in one collaboration
  • [ ] Application scalability: An app should scale to 1000 peers, each one in a 2-peer collaboration
  • [ ] Stability: Peer-Pad supports super focused meetings of 30 minutes (needs to support at least 10 people writing frenetically at the same time)
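To make the numeric targets in this list concrete, a test run could be reduced to a handful of measurements and checked against them, as in the sketch below. The thresholds come from the list above (the 5-second sync target is stated for a 10-peer collaboration); the `RunMetrics` fields and the `evaluate` function are hypothetical, and how the measurements are collected from the browsers is left open.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_EVENT_LOOP_LAG_MS = 300  # responsiveness target from the list above
MAX_SYNC_TIME_S = 5          # latency target for a 10-peer collaboration
MAX_DISCOVERY_TIME_S = 5     # discovery target from the list above


@dataclass
class RunMetrics:
    """Hypothetical measurements collected from one swarm test run."""
    max_event_loop_lag_ms: float
    sync_time_s: float
    discovery_time_s: float
    final_values: List[str]  # final document content, one entry per peer


def evaluate(metrics: RunMetrics) -> Dict[str, bool]:
    return {
        "responsiveness": metrics.max_event_loop_lag_ms < MAX_EVENT_LOOP_LAG_MS,
        # All peers must end up with exactly the same document.
        "correctness": len(set(metrics.final_values)) == 1,
        "latency": metrics.sync_time_s < MAX_SYNC_TIME_S,
        "discovery": metrics.discovery_time_s < MAX_DISCOVERY_TIME_S,
    }
```

A run passes when every value in the returned dictionary is true; reporting the checks by name keeps it clear which target a failing run missed.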

Besides these metrics, the tests should be able to extract relevant diagnostics from the runtimes:

  • CPU usage
  • Connections
  • Errors
  • App logs
  • ...?

pgte, Sep 27 '18 11:09

@diasdavid could you please review this, aligned with Q4 OKRs?

pgte, Sep 27 '18 11:09