[META] Presence less WebRTC Connection mechanism

Open fredo opened this issue 4 years ago • 2 comments

Problem

WebRTC relies on the presence state of users to start a connection attempt or a retry. As we are moving towards a temporarily presence-less Raiden client, we need another mechanism to establish WebRTC connections without knowing a peer's presence.

Proposal

The idea is to move away from a fixed channel-creation scheme in which one partner is always the caller and the other the callee. Two nodes that want to create a WebRTC connection with each other must meet three requirements:

  • They have one of the following relationships with each other
    • channel neighbours
    • initiator/target
  • They have webRTC enabled
  • They are both online

Introduction

In the earlier approach, we had a strict protocol for creating WebRTC channels: all of the above requirements had to be met, and a specific channel partner was designated to initiate the signalling process. The actual goal of this protocol is to establish a network of WebRTC channels that roughly aligns with the Raiden Network (and maybe extends a bit beyond it). The strict requirements meant that a Raiden node had to monitor at all times whether all of them were met and, if necessary, start a connection attempt. This introduced many race conditions that had to be taken into account. Channels that are not used frequently are probably not important enough to justify constantly monitoring their WebRTC channel status. Frequently used channels, on the other hand, should be monitored more often, because they contribute more to the performance of the overall network.

Channel creation responsibility

Instead of making one particular address responsible for opening a WebRTC channel, the base rule should be that any node can open a WebRTC channel with a peer of interest. It is then up to the peer to decide whether to accept the request. Moving away from the old scheme, any node can now create a channel in response to whatever event makes one desirable.

Example: Peer coming online

In the old scheme, the peer with the lower address was always responsible for opening a channel. If the peer with the higher address was offline, the lower address had to constantly monitor whether that node had come back online. It would be much simpler to give the responsibility for initiating the signalling process to the higher address: when it comes online, it creates the channel, and the other address does not need to monitor anything.

Allowing multiple channels with the same peer

Many events can tear down or trigger the creation of WebRTC channels. These are often external, non-deterministic events, and they can occur in endless combinations. If each client is allowed to create WebRTC channels, a typical race condition is that both peers feel responsible for opening the channel. In WebRTC, this situation is referred to as glare. It can be sidestepped simply by allowing two WebRTC channels to be created. As long as the receiver accepts messages from both channels, this should not be a problem. If one channel breaks, the peer using it can switch over to the other one. How best to handle two WebRTC channels can be optimized later.
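The failover behaviour described above could look roughly like the following sketch. All names (`RtcLike`, `PeerChannels`, `FakeChannel`) are hypothetical and chosen for illustration; they are not taken from the Raiden code base, and a real implementation would wrap actual `RTCDataChannel` objects.

```typescript
// Hypothetical sketch: none of these names come from the Raiden code base.
interface RtcLike {
  readonly label: string;
  isOpen(): boolean;
  send(data: string): void;
}

class PeerChannels {
  private channels: RtcLike[] = [];

  // Glare is tolerated: a second channel to the same peer is simply kept.
  add(channel: RtcLike): void {
    this.channels.push(channel);
  }

  // Drop channels that are no longer open; returns how many remain.
  prune(): number {
    this.channels = this.channels.filter((c) => c.isOpen());
    return this.channels.length;
  }

  // Send over the first open channel, falling back to the next one.
  // Returns the label of the channel used, or undefined if none is open.
  send(data: string): string | undefined {
    for (const channel of this.channels) {
      if (!channel.isOpen()) continue;
      channel.send(data);
      return channel.label;
    }
    return undefined;
  }
}

// In-memory stand-in for a data channel, used only for demonstration.
class FakeChannel implements RtcLike {
  readonly label: string;
  open = true;
  sent: string[] = [];

  constructor(label: string) {
    this.label = label;
  }

  isOpen(): boolean {
    return this.open;
  }

  send(data: string): void {
    this.sent.push(data);
  }
}
```

On the receiving side nothing special is needed, as long as messages arriving on either channel are processed the same way.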

Health check of the channel's liveness

With the presence-less Raiden, a Raiden node typically does not know whether a peer is online or offline. Upon a connection error, it therefore cannot be sure whether there was a connectivity problem or the peer went offline. Note that this information can be fetched with the address metadata. Since address metadata is fetched on other occasions anyway (principally when sending a message), we can use that information on demand to check the liveness of the corresponding WebRTC channel. In practice, whenever I want to send a message via the channel, I must have just received the corresponding metadata; otherwise, we can assume that the node is offline. Receiving the metadata would then immediately trigger a health check on the WebRTC channel, or a signalling initiation if the channel is broken or non-existent. As a result, the more often a channel is used, the more often it is checked for liveness. This is important because frequently used channels are considered more valuable in terms of performance, so it makes sense to check them more often. If a less frequently used channel breaks, there will be a fixed number of retries. If no connection is possible after that, the peer is considered to be offline.
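A minimal sketch of this on-demand decision is shown below. The type and function names are hypothetical, and the retry limit of 3 is an assumed value, since the proposal only says "a fixed number of retries".

```typescript
// Hypothetical names throughout; this is not the Raiden light-client API.
type RtcState = 'open' | 'broken' | 'none';
type Action = 'health-check' | 'start-signalling' | 'assume-offline';

// Called right before sending a message, after the address-metadata lookup.
function livenessAction(metadataReceived: boolean, rtc: RtcState): Action {
  // No metadata for the peer: treat it as offline.
  if (!metadataReceived) return 'assume-offline';
  // Peer is reachable: verify an existing channel, otherwise (re)create it.
  return rtc === 'open' ? 'health-check' : 'start-signalling';
}

// Fixed retry budget for a broken channel; the default of 3 is an assumption.
function afterFailedAttempt(attempts: number, maxRetries: number = 3): Action {
  return attempts < maxRetries ? 'start-signalling' : 'assume-offline';
}
```

Because the check only runs when a message is about to be sent, its frequency follows channel usage without any separate monitoring loop.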

When to create channels

Here is a list of events that should trigger channel creation. This list can change over time:

  • coming online while a channel partner is already online
  • a channel open event, handled by the channel opener
  • a transfer, where the initiator connects to the target
  • address metadata of a channel partner is received but no WebRTC channel is open
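The trigger list can be sketched as a discriminated union with one decision function. The event names and fields below are hypothetical and only mirror the four bullets above; they are not events from the Raiden code base.

```typescript
// Hypothetical event shapes mirroring the trigger list; not Raiden's real events.
type Trigger =
  | { kind: 'cameOnline'; partnerOnline: boolean }
  | { kind: 'channelOpened'; weAreOpener: boolean }
  | { kind: 'transfer'; weAreInitiator: boolean }
  | { kind: 'metadataReceived'; hasRtcChannel: boolean };

// Returns true when this node should start WebRTC signalling for the event.
function shouldCreateChannel(trigger: Trigger): boolean {
  switch (trigger.kind) {
    case 'cameOnline':
      return trigger.partnerOnline;
    case 'channelOpened':
      return trigger.weAreOpener;
    case 'transfer':
      return trigger.weAreInitiator;
    case 'metadataReceived':
      return !trigger.hasRtcChannel;
  }
}
```

Adding a new trigger then means adding one union member and one `case`, which keeps the list easy to change over time.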

Open questions

  • The caller will perform retries if the connection could not be established. How does the node know when it can stop retrying because the peer is offline? A fixed number of retries should be the limit.

  • How should the callee react to a WebRTC channel creation attempt? With the above proposal, the callee might not (yet) know the caller's reason for attempting to create a channel, e.g. because it has not seen the ChannelCreationEvent yet. In this case, should the callee simply accept any channel creation? IMO this is okay, as the user could also be spammed by to-device messages. Each client implementation could decide whether or not to block channel creation attempts. On the other hand, it helps to establish a network of WebRTC channels faster if we simply try to accept all channels and handle possible attacks later.

  • The Perfect Negotiation scheme might be useful in the future to handle glare (simultaneous connection attempts from both sides), e.g. with the lower address always being the impolite partner.
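For the last point, a role assignment in the spirit of Perfect Negotiation could be sketched as follows. In that pattern, when offers collide, the impolite peer ignores the incoming offer while the polite peer rolls back its own offer and accepts the remote one. The function names are hypothetical, and the sketch assumes addresses are equal-length hex strings so that a case-insensitive string comparison orders them like a numeric one.

```typescript
// Assumption: both addresses are equal-length hex strings (e.g. '0x' + 40 hex
// digits), so lowercased string comparison matches numeric comparison.
// The lower address is impolite, the higher address is polite.
function isPolite(self: string, peer: string): boolean {
  return self.toLowerCase() > peer.toLowerCase();
}

type OfferDecision = 'accept' | 'ignore' | 'rollback-and-accept';

// `offerInFlight` means we have sent an offer that has not been answered yet.
function onRemoteOffer(self: string, peer: string, offerInFlight: boolean): OfferDecision {
  // No collision: just answer the offer.
  if (!offerInFlight) return 'accept';
  // Collision (glare): the polite side yields, the impolite side keeps its offer.
  return isPolite(self, peer) ? 'rollback-and-accept' : 'ignore';
}
```

With this rule exactly one of the two colliding offers survives, so a single channel results instead of two.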

Old Notes

It is not supposed to happen, but both nodes could try to establish a connection at the same time. There should be no problem if the clients accept the creation of both channels and communicate over them separately. As long as both clients accept messages from both channels, this should be fine.

It is also not supposed to happen, but in some weird scenario neither client might try to create a channel. We could then recheck channel creation on demand whenever we fetch metadata for the partner, which happens for direct payments and for payments routed over the partner. In this scenario, the network would check its own health and recover through the payments themselves. The more the network is used, the more often it would recheck its health state. With that mechanism, nodes could eventually give up retrying channel creation and be woken up whenever there is activity on the channel again.

fredo avatar Feb 26 '21 09:02 fredo

We have implemented this approach in https://github.com/raiden-network/light-client/pull/2683 / https://github.com/raiden-network/light-client/commit/581e2883c46892fc38708f100b7c403dc4a04287, and it works great so far! I copy here the relevant part of that PR's description:

Also, WebRTC now has a new signaling protocol, backwards compatible with the past protocol:

  • All nodes now act as both caller and callee, i.e. the fixed ordering based on sorted addresses was removed
  • Callee starts listening to the peer's messages as soon as the peer is whitelisted, and will try to answer as soon as an offer is received; it uses/respects the callId (channel.label) sent by the caller, and doesn't mind the caller's webRTC capability
  • Caller starts trying to call the peer upon certain events of interest
    • Currently:
      • channelMonitored (startup partners, new channel opened partners)
      • transferSigned (transfer's initiator, target; can be removed/narrowed later if we can exchange secrets without a direct RTC channel)
      • messageSend.request (to retry when trying to message a peer which we don't have an RTC channel with yet)
      • rtcChannel with payload=undefined (previous/last RTC channel closed)
    • Caller always retrieves presence from the PFS, and errors if the peer doesn't have the webRTC capability set
    • Caller retries a limited number of times (expo backoff, 5s-60s delays, ~6 retries), then gives up and retries again iff a new event from above is emitted
  • At all times, for each peer, callee and caller channels race; the winner cancels the other's attempt to establish the channel.
  • Since callee is always listening, peer's caller could retry and would still be able to get a new channel through
  • If a node has been online for some time, its caller should have given up; a partner coming online will then get their caller handler calling, so it always gets connected almost instantly (except for matrix's toDevice message delays)
  • If both nodes are coming online at the same time, and they get each other's presence as offline at first, caller's retry loop will get them connected after ~5s at most, so races are always gracefully handled
  • This protocol also works for a matrixless transport, where one side can't receive calls (maybe a web dApp which can't expose a REST endpoint for such) but is also always expected to connect to full-node partners which are always online and have such open endpoints
  • This is backwards compatible with the previous WebRTC algorithm: since a node acts as both caller & callee, a previous transport requiring/assuming a fixed role will always succeed.
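As an illustration of the caller's retry schedule mentioned above (expo backoff, 5s-60s delays, ~6 retries), here is a minimal sketch. The function name and the doubling factor are assumptions; the PR description only gives the delay bounds and the approximate retry count.

```typescript
// Hypothetical helper; the growth factor of 2 is an assumption, only the
// 5s-60s bounds and the ~6 retries come from the PR description.
function backoffDelays(
  retries: number = 6,
  minMs: number = 5000,
  maxMs: number = 60000,
  factor: number = 2,
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    // Grow exponentially from minMs, but never wait longer than maxMs.
    delays.push(Math.min(maxMs, minMs * Math.pow(factor, i)));
  }
  return delays;
}
```

Under these assumptions `backoffDelays()` yields waits of 5s, 10s, 20s, 40s, 60s and 60s, after which the caller would give up until one of the listed events fires again.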

andrevmatos avatar May 04 '21 20:05 andrevmatos

  • [x] #7151
  • [x] #7181
  • [x] #7189
  • [ ] #7158

istankovic avatar Jun 24 '21 13:06 istankovic