
WebRTC as a transport for voice

Open Johni0702 opened this issue 5 years ago • 8 comments

This is an extension to #2131 (WebSocket for the control channel) in an effort to properly support purely browser-based Mumble clients.

Motivation

While WebSocket support alone would already allow for fully functional browser clients by tunneling voice over the control channel, the resulting client implementation is a huge mess (including but not limited to compiling the codec libraries to JavaScript, abusing ScriptProcessorNodes and, most importantly, hoping for the scheduler and GC not to get in the way). While the first of those only results in large JS blobs and bad performance, the latter two can (and, if the system is under load, will) cause the audio to be randomly interrupted or delayed by as much as multiple seconds. It boils down to the fact that handling real-time data in mostly pure JS on an ordinary web page is not a good idea.

Using WebRTC solves all of the above by pushing all of the real-time data handling to the browser. If all clients use the Opus codec, this can even be done in a backwards compatible way.

Overview of the relevant WebRTC internals

When talking about WebRTC, the part relevant to Mumble is mostly the set of protocols used, less so the JavaScript APIs the term usually refers to. In this particular case those protocols are STUN, ICE (not to be confused with the RPC library used in Murmur; that is a completely unrelated project), DTLS, SRTP and RTP (layered in that order).

RTP

https://tools.ietf.org/html/rfc3550 https://tools.ietf.org/html/rfc7587
This is the uppermost layer used in WebRTC when transmitting real-time data (e.g. audio). RTP packets are usually sent over UDP (with some additional layers in between) and are in many respects similar to the voice packets used by Mumble. Each data source is identified by its SSRC (Synchronization SouRCe), similar to the session id in Mumble. A packet carries an SSRC, a timestamp (the unit depends on the codec; for Opus it's in samples, i.e. 48,000 per second), a sequence number, the actual data (e.g. audio) and other, less relevant fields.
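For orientation, a minimal sketch in Rust of the fixed RTP header fields just mentioned (RFC 3550, section 5.1); the struct and function names are illustrative, not from any particular library, and CSRC entries and header extensions are ignored:

struct RtpHeader {
    marker: bool,         // set on the first packet after a discontinuity
    payload_type: u8,     // 7 bits; identifies the codec (e.g. Opus)
    sequence_number: u16, // increments by one per packet
    timestamp: u32,       // for Opus: in samples at 48 kHz
    ssrc: u32,            // identifies the data source, like Mumble's session id
}

fn parse_rtp_header(buf: &[u8]) -> Option<RtpHeader> {
    // 12-byte fixed header; version (top two bits) must be 2
    if buf.len() < 12 || buf[0] >> 6 != 2 {
        return None;
    }
    Some(RtpHeader {
        marker: buf[1] & 0x80 != 0,
        payload_type: buf[1] & 0x7f,
        sequence_number: u16::from_be_bytes([buf[2], buf[3]]),
        timestamp: u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]),
        ssrc: u32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]),
    })
}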

RTP has a companion protocol named RTCP which is used for transmitting metadata about SSRCs and reporting packet loss among other things but it's mostly irrelevant to Mumble (until Mumble does video).

SRTP and DTLS

https://tools.ietf.org/html/rfc3711 https://tools.ietf.org/html/rfc5764
The SRTP layer provides encryption and authentication for RTP packets. DTLS (TLS for UDP) is only used for the handshake and to establish key material for SRTP. One important conceptual difference between SRTP and Mumble's UDP crypto is that SRTP derives the key used for a particular packet from its SSRC and its sequence number, whereas Mumble uses only the index of the packet, which depends on neither the source of the packet nor the sequence number within the voice transmission. (Small detail: since the sequence number in RTP packets is only 16 bits, the SRTP implementation maintains an internal rollover counter which also goes into determining the key used.)
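To make that detail concrete: RFC 3711 (section 3.3.1) extends the 16-bit sequence number with the rollover counter (ROC) to form the implicit packet index. A one-line sketch in Rust:

// SRTP packet index per RFC 3711: i = 2^16 * ROC + SEQ
fn srtp_packet_index(roc: u32, seq: u16) -> u64 {
    ((roc as u64) << 16) | seq as u64
}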

A result of that difference is that some cryptographic information (e.g. replay list) needs to be retained for each SSRC for the whole session, so the number of used SSRCs should be kept low. (Another reason for keeping the number of used SSRCs low is that WebRTC mandates that MediaStreams cannot be removed, only set to inactive.)

ICE and STUN

These are used to establish a connection between two peers through NATs. In the Mumble case, one of the two peers is the server, which needs to be publicly reachable anyway, so NATs shouldn't be much of a problem.

Proposed protocol changes

Unsurprisingly a few extensions to the Mumble protocol are required to use WebRTC as the voice transport.

To indicate support for WebRTC, a new field is added to the Authenticate messages:

// Whether to use WebRTC instead of native UDP packets.
optional bool webrtc = 6 [default = false];

If the server supports WebRTC and the client indicates its support with the above flag, then the server must send initialization data for the WebRTC connection (similar to the CryptSetup message) before completing the connection via a ServerSync and before sending any UserState packets. This allows WebRTC-only clients to recognize old servers which do not support WebRTC and show an error message to the user.
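Roughly, the connection setup would then look as follows; the standard messages follow the existing handshake, while the WebRTC and IceCandidate messages are the additions proposed below (exact ordering of the standard messages is a sketch, not a spec):

Client -> Server: Version, Authenticate (webrtc = true)
Server -> Client: Version, CryptSetup, CodecVersion, ...
Server -> Client: WebRTC (ICE/DTLS parameters, see below)
Client <-> Server: IceCandidate (any number, in both directions)
Server -> Client: ChannelState and UserState messages
Server -> Client: ServerSync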

SDP vs bare minimum

When building an application on top of WebRTC, one usually passes SDP messages between the participating peers. From the application's point of view, SDP messages are just blobs of data which WebRTC uses to negotiate transport, codec and other settings. However, IMO the better approach for Mumble is to only pass the minimally required amount of information (i.e. the fingerprint of the DTLS server and some data for ICE) and let the client construct the SDP itself (if it even needs to). The main reasons I'm against passing whole SDP messages are:

  • There's little to gain. SDP can negotiate tons of stuff but almost all of it is non-negotiable when compatibility with native Mumble clients is required.
  • Whenever a new user connects to the server, we need to register their SSRC with the browser's WebRTC implementation, which involves a full round trip of SDP passing (and the size of SDP messages scales with the number of users online).
  • There are two more abstractions on top of RTP before we reach JavaScript land, which are used to map RTP streams to MediaStreamTracks. Their details don't matter to the packets going over the network and I see no reason why the server should be tasked with managing them. It would also make browser-specific support harder (there are even two different ways of configuring them, named "Plan B" and "Unified Plan"; while the latter is the standard, the former is still the default setting in Chrome).

The proposed initialization data referred to in the previous segment would therefore look as follows:

message WebRTC {
	optional string ice_pwd = 1;
	optional string ice_ufrag = 2;
	optional string dtls_fingerprint = 3;
}
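For illustration, a browser client could assemble its remote (i.e. server-side) session description from those three fields roughly as follows; everything in angle brackets is a placeholder, and the exact attribute set required varies by browser:

v=0
o=- 0 0 IN IP4 0.0.0.0
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 97
c=IN IP4 0.0.0.0
a=ice-ufrag:<ice_ufrag>
a=ice-pwd:<ice_pwd>
a=fingerprint:sha-256 <dtls_fingerprint>
a=setup:passive
a=rtpmap:97 opus/48000/2
a=sendrecv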

Additionally (and this is the case with both approaches), ICE candidates need to be exchanged between client and server (these contain addresses and ports for the client and server to find each other at and to use for RTP passing):

message IceCandidate {
	required string content = 1;
}
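The content field would carry a standard ICE candidate attribute value (RFC 5245). For example, a host candidate might look like this (address, port and priority are placeholders):

candidate:0 1 UDP 2122252543 198.51.100.7 46000 typ host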

Mapping SSRCs to users

As opposed to session ids, the total number of SSRCs used should be kept low. As such, a new field should be added to the UserState message which contains the SSRC used for the user:

// Unique SSRC from which the user's audio is sent when using WebRTC.
//
// As opposed to `session`, this value must not be monotonically increasing
// but must instead be re-used where possible (i.e. after a user disconnects).
// The WebRTC implementation must keep track of all SSRCs ever used for the
// entirety of the WebRTC session, so that number needs to be kept low.
optional uint32 ssrc = 20;

Alternatively, the requirements on the session id could be changed to conform to the requirements on SSRC values. I'm not sure whether there are any other requirements which would conflict with the SSRC ones, so I've kept them separate for now. (I've also kept them separate for a practical reason: it doesn't require you to do session id re-mapping when building a proxy.)

SSRC 0 should be reserved for the client to be used when sending audio to the server (server loopback would then return on the SSRC indicated in the client's own UserState message). Note: The proof of concept implementation currently uses a random SSRC for serverbound audio which works as well until it randomly chooses a low SSRC and collides with one of the other users' SSRCs (it was just easier to implement).
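A minimal sketch of how a server might allocate SSRCs under these constraints (the structure and names are illustrative, not from the POC): re-use freed values where possible and reserve 0 for serverbound audio.

use std::collections::BTreeSet;

struct SsrcAllocator {
    freed: BTreeSet<u32>, // SSRCs of disconnected users, available for re-use
    next: u32,            // next never-used SSRC; starts at 1, 0 is reserved
}

impl SsrcAllocator {
    fn new() -> Self {
        SsrcAllocator { freed: BTreeSet::new(), next: 1 }
    }

    fn allocate(&mut self) -> u32 {
        // Prefer a previously freed SSRC to keep the total number of
        // SSRCs ever used in the WebRTC session low.
        if let Some(ssrc) = self.freed.pop_first() {
            return ssrc;
        }
        let ssrc = self.next;
        self.next += 1;
        ssrc
    }

    fn release(&mut self, ssrc: u32) {
        self.freed.insert(ssrc);
    }
}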

Indicating talking state

Mumble voice packets contain a target ID field, i.e. for client-bound packets: normal, whisper, shout, loopback; for server-bound packets: normal, VoiceTarget, loopback. There is no equivalent in RTP short of allocating multiple SSRCs for each user. Additionally, RTP has no equivalent for the last-packet marker and there's no good way to determine whether a user is currently talking via the JavaScript API. The solution here is to move talking state indication into the control protocol (at least for WebRTC clients):

// Indicates whether a user is currently talking (or whispering or shouting).
// Also used to set own talking state.
// Only sent when WebRTC is used, otherwise this information can be deduced
// from the UDP packets.
message TalkingState {
	// User whose state this is
	optional uint32 session = 1;
	// Target, as used in UDP packets:
	// Clientbound: 0 is normal talking, 1 is shout, 2 is whisper, 31 is server loopback
	// Serverbound: 0 is normal talking, 1-30 as per VoiceTarget, 31 is server loopback
	optional uint32 target = 2;
}

When sent by the server, these are purely for display in the UI and should not influence audio processing in any way. When sent by the client, these indicate its intent to start/stop talking and the server should subsequently start/stop handling the RTP packets from the client (RTP packets might even be sent when the user isn't talking, though the client can make sure those contain silence only). A delay of these messages might result in some packets missing at the beginning of a user's own voice transmission, however this shouldn't be much of a problem as those packets would probably have been lost anyway if Mumble's Voice over UDP was used instead.
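A minimal sketch of that rule on the server side, assuming (this is an assumption, the message above doesn't spell it out) that a TalkingState with an absent target means the user stopped talking; names are illustrative:

struct VoiceState {
    // Target from the client's last TalkingState message; None while silent.
    current_target: Option<u32>,
}

fn on_talking_state(state: &mut VoiceState, target: Option<u32>) {
    // Client announced intent to start (Some) or stop (None) talking.
    state.current_target = target;
}

fn should_forward_rtp(state: &VoiceState) -> bool {
    // RTP packets may keep flowing while the user is silent; only handle
    // them between a start and a stop message.
    state.current_target.is_some()
}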

Note that this requires the server to track the current talking state of each user, which it currently doesn't do (afaik). Doing so shouldn't require much processing power though and is required anyway (see the following point about Mumble-to-RTP translation).

Some further details

Translating Mumble Voice transmissions to RTP streams

An RTP stream is more persistent than a voice transmission in Mumble in the sense that a single transmission lasts from pressing the PTT key to releasing it, whereas an RTP stream lasts from the user's initial connection until they disconnect. This is required because there is no quick way to add new RTP streams on demand whenever a user starts talking without using the control channel and introducing delays or loss of packets.

As such, multiple consecutive Mumble voice transmissions by the same user need to be stitched one after the other into the same RTP stream. The only thing to watch out for is that no huge jumps in the RTP sequence number occur, as those can cause the crypto to get out of sync. Other than that, this is rather easy to implement by keeping an RTP sequence number offset and adding the Mumble sequence number on top. This also transparently passes on any jitter in the packets. Note that Mumble's sequence numbers do not have to start at 0 though, so an additional offset needs to be kept to prevent huge jumps in the resulting RTP sequence numbers.

Since the RTP timestamp for Opus is just the number of samples elapsed, it can simply be calculated as 480 * rtp_seq_num. If the marker bit in the RTP header is set for the first RTP packet of each transmission, the client will deal alright with the discontinuity.
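Putting the above together, a minimal sketch of the stitching; names are illustrative, not the POC code, and it assumes 10 ms packets as implied by the 480-samples figure:

struct OutgoingStream {
    rtp_seq_base: u16,            // RTP sequence number where the current transmission starts
    mumble_seq_base: Option<u64>, // first Mumble sequence number of the current transmission
}

impl OutgoingStream {
    fn new() -> Self {
        OutgoingStream { rtp_seq_base: 0, mumble_seq_base: None }
    }

    // Translate one forwarded Mumble voice packet; returns the RTP
    // sequence number, timestamp and marker bit to use.
    fn translate(&mut self, mumble_seq: u64) -> (u16, u32, bool) {
        let first = self.mumble_seq_base.is_none();
        let base = *self.mumble_seq_base.get_or_insert(mumble_seq);
        // Offset by the first Mumble sequence number so consecutive
        // transmissions stitch together without huge sequence jumps.
        let rtp_seq = self.rtp_seq_base.wrapping_add((mumble_seq - base) as u16);
        // Opus RTP timestamps count samples at 48 kHz; one 10 ms packet
        // is 480 samples.
        let timestamp = (rtp_seq as u32).wrapping_mul(480);
        // The marker bit on the first packet of each transmission lets the
        // receiver cope with the timestamp discontinuity.
        (rtp_seq, timestamp, first)
    }

    // Called when a transmission ends so the next one continues where
    // this one left off.
    fn end_transmission(&mut self, last_mumble_seq: u64) {
        if let Some(base) = self.mumble_seq_base.take() {
            self.rtp_seq_base = self
                .rtp_seq_base
                .wrapping_add((last_mumble_seq - base) as u16)
                .wrapping_add(1);
        }
    }
}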

For a POC implementation in Rust, see here.

Translating RTP streams to Mumble Voice transmissions

This is far simpler than the other way around. The server merely has to store the current talking state and the RTP offset from when the user started talking (TalkingState message) and can then convert from RTP to Mumble as one would expect.
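A corresponding sketch of that direction (again, names are illustrative):

struct IncomingStream {
    // RTP sequence number of the first packet after the TalkingState
    // message announced the start of a transmission.
    rtp_seq_base: Option<u16>,
}

impl IncomingStream {
    // Called for every accepted RTP packet while the user is talking.
    fn translate(&mut self, rtp_seq: u16) -> u64 {
        // The first packet defines the offset; Mumble sequence numbers
        // within a transmission then start at 0.
        let base = *self.rtp_seq_base.get_or_insert(rtp_seq);
        rtp_seq.wrapping_sub(base) as u64
    }
}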

For a POC implementation in Rust, see here.

Positional audio

As far as I am aware, WebRTC does not support positional audio. While RTP provides support for extending its header, WebRTC only supports a specific set of header extensions, none of which provides anything like positional audio. So for now, positional audio will not be supported.

Multiple voice streams

While multiple outgoing streams for one client are technically supported by the Mumble protocol, I see no use case aside from bots which whisper different things to different people, and those probably shouldn't be put in the browser. Since such bots will continue to exist, however, this must be kept in mind when implementing the transmission tracking on the server (it should be tracked per target+user, not just per user). RTP does not support multiple streams for one SSRC, so only one stream at a time can be received per user and only one at a time can be sent by the client (I believe this matches the behavior of the native Mumble client, though I'm not entirely sure).

Other codecs

It might be possible to support other codecs like CELT and Speex if the browser has support for them (RTP can support different codecs). I haven't yet looked into that though.

The POC only supports Opus and always assigns it the RTP payload type 97. If multiple codecs were to be supported, the proper way to do so would probably involve indicating support of, and assigning a payload type for, each codec in the WebRTC message, or using the current mechanism of indicating codec support and defining fixed payload types for each codec (as is done with Mumble's UDP voice protocol).
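If the first route were taken, the WebRTC message could for example grow per-codec payload type fields; the field names and numbers below are hypothetical, not part of the proposal:

message WebRTC {
	optional string ice_pwd = 1;
	optional string ice_ufrag = 2;
	optional string dtls_fingerprint = 3;
	// Hypothetical: payload types assigned by the server; an unset field
	// would mean the codec is not available.
	optional uint32 opus_payload_type = 4;
	optional uint32 celt_payload_type = 5;
	optional uint32 speex_payload_type = 6;
}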

Proof of concept

Mumble/TLS/TCP to WebRTC/WebSocket/TLS proxy: https://github.com/Johni0702/mumble-web-proxy
WebRTC support in mumble-web: https://github.com/Johni0702/mumble-web/tree/webrtc
Lib used by mumble-web (the WebRTC happens here): https://github.com/Johni0702/mumble-client/tree/webrtc
Demo: https://voice.johni0702.de/webrtc/?address=voice.johni0702.de&port=443/demo

Also be aware that the comments in the .proto files (especially the ones about SDP) used in the POC might be out of date.

Johni0702 avatar Dec 16 '18 15:12 Johni0702

This is a heroic task. I've been idly wondering whether Mumble or WebRTC is the better choice for the future, and here you are gluing them together. This is amazing; my hat's off to you.

kousu avatar May 03 '20 01:05 kousu

@Johni0702 if I understand this correctly you have started working on this feature in a branch of yours. What's the status of it?

Krzmbrzl avatar May 03 '20 17:05 Krzmbrzl

@Krzmbrzl All I have done is listed under the "Proof of concept" section, i.e. a mumble-web version which uses WebRTC instead of UDPTunnel messages and a proxy which converts between WebSocket+WebRTC and TCP+UDPTunnel (assuming it's running on the same machine as the server). IIRC, at the time of writing of the issue, both of those were working as well as the normal mumble-web version (though neither has really had much testing at all).

I have not touched Murmur because I was not (and am still not) particularly familiar with C++, especially given how much network-facing code would be involved.

Johni0702 avatar May 03 '20 17:05 Johni0702

Okay thanks for the update :+1:

Krzmbrzl avatar May 03 '20 17:05 Krzmbrzl

@Krzmbrzl Also interesting, grumble seems to be capable (or maybe have been capable) of using mumble-web without the proxy: https://github.com/mumble-voip/grumble/issues/33

This is really an interesting project; I planned to set it up for a friend who does not like to fiddle around with installing things etc., but never tried it, also due to grumble's still-missing config support. See: https://github.com/mumble-voip/grumble/pull/26

Update: I just read that there seems to be a difference between the HTML5 version of mumble-web and the new webrtc branch:

Note: This WebRTC branch is not backwards compatible with the current release, i.e. it expects the server/proxy to support WebRTC which neither websockify nor Grumble do.

https://github.com/Johni0702/mumble-web/tree/webrtc

toby63 avatar May 03 '20 17:05 toby63

WebRTC is interesting because it has a better echo-cancel feature than the old one in Mumble based on Speex. With PulseAudio you can test that feature via $ pactl load-module module-echo-cancel, which creates a sink and source in PA. After testing it, it was as good as Mumble 1.1.x with 6 channels, output to Center (positional audio disabled) and echo cancellation enabled (at least with ASIO).

We need the echo-cancel feature of WebRTC because it works with more than one channel. By the way, as soon as the output goes to stereo speakers, Mumble's echo cancellation gets worse.

https://forum.freifunk-muensterland.de/t/mumble-script/3695/14

Chris2000SP avatar May 20 '20 16:05 Chris2000SP

WebRTC is interesting because it has a better echo-cancel feature than the old one in Mumble based on Speex.

Not necessarily true. As it turned out, the echo cancellation in Mumble was broken. It is going to be fixed by #4167.

Krzmbrzl avatar May 20 '20 16:05 Krzmbrzl

Echo cancellation (as in audio processing) is (in theory) independent of WebRTC (as in network protocol). WebRTC does not provide echo cancellation. Of course a Mumble web client that runs in Chrome can benefit from Chrome's echo cancellation and noise suppression (via the webrtc library), but that shouldn't be the reason to implement WebRTC (the protocol).

The reason to implement WebRTC: there is already a working web client (https://github.com/Johni0702/mumble-web) and it would be nice to have WebRTC support out-of-the-box without a special proxy in-between.

streaps avatar May 21 '20 18:05 streaps