
Better documentation of the protocol

douardda opened this issue 3 years ago • 5 comments

Hi, I know you've made efforts recently to write some documentation for the protocol used in Snapcast, and this is very valuable. But there are aspects I still miss. I am currently trying to write an ESP32 Snapcast client using the ESP-ADF framework, partially based on the work of @jorgenkraghjakobsen in https://github.com/jorgenkraghjakobsen/snapclient (using an ESP32-S2-Saola board plus a PCM5142 DAC on a small breakout board made by Polyvection).

I've finally reached the point where it can play sound from the Snapcast stream, but only with the PCM or FLAC codecs; I have not yet been able to get the Opus or Ogg codecs to work.

My questions on the protocol are:

  • What is the format of the payload of the Codec Header message? Thanks to the work of @jorgenkraghjakobsen I can see what it's supposed to be for the Opus codec, but not for FLAC, Ogg or raw PCM, and I haven't yet figured it out by digging in your source code.
  • I don't really understand the time synchronization protocol. (First of all, it took me a while to understand that the timestamps sent by the server are not relative to the epoch but count from boot time, so there is no need for the client to have an accurate date/time, e.g. via NTP.) How is the network latency estimated? For example, in the formula latency_c2s = t_server-recv - t_client-sent + t_network-latency, where does the network latency come from, and how is it estimated? Then the doc says "Calculates the time diff between server and client as [...]", but I don't understand how to use this value afterwards to adjust my player's speed and resync with the server (note that for now I have no clue how to adjust the playback speed on the ESP32, but that's another story).
  • So I'm not sure what value to put in the latency field of the Time message I send (once a second). I see in your code that an initial time sync is done (sendTimeSyncMessage(50);), but I don't understand what this 50 value is for. Also, in sendTimeSyncMessage it looks to me like you send a constant 2 s latency. So, rereading the doc, I believe this value is actually ignored by the server, which only cares about the t_client-sent value. Is that right? The server only sends a Time response to the client with latency = t_server-recv - t_client-sent, right?

Sorry for asking so many questions... I'll be happy to provide a PR for this doc once I do understand all this a bit better.

David

douardda avatar Jan 28 '21 00:01 douardda

  • I've added some information about codec headers in binary_protocol.md. It's basically whatever comes out of the encoder and is fed one-to-one into the decoder.
  • The time sync algorithm is also described there. The network latency is just that: the value that ping reports. It's eliminated by the assumption that it's symmetric in uplink and downlink.
  • The 50 is just for the initial time sync: the (quite noisy) latency is measured 50 times to have a solid median latency right from the start. It's not part of the protocol, but an implementation detail (see the sketch below).
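
In rough C++, the idea looks like this; a minimal sketch with illustrative names, not the actual Snapclient code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TimeSyncEstimator
{
    static constexpr std::size_t kMaxSamples = 50;
    std::vector<int64_t> deltas; // recent clock-delta samples (microseconds)

    // c2s = t_server-recv - t_client-sent  (= clock offset + uplink latency)
    // s2c = t_client-recv - t_server-sent  (= -clock offset + downlink latency)
    void addMeasurement(int64_t c2s, int64_t s2c)
    {
        // Assuming uplink and downlink latency are equal, the latency cancels
        // out and what remains is the clock delta d = t_client - t_server:
        deltas.push_back((s2c - c2s) / 2);
        if (deltas.size() > kMaxSamples)
            deltas.erase(deltas.begin());
    }

    // Median of the collected samples: robust against network jitter spikes.
    int64_t median() const
    {
        if (deltas.empty())
            return 0;
        std::vector<int64_t> sorted(deltas);
        std::sort(sorted.begin(), sorted.end());
        return sorted[sorted.size() / 2];
    }
};
```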

badaix avatar Feb 19 '21 09:02 badaix

@badaix thanks for the additions for codec headers.

For the time sync algorithm, I think what I still miss is how the client is supposed to know at which timestamp each chunk is expected to be played in order to be in sync with the other clients. The wire chunk comes with a timestamp described as "the timestamp when this part of the stream was recorded", so with the time sync algorithm I can deduce the equivalent timestamp in the client's time base, but how is the client supposed to know when this particular timestamp is expected to be played?

The Server Settings message payload could also benefit from a bit of clarification. The example in the doc lists a "bufferMs" value which I guess corresponds to the "buffer" option of the snapserver config file, but unless I missed it, this option is not described either. What does this buffer value mean exactly? How is a client expected to react to it? The same goes for the "latency" value of the Server Settings message (at least in the example given in the doc).

My understanding is that the time delay corresponding to the bufferMs value is the delay after which a wire chunk is expected to be played by the client: a wire chunk with timestamp ts1 (in the server's time base) is expected to be played by the client at ts1 + bufferMs, using the time sync algorithm to translate this timestamp into the client's time base. Is this correct?
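
To make that concrete, here is the arithmetic I have in mind, with made-up numbers (d being the client-minus-server clock delta obtained from the time sync):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Made-up numbers, just to illustrate my understanding:
    int64_t ts1 = 10000;     // chunk record timestamp, server time base (ms)
    int64_t bufferMs = 1000; // the "buffer" option from snapserver.conf
    int64_t d = 250;         // clock delta: t_client = t_server + d

    // The chunk should be audible at ts1 + bufferMs = 11000 ms on the server
    // clock, i.e. at 11250 ms on the client clock:
    std::printf("play-out at %lld ms (client time base)\n",
                static_cast<long long>(ts1 + bufferMs + d));
    return 0;
}
```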

douardda avatar Feb 22 '21 17:02 douardda

This is also my understanding. How that is done is an implementation detail. In the Snapcast client there is an estimate of the delay from when PCM data is written to the audio DMA buffer, and an adjustable buffer to time the synced playback. In my implementation I isolate the front end that parses the Snapcast messages from the back end that keeps the audio chunks in sync. To do that I have to pass the audio chunk timestamp along through my audio buffer. The concept works very well and is less sensitive to network jitter.
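
Roughly, the hand-off looks like this; a simplified sketch, not my actual code:

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// A decoded chunk that carries its record timestamp from the front end
// (message parsing/decoding) all the way to the back end (playback).
struct TimedChunk
{
    int64_t server_ts_ms;     // record timestamp from the wire chunk header
    std::vector<int16_t> pcm; // decoded samples
};

// Back end: return the head chunk once its play-out time has been reached,
// where d is the clock delta (t_client = t_server + d) and bufferMs the
// server-announced end-to-end buffer.
TimedChunk* nextDueChunk(std::deque<TimedChunk>& buf, int64_t now_client_ms,
                         int64_t d, int64_t bufferMs)
{
    if (buf.empty())
        return nullptr;
    int64_t playout_ms = buf.front().server_ts_ms + d + bufferMs;
    return (now_client_ms >= playout_ms) ? &buf.front() : nullptr;
}
```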
/Jørgen

jorgenkraghjakobsen avatar Feb 22 '21 17:02 jorgenkraghjakobsen

Yes, ServerSettings::bufferMs corresponds to [stream] buffer in snapserver.conf, which is the overall latency from capture to being audible. The server captures chunks with a duration of [stream] chunk_ms and tags each chunk with the local server time t_s (the time base doesn't matter; Snapserver was using the system time and eventually switched to a monotonic clock, which is a strictly monotonically increasing time since the last boot, and thus robust against NTP or other system time corrections). Depending on the codec the chunk duration might change; in this case the encoded chunk is still tagged with the record timestamp.

The client continuously sends time sync messages to get the delta d between server time t_s and client time t_c, so that a server timestamp can be converted into client time as t_c = t_s + d. Snapclient sends a sync message every second and uses the median of the last 50 time deltas as delta d; the median is used to eliminate the network jitter. So a received and decoded chunk with timestamp t_s1 must be audible at t_s1 + bufferMs, which is t_s1 + d + bufferMs in the client time domain.

Snapclient uses the same monotonic clock as Snapserver, while Snapweb uses the timestamp from the WebAudio API, because it's the most precise clock available there. You might also have a look into Snapweb for an example of a single-file client implementation (~1000 lines of code).

There is also ServerSettings::latency, which is a device-specific latency configured by the user. This latency is subtracted from the play-out time on the client, i.e. from the bufferMs, yielding t_s1 + d + bufferMs - latency as the client-side play-out timestamp.
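
Condensed into a single formula; a sketch with illustrative names, not the actual Snapclient code:

```cpp
#include <cstdint>

// ts1:       chunk record timestamp, server time base (ms)
// d:         median clock delta from the time sync, t_client = t_server + d
// bufferMs:  ServerSettings::bufferMs, the capture-to-audible latency
// latencyMs: ServerSettings::latency, the per-client user-configured offset
int64_t playoutClientMs(int64_t ts1, int64_t d, int64_t bufferMs, int64_t latencyMs)
{
    // The chunk must be audible at ts1 + bufferMs - latencyMs on the server
    // clock, which is this instant on the client clock:
    return ts1 + d + bufferMs - latencyMs;
}
```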

Edit: this is the whole magic behind Snapcast. The quality of the sync depends on the accuracy of the estimated DAC latency, i.e. from writing into the DAC's ring buffer until it's audible. The closer you are to the hardware, the better the estimate should be. You can, for example, always ask ALSA for the current latency, which works quite well for different DACs, while with WebAudio you can only tell the API when the chunk should be played out, resulting in a less accurate sync. This can be tweaked with the per-client ServerSettings::latency.
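
On Linux, for example, the current ALSA latency can be queried with snd_pcm_delay(); a minimal sketch:

```cpp
#include <alsa/asoundlib.h>

// Returns the current play-out latency in milliseconds, or -1 on error.
// 'pcm' is an open, running ALSA playback handle; 'rate' its sample rate in Hz.
long currentDacLatencyMs(snd_pcm_t* pcm, unsigned int rate)
{
    // snd_pcm_delay() reports the number of frames between what the
    // application has written and what is currently audible.
    snd_pcm_sframes_t frames = 0;
    if (snd_pcm_delay(pcm, &frames) < 0 || frames < 0)
        return -1;
    return static_cast<long>(frames * 1000 / rate);
}
```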

PPS: I might change the current proprietary opus header to the Ogg Opus file header sometime.

badaix avatar Feb 22 '21 20:02 badaix

@badaix thanks a lot

douardda avatar Feb 23 '21 17:02 douardda