snapcast
Better documentation of the protocol
Hi, I know you've recently made efforts to write some documentation for the protocol used in Snapcast, and this is very valuable. But there are aspects I still miss; I am currently trying to write an ESP32 Snapcast client using the ESP-ADF framework, partially reusing the work from @jorgenkraghjakobsen in https://github.com/jorgenkraghjakobsen/snapclient (using an ESP32-S2-Saola board + a PCM5142 DAC on a small breakout board made by Polyvection).
I've finally reached the point where it can play sound from the Snapcast stream, but only using the PCM or FLAC codecs; I have never been able to make the Opus or Ogg codecs work so far.
My questions on the protocol are:
- What is the format of the payload of the Codec Header message? Thanks to the work of @jorgenkraghjakobsen I see what it's supposed to be for the Opus codec, but not for FLAC, Ogg or raw PCM; I haven't yet figured it out by digging through your source code.
- I don't really understand the time synchronization protocol. (First of all, it took me a while to understand that the timestamps sent by the server are not relative to the epoch but count the time since boot, so the client doesn't need an accurate date/time from e.g. NTP.) How is the network latency estimated? For example, in the formula
latency_c2s = t_server-recv - t_client-sent + t_network-latency
where does the network latency come from, and how is it estimated? The doc then says "Calculates the time diff between server and client as [...]", but I don't understand how to use this value afterwards to adjust my player's speed to resync with the server (note that for now I have no clue how to adjust the playing speed on the ESP32, but that's another story).
- I'm also not sure what value to put in the `latency` field of the Time message I send (once a second). I see in your code that an initial time sync is done (`sendTimeSyncMessage(50);`), but I don't understand what this 50 value is for. Also, in `sendTimeSyncMessage` it looks to me like you send a constant 2 s latency. So, rereading the doc, I believe this value is actually ignored by the server, which only cares about the `t_client-sent` value; is this right? The server only sends a Time response to the client with `latency = t_server-recv - t_client-sent`, right?
Sorry for asking so many questions... I'll be happy to provide a PR for this doc once I do understand all this a bit better.
David
- I've added some information about codec headers in binary_protocol.md. It's basically whatever comes out of the encoder and is fed one-to-one into the decoder.
- The time sync algorithm is also described there. The network latency is the plain transit time on the wire, the value that ping reports. It's eliminated by the assumption that it's symmetric in uplink and downlink.
- The 50 just triggers an initial burst of time syncs: the (quite noisy) latency is measured 50 times, so the client has a solid median latency right from the start. It's not part of the protocol, but an implementation detail.
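As a hedged sketch of the sync math described above (all names and the microsecond units are mine, not Snapcast's): the four timestamps of one Time request/response round trip let the symmetric network latency cancel out, and a median over the last 50 deltas absorbs the jitter:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative timestamps of one Time round trip, in microseconds.
struct TimeSync {
    int64_t t_client_sent;  // client clock when the Time request left
    int64_t t_server_recv;  // server clock when the request arrived
    int64_t t_server_sent;  // server clock when the reply left
    int64_t t_client_recv;  // client clock when the reply arrived
};

// With d = t_server - t_client and the unknown one-way latencies:
//   c2s = t_server_recv - t_client_sent = d + uplink_latency
//   s2c = t_client_recv - t_server_sent = downlink_latency - d
// Assuming uplink == downlink, the latency cancels:
//   (c2s - s2c) / 2 = d
inline int64_t clock_delta(const TimeSync& ts) {
    const int64_t c2s = ts.t_server_recv - ts.t_client_sent;
    const int64_t s2c = ts.t_client_recv - ts.t_server_sent;
    return (c2s - s2c) / 2;
}

// Median over the last `window` deltas: a robust estimate of d that a
// single jittery round trip cannot throw off (hence the initial burst
// of 50 measurements).
class DeltaFilter {
public:
    explicit DeltaFilter(size_t window = 50) : window_(window) {}

    void add(int64_t delta) {
        deltas_.push_back(delta);
        if (deltas_.size() > window_)
            deltas_.erase(deltas_.begin());
    }

    int64_t median() const {  // lower median for even counts
        std::vector<int64_t> sorted(deltas_);
        std::sort(sorted.begin(), sorted.end());
        return sorted[(sorted.size() - 1) / 2];
    }

private:
    size_t window_;
    std::vector<int64_t> deltas_;
};
```

This is only a sketch of the technique, not Snapclient's actual code; the real client also has to handle wrap-around and message framing.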
@badaix thanks for the additions for codec headers.
For the time sync algorithm, I think what I still miss is how the client is supposed to know at which timestamp each chunk is expected to be played in order to stay in sync with the other clients. The wire chunk comes with a timestamp described as "the timestamp when this part of the stream was recorded", so with the time sync algorithm I can deduce the equivalent timestamp in the client's time base, but how is the client supposed to know when this particular timestamp is expected to be played?
The Server Settings message payload could also benefit from a bit of clarification. The example in the doc lists a "bufferMs" value which I guess corresponds to the `buffer` config option of the snapserver config file, but unless I missed it, this option is not described either. What does this buffer value mean exactly? How is a client expected to react to it? Same for the "latency" value of the Server Settings message (at least in the example given in the doc).
My understanding is that the delay corresponding to the bufferMs value is the delay at which a wire chunk is expected to be played by the client, so a wire chunk with timestamp ts1 (in the server's time base) is expected to be played by the client at ts1 + bufferMs (using the time sync algorithm to translate this timestamp into the client's time base). Is this correct?
This is also my understanding. How that is done is an implementation detail.
In the Snapcast client there is an estimate of the delay from when PCM data is written to the audio DMA buffer, and an adjustable buffer to time the synced playback.
In my implementation I isolate the front end that parses the Snapcast messages from the back end that keeps the audio chunks in sync. To do that I pass the audio chunk timestamp along with my audio buffer. The concept works very well and is less sensitive to network jitter.
/Jørgen
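A minimal sketch of that front-end/back-end split, assuming a simple queue of timestamped chunks (all names and the millisecond units are mine, not from either implementation): the front end enqueues decoded chunks with their server timestamps, and the back end only pops a chunk once its play-out deadline has arrived.

```cpp
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// A decoded chunk, tagged with the record timestamp from the wire chunk.
struct Chunk {
    int64_t server_ts;         // record time, server time base, ms
    std::vector<int16_t> pcm;  // decoded samples
};

class ChunkQueue {
public:
    // Front end: enqueue a parsed and decoded chunk.
    void push(Chunk c) { q_.push(std::move(c)); }

    // Back end: pop a chunk only when it is due. The deadline is
    // server_ts + buffer_ms, compared in the server time base, which
    // the client reaches as client_now + delta (t_server = t_client + d).
    bool pop_due(int64_t client_now, int64_t delta, int64_t buffer_ms,
                 Chunk& out) {
        if (q_.empty())
            return false;
        const int64_t server_now = client_now + delta;
        if (server_now < q_.front().server_ts + buffer_ms)
            return false;  // not due yet, keep buffering
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

private:
    std::queue<Chunk> q_;
};
```

Decoupling the network side from the playback side like this means a late or bursty batch of messages only changes how full the queue is, not when each chunk is handed to the DAC.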
Yes, `ServerSettings::bufferMs` corresponds to `[stream] buffer` in `snapserver.conf`, which is the overall latency from capture to being audible. The server captures chunks with a duration of `[stream] chunk_ms` and tags each chunk with the local server time `t_s` (the time base doesn't matter; Snapserver was using the system time and eventually switched to a monotonic clock, i.e. a strictly monotonically increasing time since the last boot, which is robust against NTP or other system time corrections).
Depending on the codec the chunk duration might change, in this case the encoded chunk will still be tagged with the record timestamp.
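For illustration, a monotonic timestamp of this kind can be taken in C++ with `std::chrono::steady_clock`; this is a sketch of the idea, not Snapserver's actual code:

```cpp
#include <chrono>
#include <cstdint>

// Monotonic "time since boot"-style timestamp in microseconds.
// steady_clock never jumps backwards, so NTP corrections or manual
// system time changes cannot disturb the chunk timestamps.
inline int64_t monotonic_now_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
               steady_clock::now().time_since_epoch())
        .count();
}
```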
The client continuously sends Time messages to get the delta `d` between server time `t_s` and client time `t_c`, so the server time can be calculated on the client as `t_s = t_c + d`. Snapclient sends a sync message every second and uses the median of the last 50 time deltas as `d`; the median eliminates the network jitter.
So a received and decoded chunk with timestamp `t_s1` must be audible at server time `t_s1 + bufferMs`, which is `t_s1 - d + bufferMs` in the client time domain (with `t_s = t_c + d`, a server timestamp converts to client time by subtracting `d`).
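That rule fits in a tiny helper (a sketch with illustrative names, all times in milliseconds). Note the sign: since `t_s = t_c + d`, a server timestamp converts to the client domain by subtracting `d`:

```cpp
#include <cstdint>

// Client-side play-out time for a chunk recorded at server time t_s1:
// it must be audible at server time t_s1 + buffer_ms, which is
// t_s1 - d + buffer_ms on the client clock (t_server = t_client + d).
inline int64_t client_playout_ms(int64_t t_s1, int64_t d,
                                 int64_t buffer_ms) {
    return t_s1 - d + buffer_ms;
}
```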
Snapclient uses the same monotonic clock as Snapserver, while Snapweb uses the timestamp from the WebAudio API, because that's the most precise clock available there.
You might also have a look into Snapweb for an example of a single file client implementation (~1000 lines of code).
There is also `ServerSettings::latency`, a device-specific latency configured by the user. This latency is subtracted from the play-out time on the client, i.e. from the bufferMs, yielding `t_s1 - d + bufferMs - latency` as the client-side play-out timestamp.
Edit: this is the whole magic behind Snapcast. The quality of the sync depends on the accuracy of the estimated DAC latency, i.e. from writing into the DAC's ring buffer until it's audible. The closer you are to the hardware, the better the estimate should be. You can, for example, always ask ALSA for the current latency, which works quite well for different DACs, while with WebAudio you can only tell the API when the chunk should be played out, resulting in a less accurate sync. This can be tweaked with the per-client ServerSettings::latency.
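Putting the pieces together (again a sketch under my own naming, not Snapclient's code): the chunk should be handed to the audio driver one estimated DAC latency before its audible deadline, so that it actually becomes audible at the right moment.

```cpp
#include <cstdint>

// All times in milliseconds; names are illustrative.
struct SyncParams {
    int64_t d;             // clock delta, t_server = t_client + d
    int64_t buffer_ms;     // ServerSettings::bufferMs
    int64_t user_latency;  // ServerSettings::latency (per client)
    int64_t dac_latency;   // estimated write-to-audible delay
};

// Client time at which the chunk should be written to the DAC:
// the audible deadline t_s1 - d + bufferMs - latency, pulled forward
// by the DAC's own write-to-audible delay.
inline int64_t write_time_ms(int64_t t_s1, const SyncParams& p) {
    const int64_t audible = t_s1 - p.d + p.buffer_ms - p.user_latency;
    return audible - p.dac_latency;
}
```

On a platform like ALSA the `dac_latency` term can be refreshed continuously from the driver; on WebAudio it is effectively folded into the scheduling the API does for you, which is why the per-client latency setting exists as a manual correction.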
PPS: I might change the current proprietary opus header to the Ogg Opus file header sometime.
@badaix thanks a lot