webrtc-stats
End-to-end delay metrics
End-to-end delay refers to the time between the capture of a video frame or audio sample and the playout of that frame or sample at another endpoint. This includes one-way delay estimation as well as sender- and receiver-side buffering. roundTripTime/2 is a crucial part of the estimation, but it is not sufficient to obtain E2E delay because it only accounts for the last hop, and there could be servers in between the sender and the receiver. Fortunately, RTP timestamps and RTCP Sender Reports mapping RTP to NTP are all the puzzle pieces needed to solve this problem, regardless of the number of hops. This assumes that relay servers are not giving garbage RTP-to-NTP mappings to the receiver.
This was discussed years ago but was not resolved. It is important that the spec-compliant getStats() provide an alternative to Chrome's legacy callback-based getStats() API.
How to calculate E2E Delay
For now, let's only focus on a Sender and a Receiver. We have...
- RTP packets with RTP timestamps: We want to know "how long ago was this captured?". The RTP timestamp represents the capture time, but the RTP timestamp has an arbitrary offset and a clock rate defined by the codec.
- RTCP packets giving us RTT measurements: The RTT/2 is used to estimate the one-way delay from the Sender.
- RTCP packets giving us the offset that allows us to convert RTP timestamps to Sender NTP time. Extrapolating by the time that has passed since the packet was ready to be played out gives us estimatedPlayoutTimestamp.
- RTCP packets giving us the Sender NTP time that the RTCP packet was sent.
Calculations:
The clock difference between the Sender NTP time and the Receiver NTP time is estimated at the Receiver by looking at the Receiver NTP time when the RTCP packet is received, subtracting the Sender NTP timestamp to get the difference, and then subtracting RTT/2 to account for the time that passed between sending and receiving. To avoid jittery values, a smoothed RTT value should be used based on multiple RTT samples. When receiving an RTCP Sender Report:
estimatedNtpDelta = reportReceivedInReceiverNtp - reportSentInSenderNtp - smoothedRtt/2
When calculating estimatedPlayoutTimestamp we should also calculate e2eDelay:
playoutTimeInReceiverNtp = current time according to local NTP clock
estimatedPlayoutTimestamp = calculate according to spec
estimatedPlayoutTimestampInReceiverNtp = estimatedPlayoutTimestamp + estimatedNtpDelta
e2eDelay = playoutTimeInReceiverNtp - estimatedPlayoutTimestampInReceiverNtp
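For concreteness, here is a minimal TypeScript sketch of the calculation above. The SenderReport shape, the helper names, and the use of milliseconds are illustrative assumptions, not anything defined by the spec.

```typescript
// Minimal sketch of the calculation above (names and units are illustrative).

interface SenderReport {
  rtpTimestamp: number;   // RTP timestamp carried in the SR
  ntpTimestampMs: number; // sender NTP time of the SR, in ms
}

// Updated whenever an RTCP Sender Report is received.
function estimateNtpDeltaMs(
  reportReceivedInReceiverNtpMs: number,
  reportSentInSenderNtpMs: number,
  smoothedRttMs: number
): number {
  return reportReceivedInReceiverNtpMs - reportSentInSenderNtpMs - smoothedRttMs / 2;
}

// estimatedPlayoutTimestamp: map the played-out RTP timestamp to sender NTP
// using the latest SR mapping, then extrapolate by the time elapsed since
// playout. (Ignores RTP timestamp wrap-around for brevity.)
function estimatePlayoutTimestampMs(
  sr: SenderReport,
  playedOutRtpTimestamp: number,
  clockRate: number,
  msSincePlayout: number
): number {
  return sr.ntpTimestampMs +
    ((playedOutRtpTimestamp - sr.rtpTimestamp) / clockRate) * 1000 +
    msSincePlayout;
}

function e2eDelayMs(
  playoutTimeInReceiverNtpMs: number,
  estimatedPlayoutTimestampMs: number,
  estimatedNtpDeltaMs: number
): number {
  // Convert the sender-clock playout timestamp into the receiver's clock.
  const playoutInReceiverNtpMs = estimatedPlayoutTimestampMs + estimatedNtpDeltaMs;
  return playoutTimeInReceiverNtpMs - playoutInReceiverNtpMs;
}
```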
But what if there is a relay Server between the Sender and Receiver? In this case, the Sender is actually a Server and the "Sender NTP timestamps" are actually the Server NTP timestamps.
The Server is sending both RTP packets (relayed) and RTCP packets, including Server NTP timestamps and how to map the RTP timestamps to Server NTP time.
It is thus the Server's responsibility to ensure that RTP timestamps can be mapped to the correct NTP timestamp. This requires the Server to provide RTP->NTP mappings that account for the difference between the original Sender's NTP clock and the Server's NTP clock, including taking Sender-Server one-way delay estimates into account. The timestamp is converted from the Sender clock to the Server clock, and the Receiver does not have to care whether there was a server in between or not.
Since the Server bakes in its own delay estimates into the timestamp rewrite, the resulting e2eDelay will be for the entire trip - not just the RTT/2 of the last hop.
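A rough sketch of what that server-side rewrite could look like (the mapping struct and the clock-delta estimate are assumptions for illustration, not part of any spec):

```typescript
// Hypothetical server-side adjustment of the RTP->NTP mapping the relay
// advertises in its own Sender Reports, so receivers see capture time
// expressed in the Server's clock.

interface RtpNtpMapping {
  rtpTimestamp: number;
  ntpTimestampMs: number;
}

function rewriteMappingForRelay(
  senderMapping: RtpNtpMapping,        // mapping reported by the original Sender
  senderToServerClockDeltaMs: number   // estimated Server NTP minus Sender NTP
): RtpNtpMapping {
  // The Sender->Server one-way delay is already baked into the clock-delta
  // estimate (e.g. via smoothedRtt/2 on the first hop), so the Receiver's
  // e2eDelay ends up covering the whole Sender->Server->Receiver path.
  return {
    rtpTimestamp: senderMapping.rtpTimestamp,
    ntpTimestampMs: senderMapping.ntpTimestampMs + senderToServerClockDeltaMs,
  };
}
```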
Note: If the Server is relaying contributing sources, the RTP timestamps no longer meaningfully map to capture time because they need to keep incrementing even if the sources change, and the RTCP packets mapping RTP timestamps to NTP timestamps are infrequent. In this case, the estimatedPlayoutTimestamp would be unreliable, and thus so would the e2eDelay estimation.
Proposal
Add RTCInboundRtpStreamStats.estimatedEndToEndDelay, defined according to the above calculations of e2eDelay.
This proposal does not touch on how to smooth the RTT values, but leaves that up to the implementation.
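If the metric were adopted, an application could poll it like any other inbound-rtp stat; estimatedEndToEndDelay below is the proposed name, not a shipped field:

```typescript
// Sketch of polling the proposed metric; the field itself is hypothetical.
async function logEndToEndDelay(pc: RTCPeerConnection): Promise<void> {
  const report = await pc.getStats();
  report.forEach((stats) => {
    if (stats.type === 'inbound-rtp' && 'estimatedEndToEndDelay' in stats) {
      // Presumably in seconds, like other delay-related stats.
      console.log(`${stats.kind} e2e delay: ${stats.estimatedEndToEndDelay}`);
    }
  });
}
```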
@alvestrand @vr000m
Edited the description with a correction: It is not the RTP timestamp that is rewritten by the server to account for the Sender->Server delay - it is the RTP->NTP mapping that adjusts for that difference. So this metric works as intended. However, when using contributing sources, we no longer have meaningful mappings between RTP timestamp and capture timestamp. In this case you either wouldn't have the RTCP information to do the mapping or the mapping you did have could become obsolete when the CSRC changes and you haven't gotten a new mapping.
I see several issues with this stat (although as I said before, the more stats the better).
On congested paths, rtt/2 is not a good proxy for one-way delay, as you can have very asymmetric scenarios, especially when competing against TCP-based flows.
For the server case, I could implement the mapping of the RTP timestamp to the original sender NTP timestamp on my server, but I don't feel anyone will compensate the timestamps to account for the sender rtt/2, which would render this stat unusable for the relay-server case.
How about removing the rtt from the stat and exposing playoutTimeInReceiverNtp and estimatedNtpDelta?
playoutTimeInReceiverNtp = current time according to local NTP clock
estimatedNtpDelta = reportReceivedInReceiverNtp - reportSentInSenderNtp;
estimatedPlayoutDelta = playoutTimeInReceiverNtp - estimatedPlayoutTimestamp - estimatedNtpDelta
If anyone wants to calculate the e2eDelay by adding smoothedRtt/2, it can be done in JS with the already-known values.
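A sketch of that JS calculation, assuming the two proposed metrics existed on the inbound-rtp stats (playoutTimeInReceiverNtp and estimatedNtpDelta are hypothetical and assumed to be in milliseconds here; estimatedPlayoutTimestamp and roundTripTime already exist in the spec):

```typescript
async function estimateE2eDelayMs(pc: RTCPeerConnection): Promise<number | undefined> {
  const report = await pc.getStats();
  let inbound: any;
  let remoteInbound: any;
  report.forEach((stats) => {
    if (stats.type === 'inbound-rtp' && stats.kind === 'video') inbound = stats;
    if (stats.type === 'remote-inbound-rtp' && stats.kind === 'video') remoteInbound = stats;
  });
  if (!inbound || !remoteInbound) return undefined;

  const estimatedPlayoutDelta =
    inbound.playoutTimeInReceiverNtp -   // hypothetical metric
    inbound.estimatedPlayoutTimestamp -
    inbound.estimatedNtpDelta;           // hypothetical metric
  // Add half the RTT (ideally smoothed over several samples) for the last hop.
  // roundTripTime is in seconds, so convert to ms.
  return estimatedPlayoutDelta + (remoteInbound.roundTripTime * 1000) / 2;
}
```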
I think JS can calculate the "estimatedNtpDelta" using RTCRemoteInboundRtpStreamStats.roundTripTime (or totalRoundTripTime/roundTripTimeMeasurements), RTCRemoteOutboundRtpStreamStats.remoteTimestamp and RTCRemoteOutboundRtpStreamStats.timestamp. Chrome would need new metrics and some bug fixes for this to fly.
playoutTimeInReceiverNtp should be RTCInboundRtpStreamStats.timestamp if getStats uses the NTP clock. (Chrome may have the wrong offset here in the current implementation?)
estimatedPlayoutTimestamp is the timestamp in the sender's NTP clock, though it assumes RTCP with an RTP->NTP mapping, which wouldn't be reliable if the server uses contributing sources. But putting the NTP delta and the capture timestamp in separate metrics is good, because then we can use the same math if we get a better capture timestamp in the future through header extensions.
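For illustration, a sketch of deriving the delta from those existing stats (both timestamps are in milliseconds; this assumes remote-outbound-rtp's timestamp reflects when the report arrived locally, which is discussed further below):

```typescript
// Derive an NTP clock delta from existing stats, per the comment above.
// roundTripTime fields are in seconds, hence the * 1000.
function deriveNtpDeltaMs(
  remoteOutbound: any,  // RTCRemoteOutboundRtpStreamStats
  remoteInbound: any    // RTCRemoteInboundRtpStreamStats
): number {
  const avgRttMs =
    (remoteInbound.totalRoundTripTime / remoteInbound.roundTripTimeMeasurements) * 1000;
  return remoteOutbound.timestamp - remoteOutbound.remoteTimestamp - avgRttMs / 2;
}
```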
> playoutTimeInReceiverNtp should be RTCInboundRtpStreamStats.timestamp
I thought about that too, but isn't it the timestamp of when the stat object is created, which may not be exactly the same as when the estimatedPlayoutTimestamp is calculated?
Hmm yeah I think you're right, a separate metric makes more sense to make it explicit and not force implementations to have everything match to the millisecond.
The same would happen for RTCRemoteOutboundRtpStreamStats.remoteTimestamp and RTCRemoteOutboundRtpStreamStats.timestamp: the timestamp would represent the time when the stat object is created and not the time when the report was received. So it is probably good to create a localTimestamp to be able to calculate the delta.
The RTCStats.timestamp is, generally speaking, the "timestamp associated with [the] stats object". This is too vague. But it does have a more specific definition for the RTCRemote[...] stats:
> For statistics that came from a remote source (e.g., from received RTCP packets), timestamp represents the time at which the information arrived at the local endpoint.
So we can use RTCRemoteOutboundRtpStreamStats.timestamp but not RTCInboundRtpStreamStats.timestamp without making some leaps of faith.
> RTCInboundRtpStreamStats.timestamp without making some leaps of faith.
But we wouldn't need the inbound stats timestamp for anything, right? At most, the final end-to-end delay would be calculated with a smoothed RTT (or a min value over a short time window), so we don't need an exact timestamp for that.
RTCInboundRtpStreamStats.timestamp would only be useful under the assumption that it is the same time that estimatedPlayoutTimestamp was calculated - i.e. the same as the relevant playout timestamp in receiver NTP - but it's a bit vague right now. If we add an explicit metric for this (playoutTimeInReceiverNtp) then no, we don't need this timestamp for anything.
> If we add an explicit metric for this (playoutTimeInReceiverNtp) then no, we don't need this timestamp for anything.
Yes, I agree. I thought we were speaking about a different usage for that timestamp.
Note that NTP times are subject to clock slew and adjustments by NTP, to error due to NTP misconfigurations or peering issues, and to time drift or other issues when comparing values across systems.
The spec for performance.now() (https://developer.mozilla.org/en-US/docs/Web/API/DOMHighResTimeStamp) says not to rely upon Date.now() or NTP for timing durations, as they can report negative deltas. The same would seem to be true of anything in WebRTC relying on NTP. Ideally this would use a single monotonically increasing clock and have the ability to report a round-trip time that allows for placing an upper bound on the delay.
The sender could number the frames and keep a mapping of frame number to timestamp, or use timestamps (from a monotonically increasing single system clock) to number the frames. The receiver could echo these back to the sender, which could calculate delay by subtracting the received value from the current monotonic time, and then send the RTT in a third hop. Values would lag behind by half the RTT on a symmetric network, which can be mitigated by checking for stalls on the receiver end and factoring the length of the stall into the metric that is ultimately surfaced.
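A rough sketch of that echo scheme (entirely hypothetical; WebRTC does not expose such an API directly):

```typescript
// The sender stamps frames with a monotonic clock, the receiver echoes the
// frame numbers back, and the sender derives a per-frame round-trip bound.

const sentFrames = new Map<number, number>(); // frameNumber -> monotonic send time (ms)
let nextFrameNumber = 0;

function onFrameSent(): number {
  const frameNumber = nextFrameNumber++;
  sentFrames.set(frameNumber, performance.now()); // monotonic clock on the sender
  return frameNumber; // carried alongside the frame, e.g. in a header extension
}

function onEchoReceived(frameNumber: number): number | undefined {
  const sentAt = sentFrames.get(frameNumber);
  if (sentAt === undefined) return undefined;
  sentFrames.delete(frameNumber);
  // Sender-side round trip for this frame; on a symmetric path the one-way
  // media delay is at most this value, lagging by roughly half the RTT.
  return performance.now() - sentAt;
}
```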
Reviewing what we know presently:
- The sender and the receiver do not have synchronized clocks.
- DOMHighResTimeStamp just makes sure that the local clock exposed to the application is monotonically increasing.
- To calculate RTT we use RTCP RR, which has the sender's sent TS and the receiver's processing delay (DLSR): Current RTT = RTCP RR reception time - DLSR - sender sent TS.
If we had synchronized clocks, we could add a receiver sent TS, which would give us:
A) OWD upstream = receiver sent TS - DLSR - sender sent TS.
B) OWD downstream = RTCP RR reception time - receiver sent TS.
Caveat: this would only be available if the extension is implemented in RTCP, and it depends on how often the RTCP extension is sent. Modulo the issue with clock sync.
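As a sketch, the formulas above expressed in code, under the stated (and unrealistic) assumption of synchronized clocks plus a hypothetical receiver-sent-TS RTCP extension:

```typescript
// All times in milliseconds, all on a common (synchronized) clock.
interface RrTimestamps {
  senderSentTs: number;     // sender clock when the SR was sent (LSR)
  dlsrMs: number;           // receiver processing delay (DLSR)
  receiverSentTs: number;   // hypothetical: receiver clock when the RR was sent
  rrReceptionTime: number;  // sender clock when the RR was received
}

function rttMs(t: RrTimestamps): number {
  return t.rrReceptionTime - t.dlsrMs - t.senderSentTs;
}

function owdUpstreamMs(t: RrTimestamps): number {
  return t.receiverSentTs - t.dlsrMs - t.senderSentTs;
}

function owdDownstreamMs(t: RrTimestamps): number {
  return t.rrReceptionTime - t.receiverSentTs;
}
```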
> To calculate RTT we use RTCP RR,
I would say the above is incorrect (if you want the actual delay of the media), because in creating local simulations & models I found that the RTCP RTT corresponds to network packets, not frames of video. I found that it is possible for the video delay to be around 5,000% higher than the network packet delay, for example when network conditions change there is a disproportionate amount of delay in the application layer vs the network layer.
I first read this paper / research project, which made me realize the network RTT may not correlate with the video delay, specifically at times when the network conditions have changed recently:
- https://snr.stanford.edu/salsify/
- https://www.usenix.org/system/files/conference/nsdi18/nsdi18-fouladi.pdf
In order to assess the delay on the audio & video, it should be done separately, in my opinion. Each frame of video or discrete chunk of audio packets should be assigned a monotonically increasing timestamp on the sender, and the RTT should be calculated against the same clock (not only are the clocks not synchronized, NTP time on a single clock isn't even guaranteed to give accurate durations when sampled over time due to clock slew & "adjust time" system calls).
> If we had synchronized clocks
In my opinion, this is not something that can be safely assumed or easily implemented. One option, yes, is to detect the clock drift using the 4-timestamp approach (https://en.wikipedia.org/wiki/Network_Time_Protocol#Clock_synchronization_algorithm) and try to detect when the clocks definitely are not in sync, but this is a slippery slope of re-inventing NTP, which is a decades-old system that doesn't need to be re-invented IMO.
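For reference, the 4-timestamp calculation from that algorithm looks like this:

```typescript
// Classic NTP-style offset/delay estimation from four timestamps.
// t0: request sent (client clock), t1: request received (server clock),
// t2: response sent (server clock), t3: response received (client clock).
function ntpOffsetAndDelay(t0: number, t1: number, t2: number, t3: number) {
  const offset = ((t1 - t0) + (t2 - t3)) / 2; // estimated server-minus-client clock offset
  const delay = (t3 - t0) - (t2 - t1);        // round-trip network delay
  return { offset, delay };
}
```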
IMO the easiest thing to do is to give an RTT, which gives an upper bound. Ideally it should be the delay of the media playout, not the delay of the network packets, which would be rather useless IMO.
For example, suppose you were to use an estimate that systematically understated the delay: there are between 800 & 2000 robotics companies operating with telepresence, some for remote medical procedures, some for controlling driverless cars. Many people will understandably assume the delay surfaced is accurate, and if it is not accurate that can cause a safety issue. By accurate, I would say it never understates the delay of the actual media playback. If your use case is overlaying a sprite of a "silly hat" on a video conference call, I suppose this all may sound like yak shaving. I just worry that people may misunderstand the tradeoffs, and therefore would push for "safe" tradeoffs to be made.
FYI RTP header extensions were added to make this stuff work outside of simply polling getStats: https://w3c.github.io/webrtc-extensions/#rtcrtpcontributingsource-dictionary
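For example, a receiver could read the capture-time fields added there (this assumes the abs-capture-time header extension is negotiated and that the browser populates these fields; availability and clock semantics are browser-dependent):

```typescript
// Sketch of reading the capture-time fields webrtc-extensions adds to
// RTCRtpContributingSource / RTCRtpSynchronizationSource.
function logCaptureTimestamps(receiver: RTCRtpReceiver): void {
  for (const source of receiver.getSynchronizationSources() as any[]) {
    // captureTimestamp: capture time on the sender's clock (ms);
    // senderCaptureTimeOffset: estimated sender-vs-capturer clock offset (ms).
    console.log(source.source, source.timestamp, source.captureTimestamp,
                source.senderCaptureTimeOffset);
  }
}
```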