matroska-specification
Consider providing a facility for integer-fraction timescales
It's pretty well-established that Matroska's poor timebase support is one of the format's worst properties. While it supports very precise timestamps (down to the nanosecond), it's very inefficient to do so (and the resulting values still aren't exact for most input rates), so muxers tend to default to 1ms timestamps, which can lead to a variety of subtle issues, especially with high-packet-rate streams (e.g. audio) and VFR video content. Muxers can choose rates that are closer to the time base of their inputs (or the packet rate of the content), but exactly how best to do so has always been unclear, and some of the possible options would lead to either worse player behavior, or timestamp drift. I'm proposing a format addition to remedy this.
The only actual normative change I propose is this: in addition to the classic nanosecond-denominator time scale, muxers could provide 2 additional integers, serving as a numerator and denominator time base value, which is required to round to the existing nanosecond-scaled value.
This should be paired with some advice for muxer implementations on how to make use of this feature. This depends on the properties of the input. For reference, here are some examples of the error produced by rounding a variety of common time bases to the nearest nanosecond, scaled by 3 hours (a reasonable target for the duration of a film):
nearest_ns(x) = round(x * 1,000,000,000) / 1,000,000,000
ceil_ns(x) = ceil(x * 1,000,000,000) / 1,000,000,000
floor_ns(x) = floor(x * 1,000,000,000) / 1,000,000,000
nearest_error(x) = 1 - (x / nearest_ns(x))
ceil_error(x) = 1 - (x / ceil_ns(x))
floor_error(x) = 1 - (x / floor_ns(x))
nearest_error_3h(x) = nearest_error(x) * 60 * 60 * 3
ceil_error_3h(x) = ceil_error(x) * 60 * 60 * 3
floor_error_3h(x) = floor_error(x) * 60 * 60 * 3
e(x) = nearest_error_3h(1 / x)
ce(x) = ceil_error_3h(1 / x)
fe(x) = floor_error_3h(1 / x)
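The definitions above can be evaluated exactly with rational arithmetic; here is a minimal Python sketch (function names mirror the definitions, and only the nearest-rounding variant is shown):

```python
from fractions import Fraction

NS = 10**9  # nanoseconds per second

def nearest_ns(x):
    """Round a tick duration x (seconds, as a Fraction) to the nearest ns."""
    return Fraction(round(x * NS), NS)

def nearest_error_3h(x):
    """Relative error of the rounded tick, accumulated over 3 hours (seconds)."""
    return (1 - x / nearest_ns(x)) * 60 * 60 * 3

def e(rate):
    """Drift after 3 hours for a tick rate given in ticks per second."""
    return float(nearest_error_3h(1 / Fraction(rate)))

print(e(24))                     # ~8.64e-5 s, matching the table below
print(e(Fraction(30000, 1001)))  # NTSC 29.97
print(e(44100))                  # raw 44.1 kHz: ~0.125 s, clearly audible
```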
# Integer video frame rates
e(24) => 8.64e-5
e(25) => 0
e(30) => -0.0001
e(48) => -0.0002
e(50) => 0
e(60) => 0.0002
e(120) => -0.0004
# NTSC video frame rates
e(24/1.001) => -8.6314e-5
e(30/1.001) => 0.0001
e(48/1.001) => 0.0002
e(60/1.001) => -0.0002
e(120/1.001) => 0.0004
# TrueHD frame rates
e(44100/40) => -0.0057
e(48000/40) => -0.0043
e(88200/40) => 0.0062
e(96000/40) => 0.0086
# AAC frame rates
e(44100/960) => -0.0002
e(48000/960) => 0
e(88200/960) => 0.0003
e(96000/960) => 0
e(44100/1024) => 0.0002
e(48000/1024) => -0.0002
e(88200/1024) => -0.0003
e(96000/1024) => 0.0003
# MP3 frame rates
e(44100/1152) => 8.4375e-6
e(48000/1152) => 0
e(88200/1152) => -0.0004
e(96000/1152) => 0
# Other audio frame rates
e(44100/128) => -0.0012
e(48000/128) => 0.0013
e(88200/128) => -0.0012
e(96000/128) => -0.0027
e(44100/2880) => -7.425e-5
e(48000/2880) => 2.3981e-12
e(88200/2880) => -7.425e-5
e(96000/2880) => 2.3981e-12
# GCF of common short-first audio frame sizes
e(44100/64) => -0.0012
e(48000/64) => -0.0027
e(88200/64) => 0.0062
e(96000/64) => 0.0054
# Raw audio sample rates
e(44100) => 0.1253
e(48000) => -0.1728
e(88200) => 0.1253
e(96000) => 0.3456
fe(44100) => -0.351
ce(48000) => 0.3456
fe(88200) => -0.8273
fe(96000) => -0.6912
# MPEGTS time base
e(90000) => -0.108
ce(90000) => 0.8639
# Common multiples
e(30000) => -0.108
e(60000) => 0.216
e(120000) => -0.432
e(240000) => 0.8639
e(480000) => -1.7283
ce(30000) => 0.216
fe(60000) => -0.432
ce(120000) => 0.8639
fe(240000) => -1.7283
ce(480000) => 3.4549
As we can see, rounding common video and audio frame rates (including e.g. the least common multiple of 24 and 60 for that VFR case) produces a negligible amount of error over a reasonable duration. This means that for content where all timestamps can reasonably be expressed in integer values of those rates, there would be no significant error over common file durations, even if different streams were muxed with different time bases.
There are a few real-world time bases that would produce significant rounding error (upwards of 100ms) over the course of 3 hours when used in existing players: MPEGTS's 90000Hz, all common raw audio sample rates, and least-common-multiples between integer and NTSC video frame rates. This essentially means that mixing these rates with others would produce significant desync over a reasonable duration for static on-disk content; the same issue could occur when muxing very lengthy content (e.g. streaming).
All of these issues can be addressed in one of the following ways:
- Using a lower rate (e.g. 90,000Hz isn't usually the real content rate but instead an artifact of its previous container; expressing timestamps in samples rather than frames is usually unnecessary)
- Choosing the highest of the input rates for all streams (e.g. 48000 is a multiple of many common frame rates, including 24/1.001)
- Choosing a more precise common-multiple rate that may create a larger total drift, but does so equally for all streams (see the "Common multiples" section; 1/30000 is suitable for mixing 24fps and 30/1.001fps content alongside most common framed audio rates, while the later listed bases are suitable for increasingly large sets).
- Round some tracks' nanosecond timescales in the opposite direction, creating a larger drift, but potentially one with the same sign (and thus a closer value) as the drift in other tracks (this is probably too complex and niche to have substantial use)
- Fall back to classic rounded nanosecond-based timestamps (and don't write an integer-fraction time base at all)
- Use the extension, resulting in significant sync drift in older players that haven't implemented the change
This last option is usually unacceptable, but may be fine for files that use codecs that become available after the change is made (and thus are unavoidably non-backwards-compatible anyway).
If combined with clear advice in the spec on how muxers SHOULD (or MAY) decide on time bases for various possible input cases, I think this extension could get actual adoption in muxers and solve one of the format's longest-standing problems.
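To illustrate the common-multiple option above: for rational rates, a rate that every input divides evenly is the lcm of the numerators over the gcd of the denominators. A hypothetical helper (not part of the proposal itself):

```python
from fractions import Fraction
from math import gcd, lcm

def common_rate(rates):
    """A tick rate that every input rate divides evenly:
    lcm of numerators over gcd of denominators."""
    rates = [Fraction(r) for r in rates]
    num = lcm(*(r.numerator for r in rates))
    den = gcd(*(r.denominator for r in rates))
    return Fraction(num, den)

# e.g. 24 fps film + NTSC 30000/1001 video + 48 kHz framed audio
print(common_rate([24, Fraction(30000, 1001), 48000]))  # → 240000
```

Note that 240000 is one of the rates from the "Common multiples" section, with the corresponding ~0.86 s of 3-hour nanosecond-rounding drift shown there.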
…I just realized I'd misremembered where timescales are specified when writing this (they're on the segment, not the track). Still, the same concept applies, just requiring common-multiple rates (though the TrackTimestampScale element could be used to account for this to some extent; it's deprecated, but all existing players other than MPlayer-derived ones seem to support it).
Hi, I see this was brought up as an issue in the GitHub repository and am cross-posting to the cellar working group.
On Sep 7, 2020, at 12:02 AM, rcombs [email protected] wrote:

[original proposal quoted in full; snipped, see above]
This has been discussed on the list before though I don’t remember clear consensus on how to address this. Steve even compiled a list of discussions on this at https://mailarchive.ietf.org/arch/msg/cellar/ZpZxhG1gML9xVx_ir1Jf6_gcI8U/.
I proposed an option in https://mailarchive.ietf.org/arch/msg/cellar/mTprgjNqVbe20e6hyYxns8ZnVwY/ where one of the existing reserved bits of the Block Header (in the byte that contains the keyframe, invisible, and lacing flags) would be used as a flag for Timescale Alignment.
With this approach, new elements could be added to the track header with a numerator and denominator of a rational time scale, and if Timescale Alignment were set to true, then the nearest increment of the rational time scale would be used. Example:
Thus if the frame rate of the track header is 120000/1001, then
If Matroska timecode is 4 and Enable TimeScale Alignment is 0, then it is at 4 / (1000000000 / TimecodeScale). If Matroska timecode is 4 and Enable TimeScale Alignment is 1, then it is at 0 / 120000 (the nearest increment of the rational frame rate).
If Matroska timecode is 17 and Enable TimeScale Alignment is 0, then it is at 17 / (1000000000 / TimecodeScale). If Matroska timecode is 17 and Enable TimeScale Alignment is 1, then it is at 2002 / 120000 (the nearest increment of the rational frame rate).
If a Matroska demuxer doesn’t understand the new num/denom elements or the Alignment flag, it would simply use the existing nanosecond timestamp system.
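As I read the idea, the snapping step works like this (a sketch; the element and flag names in the proposal are still placeholders):

```python
from fractions import Fraction

def aligned_time(block_timecode, timecode_scale_ns, tick):
    """Snap a Matroska timestamp to the nearest increment of a rational
    tick duration (seconds, as a Fraction), per the Timescale Alignment idea."""
    seconds = Fraction(block_timecode * timecode_scale_ns, 10**9)
    n = round(seconds / tick)   # nearest whole number of ticks
    return n * tick

tick = Fraction(1001, 120000)   # frame duration of a 120000/1001 fps track
print(aligned_time(4, 1_000_000, tick))   # → 0 (i.e. 0/120000 s)
print(aligned_time(17, 1_000_000, tick))  # → 1001/60000 (= 2002/120000 s)
```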
In that thread there were other proposals, for example Steve discussed using a float to depict a point in time. Dave
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)
Of course all of these ideas are terrible hacks compared to just storing it in the correct way.
On Sep 8, 2020, at 11:52 AM, wm4 [email protected] wrote:
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)
That sounds interesting: to have the rounding error numerator in each block and the rounding error denominator in the track header. Perhaps a rounding error denominator could also be in the block but defaults to the one within the track header.
Of course all of these ideas are terrible hacks compared to just storing it in the correct way.
Yes, it is a challenge to fix this and maintain reverse compatibility. Dave
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)
I don't like this as storing rounding errors is imprecise as well (unless the global timestamp scaling factor is a multiple of the rounding error's denominator). I'm also quite unsure which denominator a multiplexer should choose. In order to express a rounding error precisely it must have a much higher resolution than the usual 1ms resolution of Matroska timestamps. For example, with 1001/30000 FPS content the rounding error will always be below one frame duration, therefore you'll have to make the denominator much larger.
Something else that came to mind when reading our previous discussion that Dave linked to: please keep in mind that any solution that sets values for the whole track in the track header will inevitably fail with mixed frame rate content or content with different interlacing, e.g. when multiplexing from an MPEG transport stream recorded from a DVB broadcast. Those bloody streams change frame rates all the time when the program changes, e.g. when transitioning to and from commercials (or just from an announcement to the movie). With our new and shiny precise timestamp calculation we'll either have to forbid such changes (unrealistic) or provide facilities to signal such changes in the form of some type of global index similar to cues. Unlike cues, though, such an index would have to be mandatory (a file without cues can be played just fine, even seeking works similar to seeking in Ogg files — meaning some kind of binary search).
File types whose timestamps are based solely on a stream's regular sampling frequency (MP4 usually is, but doesn't have to; Ogg does, too) all share those issues. MPEG TS on the other hand uses a 90 KHz-based clock which is fine for most video stuff but doesn't have enough resolution for sample-precision timing of audio tracks with high sampling frequency.
… in the correct way.
Due to what I've written above I'm pretty sure that there is no one correct way to store timestamps for a general purpose container that allows its content to change its time base in the middle.
In theory Matroska's timestamps can have sample-precision already (just make global timestamp scale small enough to match all of the tracks' time bases). The problem is with the waste of space that follows due to the bloody 16-bit integer offset in Block & SimpleBlock.
So if we're thinking about breaking compatibility anyway, why not think about a whole new SimpleBlock V2 that allows for much larger relative timestamps? Would make all existing players incompatible, though.
Another idea that only wastes space but doesn't destroy existing players' ability to play the file: adding a new child to Block called PreciseRelativeTimestamp or whatever that contains the difference between the timestamp-scaling-based timestamp & the actual, precise one, in nanoseconds. Cannot be used with SimpleBlocks, of course. Will take several bytes per BlockGroup.
I don't like this as storing rounding errors is imprecise as well (unless the global timestamp scaling factor is a multiple of the rounding error's denominator).
It can be 100% exact. It's the rounding error after all - the number that needs to be added to the "classic" ms timestamp to get the fractional timestamp.
I'm also quite unsure which denominator a multiplexer should choose. In order to express a rounding error precisely it must have a much higher resolution than the usual 1ms resolution of Matroska timestamps. For example, with 1001/30000 FPS content the rounding error will always be below one frame duration, therefore you'll have to make the denominator much larger.
It seems the denominator of the rounding error is simply the denominator of the original timestamp. E.g. in this case, the rounding error would have denominator 30000 and numerator (n*1001/30000 - int(n*1001/30000*1000)/1000) * 30000 for frame n, or something like this. This is probably wrong, just typing this out casually. Actually it probably also needs a constant numerator part (to be stored in the track header) of 1001.
Something else that came to mind when reading our previous discussion that Dave linked to: please keep in mind that any solution that sets values for the whole track in the track header will inevitably fail with mixed frame rate content or content with different interlacing, e.g. when multiplexing from an MPEG transport stream recorded from a DVB broadcast. [...]
What does Matroska do if the codec changes? Transport streams can do that, Matroska can't do that. I feel like bringing up such cases just complicates the whole discussion. You can't fix everything at the same time. But you can stall any progress by wanting to consider every possible future feature and requirement.
Besides, as was suggested in a previous post, the denominator part could be overridden per packet. This would cause some bytes of overhead in such obscure cases as mixing multiple framerates that are not known in advance.
In theory Matroska's timestamps can have sample-precision already (just make global timestamp scale small enough to match all of the tracks' time bases). The problem is with the waste of space that follows due to the bloody 16-bit integer offset in Block & SimpleBlock.
I guess you mean the fact that every packet will need its own cluster. But AFAIK that still doesn't give a way to get fractional timestamps? So, not an option.
So if we're thinking about breaking compatibility anyway, why not think about a whole new SimpleBlock V2 that allows for much larger relative timestamps? Would make all existing players incompatible, though.
Obviously not an option. If it were specified, it's likely everyone would disable this by default, except people who use Matroska in special setups where they control producer and consumer.
Another idea that only wastes space but doesn't destroy existing players' ability to play the file: adding a new child to Block called PreciseRelativeTimestamp or whatever that contains the difference between the timestamp-scaling-based timestamp & the actual, precise one, in nanoseconds. Cannot be used with SimpleBlocks, of course. Will take several bytes per BlockGroup.
I thought that was what I proposed here (except I wanted to use fractional numbers).
PS: I think obsessing about a few bytes per packet isn't useful. Having precise timestamps, even if it introduces overhead, is much more important. Nobody will discard Matroska as an option because it doesn't go to the edge of the theoretically possible for saving overhead.
What does Matroska do if the codec changes? Transport streams can do that, Matroska can't do that.
True. The difference is that having multiple time bases in the same track is something that exists & works today.
I'm really not trying to prevent progress here, and I'm not talking about each and every possible situation. I am talking about one specific situation that is in widespread use today.
What I am trying to prevent is implementing a scheme that's supposed to improve one aspect that simultaneously makes another aspect worse. Hence me talking about ways to signal a change in time base mid-stream. We'd also have to signal a precise timestamp at the point of change in time base so that the player can reconstruct the whole timeline properly without having to read blocks at each change in time base.
It seems like there are a few ways discussed to correct this:
- Express the time-base in the track so the demuxer can adjust the timestamps in the file to the closest increment of the time-base
- Express a fractional error value using a denominator in the track and numerator in the packet so the demuxer can give more precise timestamps
- Potentially allow overriding the denominator on a per-packet basis
- Express a second timestamp using a fractional time-base stored in the track
- Potentially allow just expressing the timestamp in a numerator/denominator so as to ignore/override the track's time-base
All of these would still require the current timestamp to still exist and thus would be compatible with current demuxers but newer demuxers would be able to read/derive more precise timestamps.
It seems the denominator of the rounding error is simply the denominator of the original timestamp.
Close. When I saw this first suggested, I did some math and figured out what it would be for the case of 44.1kHz AAC audio (this is what really sparked this conversation; see below). In this case the samples are 1024/44100 seconds long, with the MKA using 1ms precision on the timestamps, and so the error can be expressed as n*1024/44100 - m/1000, where m is the timestamp in the MKA and n is the packet number. To express the error exactly in integers, the denominator is lcm(1000, 44100) (your basic fractions with common denominator), which in this case is 441000. Using some quick examples:
| Packet number | MKA timestamp | Error (using 441,000 as the denominator) |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 23 | 97 |
| 2 | 46 | 194 |
| 3 | 70 | -150 |
| 354 | 8220 | -60 |
| 355 | 8243 | 37 |
Also worth noting: the duration would likely need the same treatment.
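The table above can be reproduced with exact integer arithmetic; a sketch assuming 1 ms stored timestamps and the lcm(1000, 44100) = 441000 denominator:

```python
from fractions import Fraction
from math import lcm

SR, SAMPLES = 44100, 1024       # 44.1 kHz AAC, 1024 samples per packet
DEN = lcm(1000, SR)             # 441000

def error_numerator(n):
    """Exact (true - stored) error for packet n, over the fixed denominator DEN."""
    true = Fraction(n * SAMPLES, SR)    # true time in seconds
    stored = round(true * 1000)         # the 1 ms timestamp a muxer would write
    return (true - Fraction(stored, 1000)) * DEN

for n in (0, 1, 2, 3, 354, 355):
    print(n, error_numerator(n))        # → 0, 97, 194, -150, -60, 37
```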
Aside: I sparked this conversation in an internal discussion with @rcombs about AAC 44.1kHz audio in an MKA format. I was remuxing this to MPEG-TS and the MKA had only 1ms precision timestamps. Well, a simple remux would be multiplying these timestamps by 90 to match MPEG-TS. This simple remux resulted in packets at times 0, 2070, 4140, 6300, 8370 … which gave them effective durations of 0, 2070, 2070, 2160, 2070 and this inconsistency would cause stuttering audio in Apple's HLS demuxer. So this meant that remuxing MKA -> MPEG-TS required opening a codec in lavc to get more precise durations and thus derive timestamps without error.
P.S. These imprecise timestamps were one of the more annoying things we had to deal with in Perian's MKV demuxer, and that was over a decade ago.
The reason not to introduce a SimpleBlock v2 is that any hardware/software player that doesn't know it won't be able to play the files. It can be done in Matroska v5. Such files, being unreadable by pre-v5 parsers, will also be marked as such. We might as well call it Matroska2 or something, just like WebM shares a lot with Matroska.
The practical question is whether there is a convenient way to have precise timestamps in v4 and make it work in existing players (and WebM, I know that's something they want as well).
The question about VFR (Variable Frame Rate) is not really an issue IMO. In the end you only have 1 or 2 frame rates mixed, and maybe with the same denominator. All you need is a fraction that handles both. Facebook even created a timebase that covers most common timebases for video. As long as you know the timebases you'll have to deal with before muxing you should be fine.
An important thing to note is that floating point should not be used at all (we want precision). All we have is the Matroska timebase x/1,000,000,000 s (x=TimestampScale) and the source material timebase(s) (a/44100, b/48000, c/24, d*1001/30000, etc). They are all fractions. So we should be able to find something that works with just fractions, using common denominators, fraction reduction, etc. It can get to large numbers very quickly as there are multiple tracks with different timebases (or odd fractions when a track uses VFR, see above).
What we have now is a timestamp for each Block as a fraction of TimestampScale/1,000,000,000.
What we want is a timestamp for each Block as a fraction of the source material. The difference between the two values is still a fraction. We can store this difference as a fraction. And we must also store the source material fraction.
Now we just have to do the math to find this "difference as a fraction". In particular to minimize the storage needed to do so if possible (if not, mandating BlockGroup for precise tracks is always an option). If we can fit it inside the 3 reserved bits of the SimpleBlock it would be perfect.
ISO/IEC 14496-12 "ISO base media file format" uses a "timescale" (counts per second) and "media sample durations". If timescale=30000 and media sample duration is 1001, you get NTSC fractional frame rate.
Similarly, ISO/IEC 14496-10 "Advanced Video Coding" has a clock tick defined as num_units_in_tick divided by the time_scale (see equation C-1). The presence of these in VUI is indicated by the timing_info_present_flag. For NTSC, time_scale may be 30000 and num_units_in_tick may be 1001.
Following my "pure rational numbers" approach we can say the following, for a Track sampled at the original frequency, stored in a Matroska Segment with TimestampScale:
The real timestamp for each sample S is: real(S) = S / frequency
The Matroska timestamp for the same sample is matroska(S) = S * TimestampScale / 1,000,000,000
The Cluster timestamp is just a value to add to S to get the proper value, so we can skip it for now. As we just check the rational values, the rounding introduced by divisions is not taken into account.
The difference between the real timestamp and the one we get from Matroska is:
real(S) - matroska(S)
= S / frequency - S * TimestampScale / 1,000,000,000
= (S * 1,000,000,000) / (frequency * 1,000,000,000) - (S * TimestampScale * frequency) / (frequency * 1,000,000,000)
= S * (1,000,000,000 - TimestampScale * frequency) / (frequency * 1,000,000,000)
We can already deduce a few things from this:
- The error grows linearly with the value of S.
- If TimestampScale is exactly 1,000,000,000 / frequency, there is no error.
- The bigger the rounding error of 1,000,000,000 / frequency, the more the Matroska and real timestamps will diverge.
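The divergence formula can be checked numerically with exact rationals; a small sketch:

```python
from fractions import Fraction

NS = 10**9

def divergence(S, frequency, timestamp_scale):
    """real(S) - matroska(S) = S * (1e9 - TimestampScale * frequency)
    / (frequency * 1e9), in seconds (exact), per the derivation above."""
    return Fraction(S * (NS - timestamp_scale * frequency), frequency * NS)

# 8000 Hz: 1e9 / 8000 = 125000 exactly, so zero error
print(divergence(8000 * 3600, 8000, 125000))          # → 0
# 48000 Hz: 1e9 / 48000 rounds to 20833, so the error grows linearly with S
print(float(divergence(48000 * 3600, 48000, 20833)))  # ~0.0576 s after one hour
```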
That gives some sampling frequencies where it's possible to achieve 0 error per sample:
- Audio: 8000 Hz, 16000 Hz, 32000 Hz, 50000 Hz, 64000 Hz
- Video: 25 fps, 50 fps, 100 fps
That leaves out a lot of common ones:
- Audio: 11025, 22050, 37800, 44056, 44100, 48000, 50400, 88200, 96000, 176400, 192000, 352800
- Video: 16, 24000/1001, 24, 30000/1001, 30, 60000/1001, 48, 60, 90, 100, 120000/1001, 120
The other way to reduce the error is to reduce the value of S. We already effectively reduce the value we store to a 16-bit integer, so the value is always between -32,768 and 32,767. If we were to store the error in the remaining 3 bits of a SimpleBlock, that's still 13 bits too many.
By limiting the possible values of S in a Cluster to [-4,3] (3 bits), in other words 8 frames, it is possible to store each frame with the Matroska timestamp and the error based on TimestampScale * frequency. This is also feasible because audio is usually not found as single samples, but as chunks of samples in one Frame. Sometimes all chunks have the same amount of samples, sometimes not, but each amount of samples is based on the same multiple (the worst case scenario is many unrelated chunk sizes).
For video that means at most 8 frames per Cluster; for a 29.97 fps file that's 267 ms. This is very small.
A Block has one extra free bit, so we could double these values. That's still very small IMO. And that's the case where the TimestampScale is precisely adjusted for one track. When you have 2 or more, finding a value of TimestampScale that works well with all frequencies becomes even harder.
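For reference, the Cluster-duration limit implied by restricting S to a few bits (a sketch of the hypothetical layout discussed above, not an existing format feature):

```python
from fractions import Fraction

def max_cluster_span(bits, tick):
    """Span covered by 2**bits ticks of the given tick duration (seconds)."""
    return 2**bits * tick

ntsc_tick = Fraction(1001, 30000)             # 29.97 fps frame duration
print(float(max_cluster_span(3, ntsc_tick)))  # 8 frames ≈ 0.267 s
print(float(max_cluster_span(4, ntsc_tick)))  # the extra Block bit doubles it
```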
I think the scope where it works, even with the proper muxing guidelines, is too narrow to be worth using all the reserved bits. In particular because common frequencies like 44100 Hz or 30000/1001 fps will introduce errors no matter what and will need to use this system.
There could be other clever ways to do this. We could use a bit in the Block that says the timestamp "shift" is stored after/before the Block data, but that would be incompatible with all existing readers. That would be equivalent to using a new lacing format.
Another way would be to force using a BlockGroup to have precise timing and store the "shift" in a new element. It might only need 16 bits of storage, so that would translate into 3 extra octets per BlockGroup.
It seems one aspect of this not discussed is how the rounding of the current system works and how it could be adapted. We assume that we start with the current system and try to fit the correct fraction in there. We could do it the other way around, i.e. have the fraction and use that to set the Block/Cluster timestamp value. The rounding error is then on older parsers assuming a timestamp value when in fact it's another value. But the old system is already known to be imprecise/inaccurate. It's not assumed to be sample precise. So a little more, a little less rounding error should not be a big deal.
What we cannot really do is add some information per-track to modify how the Block/SimpleBlock values are interpreted. That would break backward compatibility. For that we would need BlockV2 and SimpleBlockV2.
So we could store the TimestampScale and a fraction that is the actual fraction it's based on.
Let's see what happens for 29.97 fps video, i.e. 30000/1001 Hz. The most accurate TimestampScale is 33,366,667 (nanoseconds per frame/lace, rounded). We also store the Segment timestamp fraction as {30000, 1001}:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 33366667 ns | 33366666 ns | 1 ns |
2 | 2 | 66733334 ns | 66733333 ns | 1 ns |
3 | 3 | 100100001 ns | 100100000 ns | 1 ns |
4 | 4 | 133466668 ns | 133466666 ns | 2 ns |
5 | 5 | 166833335 ns | 166833333 ns | 2 ns |
6 | 6 | 200200002 ns | 200200000 ns | 2 ns |
7 | 7 | 233566669 ns | 233566666 ns | 3 ns |
8 | 8 | 266933336 ns | 266933333 ns | 3 ns |
9 | 9 | 300300003 ns | 300300000 ns | 3 ns |
10 | 10 | 333666670 ns | 333666666 ns | 4 ns |
.. | .. | .. | .. | .. |
65532 | 65532 | 2186584421844 ns | 2186584400000 ns | 21844 ns |
65533 | 65533 | 2186617788511 ns | 2186617766666 ns | 21845 ns |
65534 | 65534 | 2186651155178 ns | 2186651133333 ns | 21845 ns |
65535 | 65535 | 2186684521845 ns | 2186684500000 ns | 21845 ns |
The Old Parser timestamp is the timestamp older parsers would see: Block Value * TimestampScale. The Real timestamp is the one using the fraction: Block Value * 1001 / 30000.
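The table above can be reproduced with a short sketch (the helper names are mine, not part of any spec):

```python
# Compare what a legacy parser computes against the exact fractional
# timestamp for 30000/1001 fps; values match the table above.
TS = 33_366_667          # rounded TimestampScale, ns per tick
NUM, DEN = 1001, 30000   # exact frame duration, as a fraction of a second

def old_parser_ns(tick):
    # legacy interpretation: tick * TimestampScale
    return tick * TS

def real_ns(tick):
    # exact timestamp, truncated to whole nanoseconds as in the table
    return tick * NUM * 10**9 // DEN

for tick in (1, 10, 65535):
    print(tick, old_parser_ns(tick), real_ns(tick),
          old_parser_ns(tick) - real_ns(tick))
```

At tick 65535 the two interpretations differ by 21,845 ns, matching the last row of the table.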
For 44100 Hz audio we get the following, with a TimestampScale of 22,676 (nanoseconds per frame/lace, rounded).
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 22676 ns | 22675 ns | 1 ns |
2 | 2 | 45352 ns | 45351 ns | 1 ns |
3 | 3 | 68028 ns | 68027 ns | 1 ns |
4 | 4 | 90704 ns | 90702 ns | 2 ns |
5 | 5 | 113380 ns | 113378 ns | 2 ns |
6 | 6 | 136056 ns | 136054 ns | 2 ns |
7 | 7 | 158732 ns | 158730 ns | 2 ns |
8 | 8 | 181408 ns | 181405 ns | 3 ns |
9 | 9 | 204084 ns | 204081 ns | 3 ns |
10 | 10 | 226760 ns | 226757 ns | 3 ns |
.. | .. | .. | .. | .. |
1636 | 1636 | 37097936 ns | 37097505 ns | 431 ns |
1637 | 1637 | 37120612 ns | 37120181 ns | 431 ns |
1638 | 1638 | 37143288 ns | 37142857 ns | 431 ns |
1639 | 1639 | 37165964 ns | 37165532 ns | 432 ns |
1640 | 1640 | 37188640 ns | 37188208 ns | 432 ns |
.. | .. | .. | .. | .. |
65532 | 65532 | 1486003632 ns | 1485986394 ns | 17238 ns |
65533 | 65533 | 1486026308 ns | 1486009070 ns | 17238 ns |
65534 | 65534 | 1486048984 ns | 1486031746 ns | 17238 ns |
65535 | 65535 | 1486071660 ns | 1486054421 ns | 17239 ns |
The difference is less than one sample. When packed at 40 samples per frame (the shortest packing in @rcombs' example), we would then use a fraction of {40, 44100} and a TimestampScale of 907,029:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 907029 ns | 907029 ns | 0 ns |
2 | 2 | 1814058 ns | 1814058 ns | 0 ns |
3 | 3 | 2721087 ns | 2721088 ns | -1 ns |
4 | 4 | 3628116 ns | 3628117 ns | -1 ns |
5 | 5 | 4535145 ns | 4535147 ns | -2 ns |
.. | .. | .. | .. | .. |
47392 | 47392 | 42985918368 ns | 42985941043 ns | -22675 ns |
47393 | 47393 | 42986825397 ns | 42986848072 ns | -22675 ns |
47394 | 47394 | 42987732426 ns | 42987755102 ns | -22676 ns |
47395 | 47395 | 42988639455 ns | 42988662131 ns | -22676 ns |
47396 | 47396 | 42989546484 ns | 42989569160 ns | -22676 ns |
47397 | 47397 | 42990453513 ns | 42990476190 ns | -22677 ns |
.. | .. | .. | .. | .. |
65533 | 65533 | 59440331457 ns | 59440362811 ns | -31354 ns |
65534 | 65534 | 59441238486 ns | 59441269841 ns | -31355 ns |
65535 | 65535 | 59442145515 ns | 59442176870 ns | -31355 ns |
We get less than 1 sample of error with 47393 frames stored, or about 43 s worth of samples in a Cluster.
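That bound can be checked with exact rational arithmetic (a sketch; the variable names are illustrative):

```python
from fractions import Fraction

RATE = 44100
PACK = 40  # samples per frame, per @rcombs' example

tick = Fraction(PACK, RATE)            # exact tick duration, in seconds
ts = round(tick * 10**9)               # legacy TimestampScale, ns (rounded)
err_per_tick = tick * 10**9 - ts       # exact drift per tick, in ns
sample_ns = Fraction(10**9, RATE)      # duration of one sample, in ns

# number of ticks before the legacy timestamp drifts by a full sample
max_ticks = int(sample_ns / abs(err_per_tick))
print(ts, max_ticks)
```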
The worst case scenario is the highest frequency that is not easily divisible: 352800 Hz. It gives:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 2834 ns | 2834 ns | 0 ns |
2 | 2 | 5668 ns | 5668 ns | 0 ns |
3 | 3 | 8502 ns | 8503 ns | -1 ns |
4 | 4 | 11336 ns | 11337 ns | -1 ns |
5 | 5 | 14170 ns | 14172 ns | -2 ns |
.. | .. | .. | .. | .. |
6066 | 6066 | 17191044 ns | 17193877 ns | -2833 ns |
6067 | 6067 | 17193878 ns | 17196712 ns | -2834 ns |
6068 | 6068 | 17196712 ns | 17199546 ns | -2834 ns |
6069 | 6069 | 17199546 ns | 17202380 ns | -2834 ns |
.. | .. | .. | .. | .. |
65533 | 65533 | 185720522 ns | 185751133 ns | -30611 ns |
65534 | 65534 | 185723356 ns | 185753968 ns | -30612 ns |
65535 | 65535 | 185726190 ns | 185756802 ns | -30612 ns |
Here we achieve less than one sample of error when there are fewer than 6067 samples in a Cluster. This can be doubled by using signed values for the Block timestamp value: the range to get less than one sample of error becomes [-6067, 6066]. And by packing samples in groups of at least 11, we always get less than 1 sample of error. With 22 samples we get less than half a sample duration of error, which should be enough with rounding.
So with single track files we can probably achieve sample precision easily.
With mixed frequencies it becomes more complicated, for example the 29.97 fps video with the 44100 Hz audio. We have 1001/30000 and 1/44100, so the fraction to use would be 1001/reduced(30000, 44100), where reduced(A, B) is the two numbers multiplied together and divided by their Greatest Common Divisor. In this case (30000 * 44100) / 100 = 13230000. That gives a rounded TimestampScale of 75,661 ns/tick.
That gives these Blocks:
Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|
0 | 0 ns | 0 ns | 0 ns |
1 | 75661 ns | 75661 ns | 0 ns |
2 | 151322 ns | 151322 ns | 0 ns |
3 | 226983 ns | 226984 ns | -1 ns |
4 | 302644 ns | 302645 ns | -1 ns |
5 | 378305 ns | 378306 ns | -1 ns |
6 | 453966 ns | 453968 ns | -2 ns |
7 | 529627 ns | 529629 ns | -2 ns |
8 | 605288 ns | 605291 ns | -3 ns |
9 | 680949 ns | 680952 ns | -3 ns |
10 | 756610 ns | 756613 ns | -3 ns |
11 | 832271 ns | 832275 ns | -4 ns |
12 | 907932 ns | 907936 ns | -4 ns |
.. | .. | .. | .. |
441 | 33366501 ns | 33366666 ns | -165 ns |
.. | .. | .. | .. |
882 | 66733002 ns | 66733333 ns | -331 ns |
.. | .. | .. | .. |
1323 | 100099503 ns | 100100000 ns | -497 ns |
.. | .. | .. | .. |
32634 | 2469121074 ns | 2469133333 ns | -12259 ns |
For the video track we would get something like this:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 441 | 33366501 ns | 33366666 ns | -165 ns |
2 | 882 | 66733002 ns | 66733333 ns | -331 ns |
3 | 1323 | 100099503 ns | 100100000 ns | -497 ns |
.. | .. | .. | .. | .. |
74 | 32634 | 2469121074 ns | 2469133333 ns | -12259 ns |
.. | .. | .. | .. | .. |
148 | 65268 | 4938242148 ns | 4938266666 ns | -24518 ns |
We can store almost 5 s in a Cluster.
For the audio track, on the other hand, we cannot recover each sample easily:
Sample Number | Real timestamp | Block Value |
---|---|---|
0 | 0 ns | 0 |
1 | 22675 ns | ~0 |
2 | 45351 ns | ~1 |
3 | 68027 ns | ~1 |
4 | 90702 ns | ~1 |
5 | 113378 ns | ~1 |
6 | 136054 ns | ~2 |
7 | 158730 ns | ~2 |
.. | .. | .. |
300 | 6802721 ns | ~90 |
The Block Value doesn't map to an exact sample timestamp (and vice versa).
It seems that if we apply a factor of 3 we may get better results. So we could have a Segment fraction of 1001/(3 * 13230000), with a rounded TimestampScale of 25,220 ns/tick.
Sample Number | Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 25220 ns | 22675 ns | 2545 ns |
2 | 2 | 50440 ns | 45351 ns | 5089 ns |
3 | 3 | 75660 ns | 68027 ns | 7633 ns |
4 | 4 | 100880 ns | 90702 ns | 10178 ns |
5 | 5 | 126100 ns | 113378 ns | 12722 ns |
6 | 6 | 151320 ns | 136054 ns | 15266 ns |
7 | 7 | 176540 ns | 158730 ns | 17810 ns |
8 | 8 | 201760 ns | 181405 ns | 20355 ns |
9 | 9 | 226980 ns | 204081 ns | 22899 ns |
10 | 10 | 252200 ns | 226757 ns | 25443 ns |
.. | .. | .. | .. | .. |
65534 | 65534 | 1652767480 ns | 1486031746 ns | 166735734 ns |
65535 | 65535 | 1652792700 ns | 1486054421 ns | 166738279 ns |
We lose about 1 sample of precision every 10 samples, or 10%. For a full Block that's about a 166 ms shift (or rather half that when using signed 16 bits). That's a lot. Even packed at 40 samples per frame that's still about 20 ms, when such a frame is 1 ms.
If we use the full fraction {1001, 30000*44100} we cannot store more than one video frame per Cluster.
There doesn't seem to be a system where it works by storing the Block value as a real fraction value, at least when mixing "heterogeneous" frequencies. It works with single tracks or frequencies that are easily divisible. And not if we want to keep backward compatibility (Block/SimpleBlock).
A little background on this: for adaptive streaming it's important, when you switch from one "quality" (representation) to another, to switch exactly at the frame and audio you want. I don't know if they are sample exact for audio, especially as each codec (or different encoding parameters) may pack a different number of samples per frame, so the boundaries don't totally overlap. Maybe there's an offset that tells on which sample to start, or an exact clock gives the exact timestamp for each sample in each representation anyway.
Given that, the important phrase here is:
So with single track files we can probably achieve sample precision easily.
In adaptive streaming you don't (usually) use muxed tracks, so you can pick each channel independently with the best possible choice at any given time. In these conditions we can be sample precise. All we need is to tell the original clock (numerator/denominator) of the Track. A new parser would use that value with the Block timestamp value. Older parsers would not see it and would use the Block timestamp value with the global TimestampScale. As described above, the difference is minimal, as long as the TimestampScale matches the fraction.
I'll send a proposal for new elements to store this fraction and the necessary changes on how to interpret the timestamps.
The larger problem is that we want a rational number that works for all the tracks (theoretically possible) and at the same time a sensible value that will not require huge numerator values for each timestamp in a Block; we only have 16 bits there. As seen above, in most cases it doesn't work, because we have one global "clock" defining all Block (and Cluster and more) "ticks".
We could however alter the interpretation of each Block value to adjust to a better "clock" that works for that track, so that we end up with a better range of values for the numerator. And luckily we already have TrackTimestampScale! It's a float number to apply to each Block tick value to get the proper timestamp for that Block (or Track in general). It is currently marked as deprecated because its usage was limited, as it's a float, and it was supposed to allow changing timestamps without remuxing a track. But that's not convenient at all.
But just like we introduced a rational number to use instead of the TimestampScale, we can use a rational number instead of TrackTimestampScale, and store in TrackTimestampScale the rounded value of this rational number. Despite being deprecated, TrackTimestampScale is supported in at least the libavformat (FFmpeg) and VLC demuxers. It's possible it's not supported in a lot of demuxers, especially since it was marked as deprecated anyway; for example it's not supported in WebM. But that's less of a problem as they tend to add new elements when they need them.
So a Block timestamp would be
( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
The formula on the old website (and in the current RFC draft) is incorrect, as it applied the TrackTimestampScale to the Cluster tick as well. The VLC code seems to use it incorrectly (I can fix that) but libavformat seems to be correct. In both cases, adding support for sample accurate timestamps would mean fixing those as well.
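A minimal sketch of the two interpretations (illustrative function names, not spec text):

```python
# Correct formula: TrackTimestampScale applies only to the Block tick.
def block_ns_correct(block_tick, cluster_tick, tts, ts):
    return (block_tick * tts + cluster_tick) * ts

# Erroneous formula: TrackTimestampScale wrongly applied to the Cluster
# tick as well (as in the old website / RFC draft wording).
def block_ns_incorrect(block_tick, cluster_tick, tts, ts):
    return (block_tick + cluster_tick) * tts * ts

# With TrackTimestampScale == 1.0 the two agree; otherwise they diverge,
# which is why the bug went unnoticed for so long.
```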
In a new parser, TimestampScale and TrackTimestampScale would both be rational numbers.
In an old parser, TimestampScale would be the rounded nanosecond-based value and TrackTimestampScale the floating point value of the rational TrackTimestampScale. They would be less precise, but they were never meant to be precise anyway.
So let's take the previous example that didn't work: 29.97 fps video with the 44100 Hz audio. Now we can have TimestampScale * TrackTimestampScale = 1001/30000 for video and TimestampScale * TrackTimestampScale = 1/44100 for audio (or 40/44100 if samples are always packed by 40 but we don't even need that). We can represent 65536 ticks for each Track in a Cluster.
Now the critical part is the Cluster tick value. To have sample accurate values on each Block it also has to provide ticks that are sample accurate for both tracks. In this case a (rational) TimestampScale of 1/(30000 * 441) should do it. All ticks on the 1/44100 clock are represented (0, 300, 600, 900, 1200, etc.) on this clock. All ticks on the 1001/30000 clock are also represented (0, 1001 * 441, 2 * 1001 * 441, 3 * 1001 * 441, etc.) on this clock. In a 24 h movie that's 24 * 60 * 60 * 30000 * 441 ticks, which is still a small value (0x10A24668800 in hexadecimal) compared to the 64 bits of room we have for each Cluster Timestamp.
There is a slight problem though. The rounded TimestampScale would be 76 ns. Over 24 h the "old clock" would count 24 * 60 * 60 * 30000 * 441 * 76 ns, or 86,873.472 s, i.e. 24.13 h. That's a 0.548% error.
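These 24 h figures can be verified directly (a quick sketch):

```python
# Tick count of a 24 h movie on the 1/(30000*441) clock, and the drift
# a legacy parser would see with the rounded 76 ns TimestampScale.
ticks_24h = 24 * 60 * 60 * 30000 * 441
print(hex(ticks_24h))                    # tick count in hexadecimal

old_clock_ns = ticks_24h * 76            # legacy parser: 76 ns per tick
drift = old_clock_ns / 1e9 / (24 * 60 * 60) - 1
print(old_clock_ns / 1e9, drift)         # seconds seen, relative error
```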
In general the legacy system is used with a 1 ms precision, resulting in even more inaccurate values for the 33.366 ms video frame durations. So it shouldn't have any impact.
Now what is the magic formula to get the proper rational TimestampScale (TimestampNumerator and TimestampDenominator)? It looks like:
TimestampDenominator = SamplingFreq A Denominator * SamplingFreq V Denominator / GCD( SamplingFreq A Denominator, SamplingFreq V Denominator )
where the GCD() function gives the Greatest Common Divisor of A and V.
But that's not the value we used. Both 30000 and 441 are divisible by 3, so it should be 10000 * 441. That gives a legacy TimestampScale of 227 ns, which should give a smaller difference between the two systems.
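A sketch of this computation for the 44100 Hz audio / 30000-1001 fps video example (names are mine); it reproduces the corrected values:

```python
import math

A_DEN = 44100            # audio: period 1/44100 s
V_NUM, V_DEN = 1001, 30000  # video: period 1001/30000 s

# TimestampDenominator = DenA * DenV / GCD(DenA, DenV), i.e. their lcm
ts_den = A_DEN * V_DEN // math.gcd(A_DEN, V_DEN)
legacy_ts = round(10**9 / ts_den)        # rounded ns value for old parsers

tts_audio = ts_den // A_DEN              # rational TrackTimestampScale, audio
tts_video = V_NUM * ts_den // V_DEN      # rational TrackTimestampScale, video
print(ts_den, legacy_ts, tts_audio, tts_video)
```

With the actual GCD of 300 this yields 10000 * 441 = 4,410,000 directly, matching the corrected value above.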
With this value the rational TrackTimestampScale values would be 100/1 for audio and (1001 * 147)/1 for video, also stored as floating point values in the legacy field.
For audio the Block ticks would result in:
Block timestamp = ( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
Block timestamp = ( ( Block tick * 100/1 ) + Cluster tick ) * 1/(10000 * 441)
Block timestamp = ( Block tick * 100/1 ) * 1/(10000 * 441) + Cluster tick * 1/(10000 * 441)
Block timestamp = ( Block tick * 100 ) / (10000 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick / (100 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick / 44100 + Cluster tick / (10000 * 441)
For video the Block ticks would result in:
Block timestamp = ( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
Block timestamp = ( ( Block tick * 1001 * 147 ) + Cluster tick ) * 1/(10000 * 441)
Block timestamp = ( Block tick * 1001 * 147 ) * 1/(10000 * 441) + Cluster tick * 1/(10000 * 441)
Block timestamp = ( Block tick * 1001 * 147 ) / (10000 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick * 1001 / 30000 + Cluster tick / (10000 * 441)
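Both derivations can be checked with exact rational arithmetic (a sketch using Python's `fractions`; the function name is illustrative):

```python
from fractions import Fraction

TS = Fraction(1, 10000 * 441)            # rational TimestampScale

def block_ts(block_tick, tts, cluster_tick=0):
    # the corrected Block timestamp formula, in seconds
    return (block_tick * tts + cluster_tick) * TS

# audio: TrackTimestampScale = 100/1 -> one tick per 1/44100 s
assert block_ts(1, Fraction(100)) == Fraction(1, 44100)
# video: TrackTimestampScale = (1001*147)/1 -> one tick per 1001/30000 s
assert block_ts(1, Fraction(1001 * 147)) == Fraction(1001, 30000)
```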
It seems we have a system that works well for two tracks. It works just as well with more tracks, as long as the GCD of all the SamplingFreq Denominators is big enough, resulting in a rounded legacy TimestampScale that should be above 50 and MUST NOT be 1, let alone 0.
There is a small problem with the audio in the example above: we only get 65536/44100, about 1.49 s, possible per Cluster. But audio samples are usually packed by a fixed number of samples, or a variable number of samples with a common base number, or even a multiple of 4. That packing unit can be set as the numerator of the audio TrackTimestampScale, which would then be Packing Unit * 100 / 1. That multiplies the possible amount of audio per Cluster. Even a Packing Unit of 4 would give 5.9 s of audio samples per Cluster, which is good enough.
So what happens when using only the legacy values to compute the timestamps? In the example above, the TimestampScale is 227 ns, the audio TrackTimestampScale is 100.0f and the video TrackTimestampScale is 147147.0f.
The first audio ticks are represented like this:
Audio Tick | Real timestamp (ns) | Block Value | Timestamp (ns) | Difference |
---|---|---|---|---|
0 | 0.0 | 0 | 0 | 0.0 |
1 | 22675.7 | 1 | 22700 | -24.3 |
2 | 45351.5 | 2 | 45400 | -48.5 |
3 | 68027.2 | 3 | 68100 | -72.8 |
4 | 90702.9 | 4 | 90800 | -97.1 |
5 | 113378.7 | 5 | 113500 | -121.3 |
.. | .. | .. | .. | .. |
65533 | 1486009088.0 | 65463 | 1486010112 | -1024.0 |
65534 | 1486031744.0 | 65464 | 1486032768 | -1024.0 |
65535 | 1486054400.0 | 65465 | 1486055552 | -1152.0 |
The Block Value is the integer stored in the Block, computed from the real timestamp, the TimestampScale and the TrackTimestampScale.
The second timestamp is the one a parser would deduce from the Block Value, the TimestampScale and the TrackTimestampScale.
The difference between the deduced and real timestamps happens because the 227 ns TimestampScale is not an exact value.
In the end, over the whole Cluster, the difference is always less than 11392 ns. That's less than one audio tick.
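A sketch of this legacy-only round trip (illustrative names), matching the first rows of the table:

```python
# Round-trip a sample tick through the legacy fields only:
# TimestampScale = 227 ns, audio TrackTimestampScale = 100.0.
TS, TTS = 227, 100.0
RATE = 44100

def roundtrip(tick):
    real_ns = tick * 10**9 / RATE            # exact sample timestamp
    block = round(real_ns / (TS * TTS))      # value stored in the Block
    parsed_ns = block * TTS * TS             # what a legacy parser deduces
    return real_ns, block, parsed_ns

real, block, parsed = roundtrip(1)
print(block, parsed, parsed - real)          # small, sub-tick difference
```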
That means even without adding any element, just by reviving the floating point TrackTimestampScale, we could store sample accurate timestamps. We just need to apply the rules above to find the proper TimestampScale and TrackTimestampScale, as if they were handled as rational numbers.
We could however add the original clock in each track to give the reader an accurate way to round the values (i.e., get the values of the second column when the values of the fourth column are computed). We have the SamplingFrequency, but it's in floating point; we should also do the same for video tracks. So probably a rational value stored with the generic TrackEntry fields.
I did a test program to run different scenarios: all the audio/video sampling frequencies listed above mixed (1 audio/1 video). The program can be found here.
The result of the run of this program is found in this dirty Markdown file.
In some cases there are some rounding errors that can't be recovered. There are also many cases where the possible duration of audio in a Cluster is way too small. So I added some examples with common packing, and then the duration is much more usable. It also avoids the rounding errors (see "Audio 11025 Hz (128 packed)" for example).
The video errors are always negligible as they only occur after very long durations, durations that are impossible to reach given the duration constraints on the audio.
Maybe I can add an extra layer and try the common packing sizes mentioned by @rcombs. But from a first look it seems to solve both the limited duration of audio in a Cluster and the possible rounding errors.
After computing the TrackTimestampScale of each track as a floating point value, rather than an integer (to match what a rational value would be), we can counter the rounding error introduced by the small TimestampScale in the most tricky cases.
In the end the only errors (half a tick, so the wrong sample/tick would be assumed on the output of the demuxer) occur on video tracks, in rare cases and after a long duration in a Cluster (145 s minimum, which is a lot).
The only remaining problem is that the number of audio samples possible in a Cluster with such small TimestampScale values is small. Sometimes only 0.19 s is possible (16-bit ticks). That can be solved by packing samples to achieve a possible duration per Cluster of over 5 s (the commonly acceptable amount). For the cases where only 0.19 s is possible (352800 Hz), packing by at least 263 samples should be sufficient. In most cases even packing 10 samples is sufficient.
In the end the packing problem is directly related to the sampling frequency of the audio. This problem exists regardless of the sample accuracy of timestamps. A high sampling frequency requires enough packing of samples to fit a useful duration in a Cluster.
This problem aside, we can always use TrackTimestampScale with a rounded TimestampScale (based on audio denominator * video denominator / GCD) to achieve sample accuracy.
Mixing more than one audio track might cause some problems if the sampling frequencies differ too much (don't fit the GCD). But for 2 tracks it's achievable all the time.
I made a small calculation error in my tests: the original sample frequency numerator was not used to compute the real timestamp. With examples where the numerator was artificially inflated (1/24 = 1000/24000 = 2000/48000) to try to match audio ranges, it gave an incorrect error. In fact, in all cases there is no error on the audio or video tracks.
The TrackTimestampScale countering the rounding of the TimestampScale is so efficient that it works even with a TimestampScale of 1 ns in all tests. This means it also works regardless of the number of tracks and their sampling frequencies. It could even be used for frequencies higher than a GHz (< 1 ns period).
So the real problem left is the amount of audio possible per Cluster. As said before, a high sampling frequency requires enough packing of samples to fit a useful duration in a Cluster. This is not a new problem, and the TrackTimestampScale has no effect on it (I double checked). There are 65536 ticks per Cluster per Track possible (always with the proper TrackTimestampScale, which now always has an optimum range).
Audio codecs usually pack a fixed number of samples per frame (or a few possible fixed values that may change within the same stream). Raw audio can do the same. In this case we can compute the TrackTimestampScale based on the "packing frequency" rather than the sampling frequency. For example the "packing frequency" of 10 samples packed at 44100 Hz would be 4410 Hz, allowing 10x more duration per Cluster. In other words, each tick is worth 10x more duration than without packing.
If we consider that it should always be possible to store at least 5 s of audio per Cluster, then the problem starts at frequencies higher than 13107 Hz (65536 ticks / 5 s). That's pretty much all the time.
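A quick way to see the packing trade-off (a sketch; the names are mine):

```python
# Duration a Cluster can span with 16-bit Block ticks, with and
# without packing ("packing frequency" = sample_rate / pack).
MAX_TICKS = 65536

def cluster_seconds(sample_rate, pack=1):
    # one tick per packed frame of `pack` samples
    return MAX_TICKS * pack / sample_rate

print(cluster_seconds(44100))        # unpacked: under 1.5 s
print(cluster_seconds(44100, 10))    # packed by 10: almost 15 s
print(MAX_TICKS / 5)                 # the 5 s threshold frequency, in Hz
```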
With packing we don't get the timestamp of each individual sample; we only get the timestamp of the first sample of each pack. But since we know the sampling frequency of the audio (the SamplingFrequency element) we can tell the exact timestamp of all the other samples.
For Variable Framerate video (I suppose it's rare for audio) there is another problem: there isn't one frequency. For example there might be film source (24 fps) and NTSC video source (29.97 fps) mixed in the same Segment. Or video captures that are sometimes at 60 fps, sometimes at 144 fps, and sometimes at values that just occur when they can (if the game is able to control the V-Sync directly).
I suppose other containers will also have a hard time giving an exact timestamp value for each frame.
In the case of 2 fixed sources mixed together, it should be possible to accommodate the TrackTimestampScale to both using the rational fractions. It will reduce the duration possible for that track. But even for 121 and 123 fps that gives a mixed frequency of 14883 Hz (with ticks from one or the other falling on exact ticks of this clock). That's more than 13107 Hz, which means we can't store 5 s in a Cluster, but it's pretty close (4.4 s).
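The mixed clock is simply the least common multiple of the two rates (a sketch):

```python
import math

# Common clock for two fixed frame rates mixed on one track:
# lcm(a, b) = a * b / gcd(a, b)
def mixed_clock_hz(a, b):
    return a * b // math.gcd(a, b)

clock = mixed_clock_hz(121, 123)
print(clock, 65536 / clock)    # mixed frequency and seconds per Cluster
```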
For too many or too heterogeneous sources there's not really a good solution. But these sources are doomed to never have accurate timestamps anyway. In that case a resolution of 0.1 ms (10000 Hz) should give a good estimate and enough duration (6.5 s) per Cluster.
Given all this I think #437 is a good all-around solution. It may not even require storing the exact fraction of the original (although it's probably needed to remux into other containers).
TrackTimestampScale has been in the Matroska specs forever and is supposed to be used by demuxers, so extending it in newer versions of Matroska should be a no-brainer. Unfortunately there are high chances that it's not used properly. Since no one has really used it so far (AFAIK), it's usually assumed to be 1.0 and discarded, all the math on timestamps being done with integers.
The proposed solution radically changes that. Almost all the time the TrackTimestampScale has a value very far from 1.0 (up to 10416667 in the frequencies I tested). For all parsers not using the TrackTimestampScale, only the first timestamp of a Cluster will be usable (tick 0); the rest will look very odd (usually way too small values). It should always be possible to adjust the TimestampScale so that one track has a TrackTimestampScale of 1.0; it should be the track with the highest "packing frequency". All other tracks will be almost unusable to a non-conformant parser, but at least one track will be usable (most likely the best audio track).
I think libavformat and the libmatroska based demuxers (including VLC) should handle this properly. That already covers a lot of players, demuxers, muxers.
TrackTimestampScale (formerly known as TrackTimecodeScale) is not part of WebM, so parsers exclusively dealing with WebM (the Firefox one; dunno about Chromium) may have issues with this.
Most TV/streaming boxes are probably not using libavformat or libmatroska so I'm not sure they handle this properly either.
The fact that each TrackTimestampScale should be computed using the "packing frequency" of the track will also add some friction. Matroska has always been codec agnostic, i.e. it doesn't need to know anything about a codec to mux it (although it does store information about the codec). Now to mux "accurately" we need to know, before writing/having any frame, how many samples will be found in each codec frame. Most codecs have a fixed value so it won't be too hard. But modern codecs have many window sizes. It can become tricky to know exactly what to use; in some cases there might not even be a common factor. In that case the "packing frequency" = "sampling frequency" and we can't store a lot of samples per Cluster, or we just give up on sample accuracy.
So I think for each (audio) codec we should mention the number of samples per frame that can be used safely (sampling frequency / that number = packing frequency). That should be done in the codec specs.
There is also the question of Cues and Chapters. Their timestamps are stored in "absolute" values (which means in nanoseconds). Introducing the TrackTimestampScale as an "unrounded" floating point value means the actual timestamp of a frame is not an exact multiple of the global TimestampScale anymore.
So Cues/Chapters referencing a particular frame (or audio block) should use the same value that will come out of the demuxer. The value is not always exactly the value that was written, but the error is small enough not to mistake it for another sample. Since demuxers/players will compare values when seeking, it's better to match exactly the value that will be read from the file.
Using packed audio with a factor on the TrackTimestampScale also means it won't be possible to reference (audio) samples individually. The granularity will depend on the factor applied to the TrackTimestampScale. For example, for samples packed by 40, the factor applied may be 2, 4, 5, 8, 10, 20, or 40. That allows more or less duration per Cluster, and more or less Cue precision inside a frame of a Block.
Actually once you have the timestamp of the first sample, you don't need to know the rest of the timing in the packed audio. It's outside of the container level that the right sample will be picked.
The timestamps in Cues/Chapters can either use the real timestamp (in nanoseconds) of the sample, or they could be shifted the same way the timestamp of the first sample in the Block is shifted. The difference is a rounding error that in the end will resolve to the same referenced sample. Apart from this, the container doesn't use the values for exact comparison, so it will have no impact there. I think it's better to use the real sample timestamp in that case.