matroska-specification
Consider providing a facility for integer-fraction timescales
It's pretty well-established that Matroska's poor timebase support is one of the format's worst properties. While it supports very precise timestamps (down to the nanosecond), it's very inefficient to do so (and the resulting values still aren't exact for most input rates), so muxers tend to default to 1ms timestamps, which can lead to a variety of subtle issues, especially with high-packet-rate streams (e.g. audio) and VFR video content. Muxers can choose rates that are closer to the time base of their inputs (or the packet rate of the content), but exactly how best to do so has always been unclear, and some of the possible options would lead to either worse player behavior, or timestamp drift. I'm proposing a format addition to remedy this.
The only actual normative change I propose is this: in addition to the classic nanosecond-denominator time scale, muxers could provide 2 additional integers, serving as a numerator and denominator time base value, which is required to round to the existing nanosecond-scaled value.
This should be paired with some advice for muxer implementations on how to make use of this feature. This depends on the properties of the input. For reference, here are some examples of the error produced by rounding a variety of common time bases to the nearest nanosecond, scaled by 3 hours (a reasonable target for the duration of a film):
nearest_ns(x) = round(x * 1,000,000,000) / 1,000,000,000
ceil_ns(x) = ceil(x * 1,000,000,000) / 1,000,000,000
floor_ns(x) = floor(x * 1,000,000,000) / 1,000,000,000
nearest_error(x) = 1 - (x / nearest_ns(x))
ceil_error(x) = 1 - (x / ceil_ns(x))
floor_error(x) = 1 - (x / floor_ns(x))
nearest_error_3h(x) = nearest_error(x) * 60 * 60 * 3
ceil_error_3h(x) = ceil_error(x) * 60 * 60 * 3
floor_error_3h(x) = floor_error(x) * 60 * 60 * 3
e(x) = nearest_error_3h(1 / x)
ce(x) = ceil_error_3h(1 / x)
fe(x) = floor_error_3h(1 / x)
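The definitions above can be evaluated exactly with rational arithmetic; here is a minimal Python sketch (function names mirror the definitions, and only the nearest-rounding variant is shown):

```python
from fractions import Fraction

NS = 10**9  # nanoseconds per second

def nearest_ns(x):
    """Round a tick duration x (seconds, as a Fraction) to the nearest ns."""
    return Fraction(round(x * NS), NS)

def nearest_error_3h(x):
    """Relative error of the rounded tick, accumulated over 3 hours (seconds)."""
    return (1 - x / nearest_ns(x)) * 60 * 60 * 3

def e(rate):
    """Drift after 3 hours for a tick rate given in ticks per second."""
    return float(nearest_error_3h(1 / Fraction(rate)))

print(e(24))                     # ~8.64e-5 s, matching the table below
print(e(Fraction(30000, 1001)))  # NTSC 29.97
print(e(44100))                  # raw 44.1 kHz: ~0.125 s, clearly audible
```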
# Integer video frame rates
e(24) => 8.64e-5
e(25) => 0
e(30) => -0.0001
e(48) => -0.0002
e(50) => 0
e(60) => 0.0002
e(120) => -0.0004
# NTSC video frame rates
e(24/1.001) => -8.6314e-5
e(30/1.001) => 0.0001
e(48/1.001) => 0.0002
e(60/1.001) => -0.0002
e(120/1.001) => 0.0004
# TrueHD frame rates
e(44100/40) => -0.0057
e(48000/40) => -0.0043
e(88200/40) => 0.0062
e(96000/40) => 0.0086
# AAC frame rates
e(44100/960) => -0.0002
e(48000/960) => 0
e(88200/960) => 0.0003
e(96000/960) => 0
e(44100/1024) => 0.0002
e(48000/1024) => -0.0002
e(88200/1024) => -0.0003
e(96000/1024) => 0.0003
# MP3 frame rates
e(44100/1152) => 8.4375e-6
e(48000/1152) => 0
e(88200/1152) => -0.0004
e(96000/1152) => 0
# Other audio frame rates
e(44100/128) => -0.0012
e(48000/128) => 0.0013
e(88200/128) => -0.0012
e(96000/128) => -0.0027
e(44100/2880) => -7.425e-5
e(48000/2880) => 2.3981e-12
e(88200/2880) => -7.425e-5
e(96000/2880) => 2.3981e-12
# GCF of common short-first audio frame sizes
e(44100/64) => -0.0012
e(48000/64) => -0.0027
e(88200/64) => 0.0062
e(96000/64) => 0.0054
# Raw audio sample rates
e(44100) => 0.1253
e(48000) => -0.1728
e(88200) => 0.1253
e(96000) => 0.3456
fe(44100) => -0.351
ce(48000) => 0.3456
fe(88200) => -0.8273
fe(96000) => -0.6912
# MPEGTS time base
e(90000) => -0.108
ce(90000) => 0.8639
# Common multiples
e(30000) => -0.108
e(60000) => 0.216
e(120000) => -0.432
e(240000) => 0.8639
e(480000) => -1.7283
ce(30000) => 0.216
fe(60000) => -0.432
ce(120000) => 0.8639
fe(240000) => -1.7283
ce(480000) => 3.4549
As we can see, rounding common video and audio frame rates (including e.g. the least common multiple of 24 and 60 for that VFR case) produces a negligible amount of error over a reasonable duration. This means that for content where all timestamps can reasonably be expressed in integer values of those rates, there would be no significant error over common file durations, even if different streams were muxed with different time bases.
There are a few real-world time bases that would produce significant rounding error (upwards of 100ms) over the course of 3 hours when used in existing players: MPEGTS's 90000Hz, all common raw audio sample rates, and least-common-multiples between integer and NTSC video frame rates. This essentially means that mixing these rates with others would produce significant desync over a reasonable duration for static on-disk content; the same issue could occur when muxing very lengthy content (e.g. streaming).
All of these issues can be addressed in one of the following ways:
- Using a lower rate (e.g. 90,000Hz isn't usually the real content rate but instead an artifact of its previous container; expressing timestamps in samples rather than frames is usually unnecessary)
- Choosing the highest of the input rates for all streams (e.g. 48000 is a multiple of many common frame rates, including 24/1.001)
- Choosing a more precise common-multiple rate that may create a larger total drift, but does so equally for all streams (see the "Common multiples" section; 1/30000 is suitable for mixing 24fps and 30/1.001fps content alongside most common framed audio rates, while the later listed bases are suitable for increasingly large sets).
- Round some tracks' nanosecond timescales in the opposite direction, creating a larger drift, but potentially one with the same sign (and thus a closer value) as the drift in other tracks (this is probably too complex and niche to have substantial use)
- Fall back to classic rounded nanosecond-based timestamps (and don't write an integer-fraction time base at all)
- Use the extension, resulting in significant sync drift in older players that haven't implemented the change
This last option is usually unacceptable, but may be fine for files that use codecs that become available after the change is made (and thus are unavoidably non-backwards-compatible anyway).
If combined with clear advice in the spec on how muxers SHOULD (or MAY) decide on time bases for various possible input cases, I think this extension could get actual adoption in muxers and solve one of the format's longest-standing problems.
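To illustrate the common-multiple option above: for rational rates, a rate that every input divides evenly is the lcm of the numerators over the gcd of the denominators. A hypothetical helper (not part of the proposal itself):

```python
from fractions import Fraction
from math import gcd, lcm

def common_rate(rates):
    """A tick rate that every input rate divides evenly:
    lcm of numerators over gcd of denominators."""
    rates = [Fraction(r) for r in rates]
    num = lcm(*(r.numerator for r in rates))
    den = gcd(*(r.denominator for r in rates))
    return Fraction(num, den)

# e.g. 24 fps film + NTSC 30000/1001 video + 48 kHz framed audio
print(common_rate([24, Fraction(30000, 1001), 48000]))  # → 240000
```

Note that 240000 is one of the rates from the "Common multiples" section, with the corresponding ~0.86 s of 3-hour nanosecond-rounding drift shown there.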
…I just realized I'd misremembered where timescales are specified when writing this (they're on the segment, not the track). Still, the same concept applies, just requiring common-multiple rates (though the TrackTimestampScale element could be used to account for this to some extent; it's deprecated, but all existing players other than MPlayer-derived ones seem to support it).
Hi, I see this was brought up as an issue in the GitHub repository and am cross-posting to the cellar working group.
On Sep 7, 2020, at 12:02 AM, rcombs [email protected] wrote:

[original proposal quoted in full; snipped, see above]
This has been discussed on the list before though I don’t remember clear consensus on how to address this. Steve even compiled a list of discussions on this at https://mailarchive.ietf.org/arch/msg/cellar/ZpZxhG1gML9xVx_ir1Jf6_gcI8U/.
I proposed an option in https://mailarchive.ietf.org/arch/msg/cellar/mTprgjNqVbe20e6hyYxns8ZnVwY/ where one of the existing reserved bits of the Block Header (in the byte that contains the keyframe, invisible, and lacing flags) would be used as a flag for Timescale Alignment.
With this approach, new elements could be added to the track header with a numerator and denominator of a rational time scale, and if Timescale Alignment were set to true, then the nearest increment of the rational time scale would be used. Example:
Thus if the frame rate of the track header is 120000/1001, then
If Matroska timecode is 4 and Enable TimeScale Alignment is 0, then it is at 4 / (1000000000 / TimecodeScale). If Matroska timecode is 4 and Enable TimeScale Alignment is 1, then it is at 0 / 120000 (the nearest increment of the rational frame rate).
If Matroska timecode is 17 and Enable TimeScale Alignment is 0, then it is at 17 / (1000000000 / TimecodeScale). If Matroska timecode is 17 and Enable TimeScale Alignment is 1, then it is at 2002 / 120000 (the nearest increment of the rational frame rate).
If a Matroska demuxer doesn’t understand the new num/denom elements or the Alignment flag, it would simply use the existing nanosecond timestamp system.
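As I read the idea, the snapping step works like this (a sketch; the element and flag names in the proposal are still placeholders):

```python
from fractions import Fraction

def aligned_time(block_timecode, timecode_scale_ns, tick):
    """Snap a Matroska timestamp to the nearest increment of a rational
    tick duration (seconds, as a Fraction), per the Timescale Alignment idea."""
    seconds = Fraction(block_timecode * timecode_scale_ns, 10**9)
    n = round(seconds / tick)   # nearest whole number of ticks
    return n * tick

tick = Fraction(1001, 120000)   # frame duration of a 120000/1001 fps track
print(aligned_time(4, 1_000_000, tick))   # → 0 (i.e. 0/120000 s)
print(aligned_time(17, 1_000_000, tick))  # → 1001/60000 (= 2002/120000 s)
```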
In that thread there were other proposals, for example Steve discussed using a float to depict a point in time. Dave
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)
Of course all of these ideas are terrible hacks compared to just storing it in the correct way.
On Sep 8, 2020, at 11:52 AM, wm4 [email protected] wrote:
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)
That sounds interesting: to have the rounding error numerator in each block and the rounding error denominator in the track header. Perhaps a rounding error denominator could also be in the block but defaults to the one within the track header.
Of course all of these ideas are terrible hacks compared to just storing it in the correct way.
Yes, it is a challenge to fix this and maintain reverse compatibility. Dave
Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)
I don't like this as storing rounding errors is imprecise as well (unless the global timestamp scaling factor is a multiple of the rounding error's denominator). I'm also quite unsure which denominator a multiplexer should choose. In order to express a rounding error precisely it must have a much higher resolution than the usual 1ms resolution of Matroska timestamps. For example, with 1001/30000 FPS content the rounding error will always be below one frame duration, therefore you'll have to make the denominator much larger.
Something else that came to mind when reading our previous discussion that Dave linked to: please keep in mind that any solution that sets values for the whole track in the track header will inevitably fail with mixed frame rate content or content with different interlacing, e.g. when multiplexing from an MPEG transport stream recorded from a DVB broadcast. Those bloody streams change frame rates all the time when the program changes, e.g. when transitioning to and from commercials (or just from an announcement to the movie). With our new and shiny precise timestamp calculation we'll either have to forbid such changes (unrealistic) or provide facilities to signal such changes in the form of some type of global index similar to cues. Unlike cues, though, such an index would have to be mandatory (a file without cues can be played just fine, even seeking works similar to seeking in Ogg files — meaning some kind of binary search).
File types whose timestamps are based solely on a stream's regular sampling frequency (MP4 usually is, but doesn't have to; Ogg does, too) all share those issues. MPEG TS on the other hand uses a 90 KHz-based clock which is fine for most video stuff but doesn't have enough resolution for sample-precision timing of audio tracks with high sampling frequency.
… in the correct way.
Due to what I've written above I'm pretty sure that there is no one correct way to store timestamps for a general purpose container that allows its content to change its time base in the middle.
In theory Matroska's timestamps can have sample-precision already (just make global timestamp scale small enough to match all of the tracks' time bases). The problem is with the waste of space that follows due to the bloody 16-bit integer offset in Block & SimpleBlock.
So if we're thinking about breaking compatibility anyway, why not think about a whole new SimpleBlock V2 that allows for much larger relative timestamps? Would make all existing players incompatible, though.
Another idea that only wastes space but doesn't destroy existing players' ability to play the file: adding a new child to Block called PreciseRelativeTimestamp or whatever that contains the difference between the timestamp-scaling-based timestamp & the actual, precise one, in nanoseconds. Cannot be used with SimpleBlocks, of course. Will take several bytes per BlockGroup.
I don't like this as storing rounding errors is imprecise as well (unless the global timestamp scaling factor is a multiple of the rounding error's denominator).
It can be 100% exact. It's the rounding error after all - the number that needs to be added to the "classic" ms timestamp to get the fractional timestamp.
I'm also quite unsure which denominator a multiplexer should choose. In order to express a rounding error precisely it must have a much higher resolution than the usual 1ms resolution of Matroska timestamps. For example, with 1001/30000 FPS content the rounding error will always be below one frame duration, therefore you'll have to make the denominator much larger.
It seems the denominator of the rounding error is simply the denominator of the original timestamp. E.g. in this case, the rounding error would have denominator 30000 and numerator (n*1001/30000 - int(n*1001/30000*1000)/1000) * 30000 for frame n, or something like this. This is probably wrong, just typing this out casually. Actually it probably also needs a constant numerator part (to be stored in the track header) of 1001.
Something else that came to mind when reading our previous discussion that Dave linked to: please keep in mind that any solution that sets values for the whole track in the track header will inevitably fail with mixed frame rate content or content with different interlacing, e.g. when multiplexing from an MPEG transport stream recorded from a DVB broadcast. [...]
What does Matroska do if the codec changes? Transport streams can do that, Matroska can't do that. I feel like bringing up such cases just complicates the whole discussion. You can't fix everything at the same time. But you can stall any progress by wanting to consider every possible future feature and requirement.
Besides, as was suggested in a previous post, the denominator part could be overridden per packet. This would cause some bytes of overhead in such obscure cases as mixing multiple framerates that are not known in advance.
In theory Matroska's timestamps can have sample-precision already (just make global timestamp scale small enough to match all of the tracks' time bases). The problem is with the waste of space that follows due to the bloody 16-bit integer offset in Block & SimpleBlock.
I guess you mean the fact that every packet will need its own cluster. But AFAIK that still doesn't give a way to get fractional timestamps? So, not an option.
So if we're thinking about breaking compatibility anyway, why not think about a whole new SimpleBlock V2 that allows for much larger relative timestamps? Would make all existing players incompatible, though.
Obviously not an option. If it were specified, it's likely everyone would disable this by default, except people who use Matroska in special setups where they control producer and consumer.
Another idea that only wastes space but doesn't destroy existing players' ability to play the file: adding a new child to Block called PreciseRelativeTimestamp or whatever that contains the difference between the timestamp-scaling-based timestamp & the actual, precise one, in nanoseconds. Cannot be used with SimpleBlocks, of course. Will take several bytes per BlockGroup.
I thought that was what I proposed here (except I wanted to use fractional numbers).
PS: I think obsessing about a few bytes per packet isn't useful. Having precise timestamps, even if it introduces overhead, is much more important. Nobody will discard Matroska as an option because it doesn't go to the edge of the theoretically possible for saving overhead.
What does Matroska do if the codec changes? Transport streams can do that, Matroska can't do that.
True. The difference is that having multiple time bases in the same track is something that exists & works today.
I'm really not trying to prevent progress here, and I'm not talking about each and every possible situation. I am talking about one specific situation that is in widespread use today.
What I am trying to prevent is implementing a scheme that's supposed to improve one aspect that simultaneously makes another aspect worse. Hence me talking about ways to signal a change in time base mid-stream. We'd also have to signal a precise timestamp at the point of change in time base so that the player can reconstruct the whole timeline properly without having to read blocks at each change in time base.
It seems like there are a few ways discussed to correct this:
- Express the time-base in the track so the demuxer can adjust the timestamps in the file to the closest increment of the time-base
- Express a fractional error value using a denominator in the track and numerator in the packet so the demuxer can give more precise timestamps
- Potentially allow overriding the denominator on a per-packet basis
- Express a second timestamp using a fractional time-base stored in the track
- Potentially allow just expressing the timestamp in a numerator/denominator so as to ignore/override the track's time-base
All of these would still require the current timestamp to still exist and thus would be compatible with current demuxers but newer demuxers would be able to read/derive more precise timestamps.
It seems the denominator of the rounding error is simply the denominator of the original timestamp.
Close. When I saw this first suggested, I did some math and figured out what it would be for the case of 44.1kHz AAC audio (this is what really sparked this conversation; see below). In this case the samples are 1024/44100 seconds long, with the MKA using 1ms precision on the timestamps, and so the error can be expressed as n*1024/44100 - m/1000, where m is the timestamp in the MKA and n is the packet number. To express the error exactly in integers, the denominator is lcm(1000, 44100) (your basic fractions with common denominator), which in this case is 441000. Using some quick examples:
| Packet number | MKA timestamp | Error (using 441,000 as the denominator) |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 23 | 97 |
| 2 | 46 | 194 |
| 3 | 70 | -150 |
| 354 | 8220 | -60 |
| 355 | 8243 | 37 |
Also worth noting: the duration would likely need the same treatment.
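The table above can be reproduced with exact integer arithmetic; a sketch assuming 1 ms stored timestamps and the lcm(1000, 44100) = 441000 denominator:

```python
from fractions import Fraction
from math import lcm

SR, SAMPLES = 44100, 1024       # 44.1 kHz AAC, 1024 samples per packet
DEN = lcm(1000, SR)             # 441000

def error_numerator(n):
    """Exact (true - stored) error for packet n, over the fixed denominator DEN."""
    true = Fraction(n * SAMPLES, SR)    # true time in seconds
    stored = round(true * 1000)         # the 1 ms timestamp a muxer would write
    return (true - Fraction(stored, 1000)) * DEN

for n in (0, 1, 2, 3, 354, 355):
    print(n, error_numerator(n))        # → 0, 97, 194, -150, -60, 37
```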
Aside: I sparked this conversation in an internal discussion with @rcombs about AAC 44.1kHz audio in an MKA format. I was remuxing this to MPEG-TS and the MKA had only 1ms precision timestamps. Well, a simple remux would be multiplying these timestamps by 90 to match MPEG-TS. This simple remux resulted in packets at times 0, 2070, 4140, 6300, 8370 … which gave them effective durations of 0, 2070, 2070, 2160, 2070 and this inconsistency would cause stuttering audio in Apple's HLS demuxer. So this meant that remuxing MKA -> MPEG-TS required opening a codec in lavc to get more precise durations and thus derive timestamps without error.
P.S. These imprecise timestamps were one of the more annoying things we had to deal with in Perian's MKV demuxer, and that was over a decade ago.
The reason not to introduce a SimpleBlock v2 is that any hardware/software player that doesn't know it won't be able to play the files. It can be done in Matroska v5. Such files, being unreadable by pre-v5 parsers, will also be marked as such. We might as well call it Matroska2 or something, just like WebM shares a lot with Matroska.
The practical question is whether there is a convenient way to have precise timestamps in v4 and make it work in existing players (and WebM, I know that's something they want as well).
The question about VFR (Variable Frame Rate) is not really an issue IMO. In the end you only have 1 or 2 frame rates mixed, and maybe with the same denominator. All you need is a fraction that handles both. Facebook even created a timebase that covers most common timebases for video. As long as you know the timebases you'll have to deal with before muxing you should be fine.
An important thing to note is that floating point should not be used at all (we want precision). All we have is the Matroska timebase x/1,000,000,000 s (x=TimestampScale) and the source material timebase(s) (a/44100, b/48000, c/24, d*1001/30000, etc). They are all fractions. So we should be able to find something that works with just fractions, using common denominators, fraction reduction, etc. It can get to large numbers very quickly as there are multiple tracks with different timebases (or odd fractions when a track uses VFR, see above).
What we have now is a timestamp for each Block as a fraction of TimestampScale/1,000,000,000.
What we want is a timestamp for each Block as a fraction of the source material. The difference between the two values is still a fraction. We can store this difference as a fraction. And we must also store the source material fraction.
Now we just have to do the math to find this "difference as a fraction". In particular to minimize the storage needed to do so if possible (if not, mandating BlockGroup for precise tracks is always an option). If we can fit it inside the 3 reserved bits of the SimpleBlock it would be perfect.
ISO/IEC 14496-12 "ISO base media file format" uses a "timescale" (counts per second) and "media sample durations". If timescale=30000 and media sample duration is 1001, you get NTSC fractional frame rate.
Similarly, ISO/IEC 14496-10 "Advanced Video Coding" has a clock tick defined as num_units_in_tick divided by the time_scale (see equation C-1). The presence of these in VUI is indicated by the timing_info_present_flag. For NTSC, time_scale may be 30000 and num_units_in_tick may be 1001.
Following my "pure rational numbers" approach we can say the following, for a Track sampled at the original frequency, stored in a Matroska Segment with TimestampScale:
The real timestamp for each sample S is: real(S) = S / frequency
The Matroska timestamp for the same sample is matroska(S) = S * TimestampScale / 1,000,000,000
The Cluster timestamp is just a value to add to S to get the proper value, so we can skip it for now. As we just check the rational values, the rounding introduced by divisions is not taken into account.
The difference between the real timestamp and the one we get from Matroska is:
real(S) - matroska(S)
= S / frequency - S * TimestampScale / 1,000,000,000
= (S * 1,000,000,000) / (frequency * 1,000,000,000) - (S * TimestampScale * frequency) / (frequency * 1,000,000,000)
= S * (1,000,000,000 - TimestampScale * frequency) / (frequency * 1,000,000,000)
We can already deduce a few things from this:
- The error grows linearly with the value of S.
- If TimestampScale is exactly 1,000,000,000 / frequency, there is no error.
- The bigger the rounding error of 1,000,000,000 / frequency, the more the Matroska and real timestamps will diverge.
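The divergence formula can be checked numerically with exact rationals; a small sketch:

```python
from fractions import Fraction

NS = 10**9

def divergence(S, frequency, timestamp_scale):
    """real(S) - matroska(S) = S * (1e9 - TimestampScale * frequency)
    / (frequency * 1e9), in seconds (exact), per the derivation above."""
    return Fraction(S * (NS - timestamp_scale * frequency), frequency * NS)

# 8000 Hz: 1e9 / 8000 = 125000 exactly, so zero error
print(divergence(8000 * 3600, 8000, 125000))          # → 0
# 48000 Hz: 1e9 / 48000 rounds to 20833, so the error grows linearly with S
print(float(divergence(48000 * 3600, 48000, 20833)))  # ~0.0576 s after one hour
```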
That gives some sampling frequencies where it's possible to achieve 0 error per sample:
- Audio: 8000 Hz, 16000 Hz, 32000 Hz, 50000 Hz, 64000 Hz
- Video: 25 fps, 50 fps, 100 fps
That leaves out a lot of common ones:
- Audio: 11025, 22050, 37800, 44056, 44100, 48000, 50400, 88200, 96000, 176400, 192000, 352800
- Video: 16, 24000/1001, 24, 30000/1001, 30, 60000/1001, 48, 60, 90, 100, 120000/1001, 120
The other way to reduce the error is to reduce the value of S. We already effectively reduce the value we store to a 16-bit integer, so the value is always between -32,768 and 32,767. If we were to store the error in the remaining 3 bits of a SimpleBlock, that's still 13 bits too many.
By limiting the possible values of S in a Cluster to [-4,3] (3 bits), in other words 8 frames, it is possible to store each frame with the Matroska timestamp and the error based on TimestampScale * frequency. This is also feasible because audio is usually not found as single samples, but as chunks of samples in one Frame. Sometimes all chunks have the same amount of samples, sometimes not, but each amount of samples is based on the same multiple (the worst case scenario is many unrelated chunk sizes).
For video that means at most 8 frames per Cluster; for a 29.97 fps file that's 267 ms. This is very small.
A Block has one extra free bit, so we could double these values. That's still very small IMO. And that's the case where the TimestampScale is precisely adjusted for one track. When you have 2 or more, finding a value of TimestampScale that works well with all frequencies becomes even harder.
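For reference, the Cluster-duration limit implied by restricting S to a few bits (a sketch of the hypothetical layout discussed above, not an existing format feature):

```python
from fractions import Fraction

def max_cluster_span(bits, tick):
    """Span covered by 2**bits ticks of the given tick duration (seconds)."""
    return 2**bits * tick

ntsc_tick = Fraction(1001, 30000)             # 29.97 fps frame duration
print(float(max_cluster_span(3, ntsc_tick)))  # 8 frames ≈ 0.267 s
print(float(max_cluster_span(4, ntsc_tick)))  # the extra Block bit doubles it
```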
I think the scope where it works, even with the proper muxing guidelines, is too narrow to be worth using all the reserved bits. In particular because common frequencies like 44100 Hz or 30000/1001 fps will introduce errors no matter what and will need to use this system.
There could be other clever ways to do this. We could use a bit in the Block that says the timestamp "shift" is stored after/before the Block data, but that would be incompatible with all existing readers. That would be equivalent to using a new lacing format.
Another way would be to force using a BlockGroup to have precise timing and store the "shift" in a new element. It might only need 16 bits of storage, so that would translate into 3 extra octets per BlockGroup.
It seems one aspect of this not discussed is how the rounding of the current system works and how it could be adapted. We assume that we start with the current system and try to fit the correct fraction in there. We could do it the other way around, i.e. have the fraction and use that to set the Block/Cluster timestamp value. The rounding error is then on older parsers assuming a timestamp value when in fact it's another value. But the old system is already known to be imprecise/inaccurate. It's not assumed to be sample precise. So a little more, a little less rounding error should not be a big deal.
What we cannot really do is add some information per-track to modify how the Block/SimpleBlock values are interpreted. That would break backward compatibility. For that we would need BlockV2 and SimpleBlockV2.
So we could store the TimestampScale and a fraction that is the actual fraction it's based on.
Let's see what happens for 29.97 fps video, i.e. 30000/1001 Hz. The most accurate TimestampScale is 33,366,667 (nanoseconds per frame/lace, rounded). We also store the Segment timestamp fraction as {30000, 1001}:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 33366667 ns | 33366666 ns | 1 ns |
2 | 2 | 66733334 ns | 66733333 ns | 1 ns |
3 | 3 | 100100001 ns | 100100000 ns | 1 ns |
4 | 4 | 133466668 ns | 133466666 ns | 2 ns |
5 | 5 | 166833335 ns | 166833333 ns | 2 ns |
6 | 6 | 200200002 ns | 200200000 ns | 2 ns |
7 | 7 | 233566669 ns | 233566666 ns | 3 ns |
8 | 8 | 266933336 ns | 266933333 ns | 3 ns |
9 | 9 | 300300003 ns | 300300000 ns | 3 ns |
10 | 10 | 333666670 ns | 333666666 ns | 4 ns |
.. | .. | .. | .. | .. |
65532 | 65532 | 2186584421844 ns | 2186584400000 ns | 21844 ns |
65533 | 65533 | 2186617788511 ns | 2186617766666 ns | 21845 ns |
65534 | 65534 | 2186651155178 ns | 2186651133333 ns | 21845 ns |
65535 | 65535 | 2186684521845 ns | 2186684500000 ns | 21845 ns |
The Old Parser timestamp is the timestamp older parsers would see: Block Value * TimestampScale. The Real timestamp is the one using the fraction: Block Value * 1001 / 30000.
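The table above can be reproduced with a short sketch (the helper names are mine, not part of any spec):

```python
# Compare what a legacy parser computes against the exact fractional
# timestamp for 30000/1001 fps; values match the table above.
TS = 33_366_667          # rounded TimestampScale, ns per tick
NUM, DEN = 1001, 30000   # exact frame duration, as a fraction of a second

def old_parser_ns(tick):
    # legacy interpretation: tick * TimestampScale
    return tick * TS

def real_ns(tick):
    # exact timestamp, truncated to whole nanoseconds as in the table
    return tick * NUM * 10**9 // DEN

for tick in (1, 10, 65535):
    print(tick, old_parser_ns(tick), real_ns(tick),
          old_parser_ns(tick) - real_ns(tick))
```

At tick 65535 the two interpretations differ by 21,845 ns, matching the last row of the table.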
For 44100 Hz audio we get the following, with a TimestampScale of 22,676 (nanoseconds per frame/lace, rounded).
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 22676 ns | 22675 ns | 1 ns |
2 | 2 | 45352 ns | 45351 ns | 1 ns |
3 | 3 | 68028 ns | 68027 ns | 1 ns |
4 | 4 | 90704 ns | 90702 ns | 2 ns |
5 | 5 | 113380 ns | 113378 ns | 2 ns |
6 | 6 | 136056 ns | 136054 ns | 2 ns |
7 | 7 | 158732 ns | 158730 ns | 2 ns |
8 | 8 | 181408 ns | 181405 ns | 3 ns |
9 | 9 | 204084 ns | 204081 ns | 3 ns |
10 | 10 | 226760 ns | 226757 ns | 3 ns |
.. | .. | .. | .. | .. |
1636 | 1636 | 37097936 ns | 37097505 ns | 431 ns |
1637 | 1637 | 37120612 ns | 37120181 ns | 431 ns |
1638 | 1638 | 37143288 ns | 37142857 ns | 431 ns |
1639 | 1639 | 37165964 ns | 37165532 ns | 432 ns |
1640 | 1640 | 37188640 ns | 37188208 ns | 432 ns |
.. | .. | .. | .. | .. |
65532 | 65532 | 1486003632 ns | 1485986394 ns | 17238 ns |
65533 | 65533 | 1486026308 ns | 1486009070 ns | 17238 ns |
65534 | 65534 | 1486048984 ns | 1486031746 ns | 17238 ns |
65535 | 65535 | 1486071660 ns | 1486054421 ns | 17239 ns |
The difference is less than one sample. When packed at 40 samples per frame (the shortest packing in @rcombs' example), we would then use a fraction of {40, 44100} and a TimestampScale of 907,029:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 907029 ns | 907029 ns | 0 ns |
2 | 2 | 1814058 ns | 1814058 ns | 0 ns |
3 | 3 | 2721087 ns | 2721088 ns | -1 ns |
4 | 4 | 3628116 ns | 3628117 ns | -1 ns |
5 | 5 | 4535145 ns | 4535147 ns | -2 ns |
.. | .. | .. | .. | .. |
47392 | 47392 | 42985918368 ns | 42985941043 ns | -22675 ns |
47393 | 47393 | 42986825397 ns | 42986848072 ns | -22675 ns |
47394 | 47394 | 42987732426 ns | 42987755102 ns | -22676 ns |
47395 | 47395 | 42988639455 ns | 42988662131 ns | -22676 ns |
47396 | 47396 | 42989546484 ns | 42989569160 ns | -22676 ns |
47397 | 47397 | 42990453513 ns | 42990476190 ns | -22677 ns |
.. | .. | .. | .. | .. |
65533 | 65533 | 59440331457 ns | 59440362811 ns | -31354 ns |
65534 | 65534 | 59441238486 ns | 59441269841 ns | -31355 ns |
65535 | 65535 | 59442145515 ns | 59442176870 ns | -31355 ns |
We get less than 1 sample of error with 47393 frames stored, or about 43 s worth of samples in a Cluster.
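That bound can be checked with exact rational arithmetic (a sketch; the variable names are illustrative):

```python
from fractions import Fraction

RATE = 44100
PACK = 40  # samples per frame, per @rcombs' example

tick = Fraction(PACK, RATE)            # exact tick duration, in seconds
ts = round(tick * 10**9)               # legacy TimestampScale, ns (rounded)
err_per_tick = tick * 10**9 - ts       # exact drift per tick, in ns
sample_ns = Fraction(10**9, RATE)      # duration of one sample, in ns

# number of ticks before the legacy timestamp drifts by a full sample
max_ticks = int(sample_ns / abs(err_per_tick))
print(ts, max_ticks)
```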
The worst case scenario is the highest frequency that is not easily divisible: 352800 Hz. It gives:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 2834 ns | 2834 ns | 0 ns |
2 | 2 | 5668 ns | 5668 ns | 0 ns |
3 | 3 | 8502 ns | 8503 ns | -1 ns |
4 | 4 | 11336 ns | 11337 ns | -1 ns |
5 | 5 | 14170 ns | 14172 ns | -2 ns |
.. | .. | .. | .. | .. |
6066 | 6066 | 17191044 ns | 17193877 ns | -2833 ns |
6067 | 6067 | 17193878 ns | 17196712 ns | -2834 ns |
6068 | 6068 | 17196712 ns | 17199546 ns | -2834 ns |
6069 | 6069 | 17199546 ns | 17202380 ns | -2834 ns |
.. | .. | .. | .. | .. |
65533 | 65533 | 185720522 ns | 185751133 ns | -30611 ns |
65534 | 65534 | 185723356 ns | 185753968 ns | -30612 ns |
65535 | 65535 | 185726190 ns | 185756802 ns | -30612 ns |
Here we achieve less than one sample of error when there are fewer than 6067 samples in a Cluster. This can be doubled by using signed values for the Block timestamp value: the range to get less than one sample of error becomes [-6067, 6066]. And by packing samples in groups of at least 11, we always get less than 1 sample of error. With 22 samples we get less than half a sample duration of error, which should be enough with rounding.
So with single track files we can probably achieve sample precision easily.
With mixed frequencies it becomes more complicated, for example the 29.97 fps video with the 44100 Hz audio. We have 1001/30000 and 1/44100, so the fraction to use would be 1001/reduced(30000, 44100), where reduced(A, B) is the two numbers multiplied together and divided by their Greatest Common Divisor. In this case (30000 * 44100) / 100 = 13230000. That gives a rounded TimestampScale of 75,661 ns/tick.
That gives these Blocks:
Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|
0 | 0 ns | 0 ns | 0 ns |
1 | 75661 ns | 75661 ns | 0 ns |
2 | 151322 ns | 151322 ns | 0 ns |
3 | 226983 ns | 226984 ns | -1 ns |
4 | 302644 ns | 302645 ns | -1 ns |
5 | 378305 ns | 378306 ns | -1 ns |
6 | 453966 ns | 453968 ns | -2 ns |
7 | 529627 ns | 529629 ns | -2 ns |
8 | 605288 ns | 605291 ns | -3 ns |
9 | 680949 ns | 680952 ns | -3 ns |
10 | 756610 ns | 756613 ns | -3 ns |
11 | 832271 ns | 832275 ns | -4 ns |
12 | 907932 ns | 907936 ns | -4 ns |
.. | .. | .. | .. |
441 | 33366501 ns | 33366666 ns | -165 ns |
.. | .. | .. | .. |
882 | 66733002 ns | 66733333 ns | -331 ns |
.. | .. | .. | .. |
1323 | 100099503 ns | 100100000 ns | -497 ns |
.. | .. | .. | .. |
32634 | 2469121074 ns | 2469133333 ns | -12259 ns |
For the video track we would get something like this:
Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 441 | 33366501 ns | 33366666 ns | -165 ns |
2 | 882 | 66733002 ns | 66733333 ns | -331 ns |
3 | 1323 | 100099503 ns | 100100000 ns | -497 ns |
.. | .. | .. | .. | .. |
74 | 32634 | 2469121074 ns | 2469133333 ns | -12259 ns |
.. | .. | .. | .. | .. |
148 | 65268 | 4938242148 ns | 4938266666 ns | -24518 ns |
We can store almost 5 s in a Cluster.
For the audio track, on the other hand, we cannot recover each sample easily:
Sample Number | Real timestamp | Block Value |
---|---|---|
0 | 0 ns | 0 |
1 | 22675 ns | ~0 |
2 | 45351 ns | ~1 |
3 | 68027 ns | ~1 |
4 | 90702 ns | ~1 |
5 | 113378 ns | ~1 |
6 | 136054 ns | ~2 |
7 | 158730 ns | ~2 |
.. | .. | .. |
300 | 6802721 ns | ~90 |
The Block Value doesn't map to an exact sample timestamp (and vice versa).
It seems that if we apply a factor of 3 we may get better results. So we could have a Segment fraction of 1001/(3 * 13230000), with a rounded TimestampScale of 25,220 ns/tick.
Sample Number | Block Value | Old Parser timestamp | Real timestamp | Difference |
---|---|---|---|---|
0 | 0 | 0 ns | 0 ns | 0 ns |
1 | 1 | 25220 ns | 22675 ns | 2545 ns |
2 | 2 | 50440 ns | 45351 ns | 5089 ns |
3 | 3 | 75660 ns | 68027 ns | 7633 ns |
4 | 4 | 100880 ns | 90702 ns | 10178 ns |
5 | 5 | 126100 ns | 113378 ns | 12722 ns |
6 | 6 | 151320 ns | 136054 ns | 15266 ns |
7 | 7 | 176540 ns | 158730 ns | 17810 ns |
8 | 8 | 201760 ns | 181405 ns | 20355 ns |
9 | 9 | 226980 ns | 204081 ns | 22899 ns |
10 | 10 | 252200 ns | 226757 ns | 25443 ns |
.. | .. | .. | .. | .. |
65534 | 65534 | 1652767480 ns | 1486031746 ns | 166735734 ns |
65535 | 65535 | 1652792700 ns | 1486054421 ns | 166738279 ns |
We lose about 1 sample of precision every 10 samples, or 10%. For a full Block that's about a 166 ms shift (or rather half that when using signed 16 bits). That's a lot. Even packed at 40 samples per frame that's still about 20 ms, when such a frame is 1 ms.
If we use the full fraction {1001, 30000*44100} we cannot store more than one video frame per Cluster.
There doesn't seem to be a system where it works by storing the Block value as a real fraction value, at least when mixing "heterogeneous" frequencies. It works with single tracks or frequencies that are easily divisible. And not if we want to keep backward compatibility (Block/SimpleBlock).
A little background on this: for adaptive streaming it's important, when you switch from one "quality" (representation) to another, to switch exactly at the frame and audio you want. I don't know if they are sample exact for audio, especially as each codec (or different encoding parameters) may pack a different number of samples per frame, so the boundaries don't totally overlap. Maybe there's an offset that tells on which sample to start, or an exact clock gives the exact timestamp for each sample in each representation anyway.
Given that, the important phrase here is:
So with single track files we can probably achieve sample precision easily.
In adaptive streaming you don't (usually) use muxed tracks, so you can pick each channel independently with the best possible choice at any given time. In these conditions we can be sample precise. All we need is to tell the original clock (numerator/denominator) of the Track. A new parser would use that value with the Block timestamp value. Older parsers would not see it and would use the Block timestamp value with the global TimestampScale. As described above, the difference is minimal, as long as the TimestampScale matches the fraction.
I'll send a proposal for new elements to store this fraction and the necessary changes on how to interpret the timestamps.
The larger problem is that we want a rational number that works for all the tracks (theoretically possible) and at the same time a sensible value that will not require huge numerator values for each timestamp in a Block; we only have 16 bits there. As seen above, in most cases it doesn't work, because we have one global "clock" defining all Block (and Cluster and more) "ticks".
We could however alter the interpretation of each Block value to adjust to a better "clock" that works for that track, so that we end up with a better range of values for the numerator. And luckily we already have TrackTimestampScale! It's a float number to apply to each Block tick value to get the proper timestamp for that Block (or Track in general). It is currently marked as deprecated because its usage was limited, as it's a float, and it was supposed to allow changing timestamps without remuxing a track. But that's not convenient at all.
But just like we introduced a rational number to use instead of the TimestampScale, we can use a rational number instead of TrackTimestampScale, and store in TrackTimestampScale the rounded value of this rational number. Despite being deprecated, TrackTimestampScale is supported in at least the libavformat (FFmpeg) and VLC demuxers. It's possible it's not supported in a lot of demuxers, especially since it was marked as deprecated anyway; for example it's not supported in WebM. But that's less of a problem as they tend to add new elements when they need them.
So a Block timestamp would be
( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
The formula on the old website (and in the current RFC draft) is incorrect, as it applied the TrackTimestampScale to the Cluster tick as well. The VLC code seems to use it incorrectly (I can fix that) but libavformat seems to be correct. In both cases, adding support for sample accurate timestamps would mean fixing those as well.
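A minimal sketch of the two interpretations (illustrative function names, not spec text):

```python
# Correct formula: TrackTimestampScale applies only to the Block tick.
def block_ns_correct(block_tick, cluster_tick, tts, ts):
    return (block_tick * tts + cluster_tick) * ts

# Erroneous formula: TrackTimestampScale wrongly applied to the Cluster
# tick as well (as in the old website / RFC draft wording).
def block_ns_incorrect(block_tick, cluster_tick, tts, ts):
    return (block_tick + cluster_tick) * tts * ts

# With TrackTimestampScale == 1.0 the two agree; otherwise they diverge,
# which is why the bug went unnoticed for so long.
```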
In a new parser, TimestampScale and TrackTimestampScale would both be rational numbers.
In an old parser, TimestampScale would be the rounded nanosecond-based value and TrackTimestampScale the floating point value of the rational TrackTimestampScale. They would be less precise, but they were never meant to be precise anyway.
So let's take the previous example that didn't work: 29.97 fps video with the 44100 Hz audio. Now we can have TimestampScale * TrackTimestampScale = 1001/30000 for video and TimestampScale * TrackTimestampScale = 1/44100 for audio (or 40/44100 if samples are always packed by 40 but we don't even need that). We can represent 65536 ticks for each Track in a Cluster.
Now the critical part is the Cluster tick value. To have sample accurate values on each Block it also has to provide ticks that are sample accurate for both tracks. In this case a (rational) TimestampScale of 1/(30000 * 441) should do it. All ticks on the 1/44100 clock are represented (0, 300, 600, 900, 1200, etc.) on this clock. All ticks on the 1001/30000 clock are also represented (0, 1001 * 441, 2 * 1001 * 441, 3 * 1001 * 441, etc.) on this clock. In a 24 h movie that's 24 * 60 * 60 * 30000 * 441 ticks, which is still a small value (0x10A24668800 in hexadecimal) compared to the 64 bits of room we have for each Cluster Timestamp.
There is a slight problem though. The rounded TimestampScale would be 76 ns. Over 24 h the "old clock" would count 24 * 60 * 60 * 30000 * 441 * 76 ns, or 86,873.472 s, i.e. 24.13 h. That's a 0.548% error.
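These 24 h figures can be verified directly (a quick sketch):

```python
# Tick count of a 24 h movie on the 1/(30000*441) clock, and the drift
# a legacy parser would see with the rounded 76 ns TimestampScale.
ticks_24h = 24 * 60 * 60 * 30000 * 441
print(hex(ticks_24h))                    # tick count in hexadecimal

old_clock_ns = ticks_24h * 76            # legacy parser: 76 ns per tick
drift = old_clock_ns / 1e9 / (24 * 60 * 60) - 1
print(old_clock_ns / 1e9, drift)         # seconds seen, relative error
```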
In general the legacy system is used with a 1 ms precision, resulting in even more inaccurate values for the 33.366 ms video frame durations. So it shouldn't have any impact.
Now what is the magic formula to get the proper rational TimestampScale (TimestampNumerator and TimestampDenominator)? It looks like:
TimestampDenominator = SamplingFreq A Denominator * SamplingFreq V Denominator / GCD( SamplingFreq A Denominator, SamplingFreq V Denominator )
where the GCD() function gives the Greatest Common Divisor of A and V.
But that's not the value we used. Both 30000 and 441 are divisible by 3, so it should be 10000 * 441. That gives a legacy TimestampScale of 227 ns, which should give a smaller difference between the two systems.
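A sketch of this computation for the 44100 Hz audio / 30000-1001 fps video example (names are mine); it reproduces the corrected values:

```python
import math

A_DEN = 44100            # audio: period 1/44100 s
V_NUM, V_DEN = 1001, 30000  # video: period 1001/30000 s

# TimestampDenominator = DenA * DenV / GCD(DenA, DenV), i.e. their lcm
ts_den = A_DEN * V_DEN // math.gcd(A_DEN, V_DEN)
legacy_ts = round(10**9 / ts_den)        # rounded ns value for old parsers

tts_audio = ts_den // A_DEN              # rational TrackTimestampScale, audio
tts_video = V_NUM * ts_den // V_DEN      # rational TrackTimestampScale, video
print(ts_den, legacy_ts, tts_audio, tts_video)
```

With the actual GCD of 300 this yields 10000 * 441 = 4,410,000 directly, matching the corrected value above.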
With this value the rational TrackTimestampScale values would be 100/1 for audio and (1001 * 147)/1 for video, also stored as floating point values in the legacy field.
For audio the Block ticks would result in:
Block timestamp = ( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
Block timestamp = ( ( Block tick * 100/1 ) + Cluster tick ) * 1/(10000 * 441)
Block timestamp = ( Block tick * 100/1 ) * 1/(10000 * 441) + Cluster tick * 1/(10000 * 441)
Block timestamp = ( Block tick * 100 ) / (10000 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick / (100 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick / 44100 + Cluster tick / (10000 * 441)
For video the Block ticks would result in:
Block timestamp = ( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
Block timestamp = ( ( Block tick * 1001 * 147 ) + Cluster tick ) * 1/(10000 * 441)
Block timestamp = ( Block tick * 1001 * 147 ) * 1/(10000 * 441) + Cluster tick * 1/(10000 * 441)
Block timestamp = ( Block tick * 1001 * 147 ) / (10000 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick * 1001 / 30000 + Cluster tick / (10000 * 441)
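Both derivations can be checked with exact rational arithmetic (a sketch using Python's `fractions`; the function name is illustrative):

```python
from fractions import Fraction

TS = Fraction(1, 10000 * 441)            # rational TimestampScale

def block_ts(block_tick, tts, cluster_tick=0):
    # the corrected Block timestamp formula, in seconds
    return (block_tick * tts + cluster_tick) * TS

# audio: TrackTimestampScale = 100/1 -> one tick per 1/44100 s
assert block_ts(1, Fraction(100)) == Fraction(1, 44100)
# video: TrackTimestampScale = (1001*147)/1 -> one tick per 1001/30000 s
assert block_ts(1, Fraction(1001 * 147)) == Fraction(1001, 30000)
```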
It seems we have a system that works well for two tracks. It works just as well with more tracks, as long as the GCD of all the SamplingFreq Denominators is big enough, resulting in a rounded legacy TimestampScale that should be above 50 and MUST NOT be 1, let alone 0.
There is a small problem with the audio in the example above: we only get 65536/44100, about 1.49 s, possible per Cluster. But audio samples are usually packed by a fixed number of samples, or a variable number of samples with a common base number, or even a multiple of 4. That packing unit can be set as the numerator of the audio TrackTimestampScale, which would then be Packing Unit * 100 / 1. That multiplies the possible amount of audio per Cluster. Even a Packing Unit of 4 would give 5.9 s of audio samples per Cluster, which is good enough.
So what happens when using only the legacy values to compute the timestamps? In the example above, the TimestampScale is 227 ns, the audio TrackTimestampScale is 100.0f and the video TrackTimestampScale is 147147.0f.
The first audio ticks are represented like this:
Audio Tick | Real timestamp (ns) | Block Value | Timestamp (ns) | Difference |
---|---|---|---|---|
0 | 0.0 | 0 | 0 | 0.0 |
1 | 22675.7 | 1 | 22700 | -24.3 |
2 | 45351.5 | 2 | 45400 | -48.5 |
3 | 68027.2 | 3 | 68100 | -72.8 |
4 | 90702.9 | 4 | 90800 | -97.1 |
5 | 113378.7 | 5 | 113500 | -121.3 |
.. | .. | .. | .. | .. |
65533 | 1486009088.0 | 65463 | 1486010112 | -1024.0 |
65534 | 1486031744.0 | 65464 | 1486032768 | -1024.0 |
65535 | 1486054400.0 | 65465 | 1486055552 | -1152.0 |
The Block Value is the integer stored in the Block, computed from the real timestamp, the TimestampScale and the TrackTimestampScale.
The second timestamp is the one a parser would deduce from the Block Value, the TimestampScale and the TrackTimestampScale.
The difference between the deduced and real timestamps happens because the 227 ns TimestampScale is not an exact value.
In the end, over the whole Cluster, the difference is always less than 11392 ns. That's less than one audio tick.
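A sketch of this legacy-only round trip (illustrative names), matching the first rows of the table:

```python
# Round-trip a sample tick through the legacy fields only:
# TimestampScale = 227 ns, audio TrackTimestampScale = 100.0.
TS, TTS = 227, 100.0
RATE = 44100

def roundtrip(tick):
    real_ns = tick * 10**9 / RATE            # exact sample timestamp
    block = round(real_ns / (TS * TTS))      # value stored in the Block
    parsed_ns = block * TTS * TS             # what a legacy parser deduces
    return real_ns, block, parsed_ns

real, block, parsed = roundtrip(1)
print(block, parsed, parsed - real)          # small, sub-tick difference
```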
That means even without adding any element, just by reviving the floating point TrackTimestampScale, we could store sample accurate timestamps. We just need to apply the rules above to find the proper TimestampScale and TrackTimestampScale, as if they were handled as rational numbers.
We could however add the original clock in each track to give the reader an accurate way to round the values (i.e., get the values of the second column when the values of the fourth column are computed). We have the SamplingFrequency, but it's in floating point; we should also do the same for video tracks. So probably a rational value stored with the generic TrackEntry fields.
I did a test program to run different scenarios: all the audio/video sampling frequencies listed above mixed (1 audio/1 video). The program can be found here.
The result of the run of this program is found in this dirty Markdown file.
In some cases there are some rounding errors that can't be recovered. There are also many cases where the possible duration of audio in a Cluster is way too small. So I added some examples with common packing, and then the duration is much more usable. It also avoids the rounding errors (see "Audio 11025 Hz (128 packed)" for example).
The video errors are always negligible as they only occur after very long durations, durations that are impossible to reach given the duration constraints on the audio.
Maybe I can add an extra layer and try the common packing sizes mentioned by @rcombs. But from a first look it seems to solve both the limited duration of audio in a Cluster and the possible rounding errors.
After computing the TrackTimestampScale of each track as a floating point value, rather than an integer (to match what a rational value would be), we can counter the rounding error introduced by the small TimestampScale in the most tricky cases.
In the end the only errors (half a tick, so the wrong sample/tick would be assumed on the output of the demuxer) occur on video tracks, in rare cases and after a long duration in a Cluster (145 s minimum, which is a lot).
The only remaining problem is that the number of audio samples possible in a Cluster with such small TimestampScale values is small. Sometimes only 0.19 s is possible (16-bit ticks). That can be solved by packing samples to achieve a possible duration per Cluster of over 5 s (the commonly acceptable amount). For the cases where only 0.19 s is possible (352800 Hz), packing by at least 263 samples should be sufficient. In most cases even packing 10 samples is sufficient.
In the end the packing problem is directly related to the sampling frequency of the audio. This problem exists regardless of the sample accuracy of timestamps. A high sampling frequency requires enough packing of samples to fit a useful duration in a Cluster.
This problem aside, we can always use TrackTimestampScale with a rounded TimestampScale (based on audio denominator * video denominator / GCD) to achieve sample accuracy.
Mixing more than one audio track might cause some problems if the sampling frequencies differ too much (don't fit the GCD). But for 2 tracks it's achievable all the time.
I made a small calculation error in my tests: the original sample frequency numerator was not used to compute the real timestamp. With examples where the numerator was artificially inflated (1/24 = 1000/24000 = 2000/48000) to try to match audio ranges, it gave an incorrect error. In fact, in all cases there is no error on the audio or video tracks.
The TrackTimestampScale countering the rounding of the TimestampScale is so efficient that it works even with a TimestampScale of 1 ns in all tests. This means it also works regardless of the number of tracks and their sampling frequencies. It could even be used for frequencies higher than a GHz (< 1 ns period).
So the real problem left is the amount of audio possible per Cluster. As said before, a high sampling frequency requires enough packing of samples to fit a useful duration in a Cluster. This is not a new problem, and the TrackTimestampScale has no effect on it (I double checked). There are 65536 ticks per Cluster per Track possible (always with the proper TrackTimestampScale, which now always has an optimum range).
Audio codecs usually pack a fixed number of samples per frame (or a few possible fixed values that may change within the same stream). Raw audio can do the same. In this case we can compute the TrackTimestampScale based on the "packing frequency" rather than the sampling frequency. For example the "packing frequency" of 10 samples packed at 44100 Hz would be 4410 Hz, allowing 10x more duration per Cluster. In other words, each tick is worth 10x more duration than without packing.
If we consider that it should always be possible to store at least 5 s of audio per Cluster, then the problem starts at frequencies higher than 13107 Hz (65536 ticks / 5 s). That's pretty much all the time.
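A quick way to see the packing trade-off (a sketch; the names are mine):

```python
# Duration a Cluster can span with 16-bit Block ticks, with and
# without packing ("packing frequency" = sample_rate / pack).
MAX_TICKS = 65536

def cluster_seconds(sample_rate, pack=1):
    # one tick per packed frame of `pack` samples
    return MAX_TICKS * pack / sample_rate

print(cluster_seconds(44100))        # unpacked: under 1.5 s
print(cluster_seconds(44100, 10))    # packed by 10: almost 15 s
print(MAX_TICKS / 5)                 # the 5 s threshold frequency, in Hz
```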
With packing we don't get the timestamp of each individual sample; we only get the timestamp of the first sample of each pack. But since we know the sampling frequency of the audio (the SamplingFrequency element) we can tell the exact timestamp of all the other samples.
For Variable Framerate video (I suppose it's rare for audio) there is another problem: there isn't one frequency. For example there might be film source (24 fps) and NTSC video source (29.97 fps) mixed in the same Segment. Or video captures that are sometimes at 60 fps, sometimes at 144 fps, and sometimes at values that just occur when they can (if the game is able to control the V-Sync directly).
I suppose other containers will also have a hard time giving an exact timestamp value for each frame.
In the case of 2 fixed sources mixed together, it should be possible to accommodate the TrackTimestampScale to both using the rational fractions. It will reduce the duration possible for that track. But even for 121 and 123 fps that gives a mixed frequency of 14883 Hz (with ticks from one or the other falling on exact ticks of this clock). That's more than 13107 Hz, which means we can't store 5 s in a Cluster, but it's pretty close (4.4 s).
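The mixed clock is simply the least common multiple of the two rates (a sketch):

```python
import math

# Common clock for two fixed frame rates mixed on one track:
# lcm(a, b) = a * b / gcd(a, b)
def mixed_clock_hz(a, b):
    return a * b // math.gcd(a, b)

clock = mixed_clock_hz(121, 123)
print(clock, 65536 / clock)    # mixed frequency and seconds per Cluster
```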
For too many or too heterogeneous sources there's not really a good solution. But these sources are doomed to never have accurate timestamps anyway. In that case a resolution of 0.1 ms (10000 Hz) should give a good estimate and enough duration (6.5 s) per Cluster.
Given all this I think #437 is a good all-around solution. It may not even require storing the exact fraction of the original (although it's probably needed to remux into other containers).
TrackTimestampScale has been in the Matroska specs forever and is supposed to be used by demuxers, so extending it in newer versions of Matroska should be a no-brainer. Unfortunately there are high chances that it's not used properly. Since no one has really used it so far (AFAIK), it's usually assumed to be 1.0 and discarded, all the math on timestamps being done with integers.
The proposed solution radically changes that. Almost all the time the TrackTimestampScale has a value very far from 1.0 (up to 10416667 in the frequencies I tested). For all parsers not using the TrackTimestampScale, only the first timestamp of a Cluster will be usable (tick 0); the rest will look very odd (usually way too small values). It should always be possible to adjust the TimestampScale so that one track has a TrackTimestampScale of 1.0; it should be the track with the highest "packing frequency". All other tracks will be almost unusable to a non-conformant parser, but at least one track will be usable (most likely the best audio track).
I think libavformat and the libmatroska based demuxers (including VLC) should handle this properly. That already covers a lot of players, demuxers, muxers.
TrackTimestampScale (formerly known as TrackTimecodeScale) is not part of WebM, so parsers exclusively dealing with WebM (the Firefox one; dunno about Chromium) may have issues with this.
Most TV/streaming boxes are probably not using libavformat or libmatroska so I'm not sure they handle this properly either.
The fact that each TrackTimestampScale should be computed using the "packing frequency" of the track will also add some friction. Matroska has always been codec agnostic, i.e. it doesn't need to know anything about a codec to mux it (although it does store information about the codec). Now to mux "accurately" we need to know, before writing/having any frame, how many samples will be found in each codec frame. Most codecs have a fixed value so it won't be too hard. But modern codecs have many window sizes. It can become tricky to know exactly what to use; in some cases there might not even be a common factor. In that case the "packing frequency" = "sampling frequency" and we can't store a lot of samples per Cluster, or we just give up on sample accuracy.
So I think for each (audio) codec we should mention the number of samples per frame that can be used safely (sampling frequency / that number = packing frequency). That should be done in the codec specs.
There is also the question of Cues and Chapters. Their timestamps are stored in "absolute" values (which means in nanoseconds). Introducing the TrackTimestampScale as an "unrounded" floating point value means the actual timestamp of a frame is not an exact multiple of the global TimestampScale anymore.
So Cues/Chapters referencing a particular frame (or audio block) should use the same value that will come out of the demuxer. The value is not always exactly the value that was written, but the error is small enough not to mistake it for another sample. Since demuxers/players will compare values when seeking, it's better to match exactly the value that will be read from the file.
Using packed audio with a factor on the TrackTimestampScale also means it won't be possible to reference (audio) samples individually. The granularity will depend on the factor applied to the TrackTimestampScale. For example, for samples packed by 40, the factor applied may be 2, 4, 5, 8, 10, 20, or 40. That allows more or less duration per Cluster, and more or less Cue precision inside a frame of a Block.
Actually once you have the timestamp of the first sample, you don't need to know the rest of the timing in the packed audio. It's outside of the container level that the right sample will be picked.
The timestamps in Cues/Chapters can either use the real timestamp (in nanoseconds) of the sample, or they could be shifted the same way the timestamp of the first sample in the Block is shifted. The difference is a rounding error that in the end will resolve to the same referenced sample. Apart from this, the container doesn't use the values for exact comparison, so it will have no impact there. I think it's better to use the real sample timestamp in that case.