rodio Multiple tracks within a container

Containers like Ogg and MP4 can contain multiple tracks. Somewhat similarly, Ogg streams can be "chained" into a single file, so beyond multiple tracks in one container, you then have multiple containers, each one with one or more tracks. Before continuing with more decoder work, I would like to define how we will deal with that.

Current behavior seems emergent - which is a nice word for saying "not by design" 😄 Though I did not test any of it yet, from a code review I am fairly confident the following is true today:

Symphonia

Track selection: For containers with multiple tracks, the Symphonia decoder will load the default track. This will usually be the first track in the container, but does not have to be. After finishing this default track, it will proceed onto the next track in the container, if any. This will succeed if and only if the decoder specification (codec, channels, sample rate) is equal to the default track, in which case the iterator will continue. If the specification is different, no attempt is made to reset the decoder.

total_duration: For the default track only. So in the case of multiple tracks with the same spec, the actual playback time may exceed the reported total_duration.

Seeking: Hard-coded to work with the default track. Calling try_seek while a track beyond the default track is playing will make the demuxer reload the default track and seek in it.

Saturating seeks: Rodio's logic for saturating seeks will saturate on the duration of the default track. So if you've got three 5 minute tracks and you seek to 7'03'' then in effect you will seek to the end of the default track == the start of the second track.

Chained streams: As with tracks, but now for the demuxer instead of the decoder: will succeed if and only if the specification is equal. If the specification is different, no attempt is made to recreate the decoder.
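To make the "if and only if the specification is equal" rule concrete, here is a minimal sketch of the check as I read it, for both track changes and chained streams. TrackSpec, Transition and on_next_track are made-up names for illustration; they are not Symphonia or Rodio types.

```rust
/// Illustrative stand-in for the "decoder specification" described above.
/// Hypothetical type, not part of Symphonia or Rodio.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TrackSpec {
    pub codec: String,
    pub channels: u16,
    pub sample_rate: u32,
}

/// Outcome when the current track (or chained stream) ends and another follows.
#[derive(Debug, PartialEq, Eq)]
pub enum Transition {
    /// Same spec: the existing decoder can keep producing samples.
    Continue,
    /// Different spec: the decoder would have to be recreated,
    /// which the current code does not attempt.
    NeedsReset,
}

/// Compare the spec of the track that just finished with the next one.
pub fn on_next_track(current: &TrackSpec, next: &TrackSpec) -> Transition {
    if current == next {
        Transition::Continue
    } else {
        Transition::NeedsReset
    }
}
```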

lewton

Track selection: lewton (and ogg underneath it) does not provide high-level functions for determining track numbering, sequencing, or defaults. It will play the first track it encounters. After that, when there are more tracks, it is not certain but likely to stop (depending on whether the packet that comes next contains only headers or also samples).

total_duration: Not available. May be possible by parsing Ogg headers manually, but I don't intend to go there (just use Symphonia).

Seeking: Currently not implemented. Could probably be re-implemented with coarse seeking, in which case it would be relative to the entire container. As the number of tracks increases, so will the headers in between, and the seeking error will increase because you have to seek a number of pages and you don't know how many pages of headers you have. So if you seek to 7'03'' (with the same three 5 minute tracks) you will land somewhere before 7'03'' (so somewhat short of 2'03'' into the second track).

Saturating seeks: Not available, because no total_duration.

Chained streams: As with tracks.

Proposal

First:

Implement DecoderBuilder::with_track_id to optionally specify a track to load
Change the decoders to play a single track only by default
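Roughly what I have in mind for the first step, as a standalone sketch. TrackSettings and select are hypothetical names that only mimic the shape of the proposed with_track_id option; this is not Rodio's actual DecoderBuilder.

```rust
/// Hypothetical settings object mirroring the first step of the proposal.
/// It only models the track choice, nothing else a real builder would hold.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct TrackSettings {
    /// `None` means "use whatever the demuxer considers the default track".
    track_id: Option<u32>,
}

impl TrackSettings {
    pub fn new() -> Self {
        Self::default()
    }

    /// Optionally pin playback to one specific track.
    pub fn with_track_id(mut self, id: u32) -> Self {
        self.track_id = Some(id);
        self
    }

    /// Resolve the single track to play, given the ids found in the
    /// container and the demuxer's default. Returns `None` when the
    /// chosen track is not present.
    pub fn select(&self, available: &[u32], default_id: u32) -> Option<u32> {
        let wanted = self.track_id.unwrap_or(default_id);
        available.contains(&wanted).then_some(wanted)
    }
}
```

Either way exactly one track is selected, which is the "single track only" default from the list above.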

Then:

Implement DecoderBuilder::with_multi_track to:

enable multi-track processing mode
report total_duration (if available) for all tracks
seek relative to the entire container
saturate seeks relative to the entire duration
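A rough sketch of what "seek relative to the entire container" and "saturate relative to the entire duration" could mean in practice, assuming the per-track durations are known up front. The function names are hypothetical and not part of Rodio.

```rust
use std::time::Duration;

/// Total duration across all tracks, or `None` as soon as one track's
/// duration is unknown (the "if available" caveat).
pub fn total_duration(tracks: &[Option<Duration>]) -> Option<Duration> {
    let mut total = Duration::ZERO;
    for duration in tracks {
        total += (*duration)?;
    }
    Some(total)
}

/// Map a container-relative position to (track index, offset in that track).
/// Positions past the end saturate to the end of the last track.
/// Returns `None` only for a container without tracks.
pub fn locate(tracks: &[Duration], pos: Duration) -> Option<(usize, Duration)> {
    let mut start = Duration::ZERO;
    for (index, len) in tracks.iter().enumerate() {
        let end = start + *len;
        if pos < end {
            return Some((index, pos - start));
        }
        start = end;
    }
    // Past the end of the container: saturate on the entire duration.
    tracks.len().checked_sub(1).map(|last| (last, tracks[last]))
}
```

With three 5 minute tracks, locate maps 7'03'' to index 1 with an offset of 2'03'', and anything past 15'00'' to the end of the third track.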

This last step will not be perfect for chained streams, because you cannot determine total_duration until you have read each stream. That is neither desirable nor possible, given that chained streams are literally designed for streaming. Such streams should not allow seeking anyway, so best to have the user call with_seekable(false) on them and forget about total_duration.

Mar 06 '25 14:03 roderickvd

Current behavior seems emergent - which is a nice word for saying "not by design" 😄

made me chuckle, best use of "emergent" I have ever seen 😆. Also correct, tracks have been completely ignored until now (thank you!).

Such streams should not allow seeking anyway

I will not pretend I understand chained streams. But I do not see why they should not allow seeking? In a live-stream use case, seeking back to hear something again and then forward to catch up to the live bit is common, right? Again, I could completely misunderstand what chained streams are :)

sneaky edit: Oh and I like the approach of working in multiple phases, first implementing the basic case then merging and working on the next 👍. I should learn from that 😅 (span PR is up to 2.5k now, no need for nightmares yet however, we should be able to split the review work up).

Mar 06 '25 16:03 yara-blue

Such streams should not allow seeking anyway

I will not pretend I understand chained streams. But I do not see why they should not allow seeking? In a live-stream use case, seeking back to hear something again and then forward to catch up to the live bit is common, right? Again, I could completely misunderstand what chained streams are :)

You're right, I'd say, theoretically. Assuming no-one would use an unbounded file or memory store, you could implement it with a ring buffer. Which in turn requires a limited "seek" window, which brings its own not-so-trivial problems in determining what's the currently seekable region.
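For illustration, a minimal sketch of such a bounded buffer and its seekable region, in byte offsets only. SeekWindow is a made-up type, not taken from Rodio or any existing crate, and translating a playback time into a byte offset is a separate problem that it does not address.

```rust
use std::collections::VecDeque;
use std::ops::Range;

/// Bounded window over an unbounded stream (hypothetical type).
pub struct SeekWindow {
    buf: VecDeque<u8>,
    capacity: usize,
    /// Absolute stream offset of the oldest byte still held in `buf`.
    start: u64,
}

impl SeekWindow {
    /// Create a window holding at most `capacity` bytes (must be non-zero).
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "the window needs room for at least one byte");
        Self { buf: VecDeque::with_capacity(capacity), capacity, start: 0 }
    }

    /// Append newly received bytes, evicting the oldest ones when full.
    pub fn push(&mut self, data: &[u8]) {
        for &byte in data {
            if self.buf.len() == self.capacity {
                self.buf.pop_front();
                self.start += 1;
            }
            self.buf.push_back(byte);
        }
    }

    /// Absolute byte offsets that can currently be served from the window.
    pub fn seekable(&self) -> Range<u64> {
        self.start..self.start + self.buf.len() as u64
    }

    /// Read one byte at an absolute offset, if it is still in the window.
    pub fn get(&self, offset: u64) -> Option<u8> {
        let index = offset.checked_sub(self.start)?;
        self.buf.get(index as usize).copied()
    }
}
```

The seekable range moves forward as data arrives, so a seek target that was valid a moment ago may already have been evicted.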

As you've noticed, my experience is with streaming music: Spotify, Deezer, live radio. And in that experience, for live streams, I've always seen the UI disable seeking. Note that we're talking live streams: normal "streaming" is a different thing, namely chunked downloads with a read-ahead buffer.

Oh and I like the approach of working in multiple phases, first implementing the basic case then merging and working on the next 👍

Thanks. What do you think about the proposed implementation?

Mar 06 '25 17:03 roderickvd

You're right, I'd say, theoretically. Assuming no-one would use an unbounded file or memory store, you could implement it with a ring buffer. Which in turn requires a limited "seek" window, which brings its own not-so-trivial problems in determining what's the currently seekable region.

I'm very familiar with those problems due to my work on https://github.com/dvdsk/stream-owl (wip). It is not easy but very doable.

Note that we're talking live streams: normal "streaming"

Are those not based on http range requests? (I've only seen those but then I'm only building a podcast app)

Thanks. What do you think about the proposed implementation?

I've looked at it from 2 different user stories:

A music player app with a random playlist: it wants to query how many tracks there are, get their titles, and then maybe play a single one before moving to another file. Keeping the decoder loaded in case we need the next track makes no sense. with_track_id is enough. One could discuss whether querying the file for track IDs is within rodio's scope. It might be nice to have, but I am okay with it missing.

An audiobook app: one book is usually one file, with each chapter a track. Usually you want to play chapter by chapter, but if you have lost your place you might want to scrub through the entire book (seeking across the entire file). This is well served by with_multi_track.

I cannot think of a radically different use case. These are well served by your design.

Mar 06 '25 21:03 yara-blue

You're right, I'd say, theoretically. Assuming no-one would use an unbounded file or memory store, you could implement it with a ring buffer. Which in turn requires a limited "seek" window, which brings its own not-so-trivial problems in determining what's the currently seekable region.

I'm very familiar with those problems due to my work on https://github.com/dvdsk/stream-owl (wip). It is not easy but very doable.

Cool. You may be interested in experimental RFC 8673.

Note that we're talking live streams: normal "streaming"

Are those not based on http range requests? (I've only seen those but then I'm only building a podcast app)

Just to be sure, as you're quoting half of my sentence: I intended to distinguish between:

a. continuously appended streams (like live radio)
b. what society nowadays calls "streaming" (like listening to music from Spotify)

Much like that RFC describes, both are principally based on HTTP range requests. Clearly that's the case for point b, and all is good: it's got a defined content length and you can query away. For point a, it's more akin to reading stdin for whatever comes your way.

But for point a, if you request byte 5000-6000 from a radio station that's been streaming since 2001, and you tuned in an hour ago, what are you going to get? You need to buffer it yourself. And if it's variable bitrate, then what will be the timestamp of byte 5000?

Anyway, we digress.

Thanks. What do you think about the proposed implementation?

I've looked at it from 2 different user stories:

A music player app with a random playlist: it wants to query how many tracks there are, get their titles, and then maybe play a single one before moving to another file. Keeping the decoder loaded in case we need the next track makes no sense. with_track_id is enough. One could discuss whether querying the file for track IDs is within rodio's scope. It might be nice to have, but I am okay with it missing.

An audiobook app: one book is usually one file, with each chapter a track. Usually you want to play chapter by chapter, but if you have lost your place you might want to scrub through the entire book (seeking across the entire file). This is well served by with_multi_track.

I cannot think of a radically different use case. These are well served by your design.

I'm on the same wavelength.

w.r.t. the track IDs, maybe instead of using the "real" IDs (which could be 123123, 34534, 2342) we could have Rodio use IDs 1, 2, 3 for the order they are in the container.
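A small sketch of what that numbering could look like, with 1-based ordinals in container order. TrackOrder is a hypothetical helper, not an existing Rodio type, and it assumes the "real" IDs have already been read from the container.

```rust
/// Hypothetical mapping between container order and "real" track ids.
pub struct TrackOrder {
    /// Real ids, in the order the tracks appear in the container.
    real_ids: Vec<u32>,
}

impl TrackOrder {
    pub fn new(real_ids: Vec<u32>) -> Self {
        Self { real_ids }
    }

    /// Translate a 1-based ordinal (1, 2, 3, ...) into the real id.
    pub fn real_id(&self, ordinal: usize) -> Option<u32> {
        let index = ordinal.checked_sub(1)?;
        self.real_ids.get(index).copied()
    }

    /// Translate a real id back into its 1-based ordinal.
    pub fn ordinal(&self, real_id: u32) -> Option<usize> {
        self.real_ids
            .iter()
            .position(|&id| id == real_id)
            .map(|index| index + 1)
    }
}
```

With the IDs from the example, ordinal 1 maps to 123123 and real ID 34534 maps back to ordinal 2.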

My gut tells me we shouldn't add a fn num_tracks to the Decoder trait when it'd only work on Symphonia and arguably be an edge case to boot.

Mar 06 '25 21:03 roderickvd

w.r.t. the track IDs, maybe instead of using the "real" IDs (which could be 123123, 34534, 2342) we could have Rodio use IDs 1, 2, 3 for the order they are in the container.

That would make it easier to play all the content in order, or a random piece of it. But most use cases will want to know the titles/metadata of the tracks. If they know they have the metadata for part 34534, then having to map that to a Rodio ID could be annoying. So I would say let's have Rodio IDs if we also extract metadata, otherwise let's use the "real" IDs so any other metadata crate can easily be used with this.

But for point a, if you request byte 5000-6000 from a radio station that's been streaming since 2001, and you tuned in an hour ago, what are you going to get? You need to buffer it yourself. And if it's variable bitrate, then what will be the timestamp of byte 5000?

You do not request byte 5000, you forward read requests from the decoder as range requests. It's the decoder that will keep track of the bit rate. If the server refuses your range requests, that's an IO error. If the server wants to stream audio without providing the parts of the file that describe sample-rate/channel count it better specify those and keep em constant.

Anyway, we digress.

I could not resist :), but let's handle such servers when we get the issue.

Mar 07 '25 15:03 yara-blue

w.r.t. the track IDs, maybe instead of using the "real" IDs (which could be 123123, 34534, 2342) we could have Rodio use IDs 1, 2, 3 for the order they are in the container.

That would make it easier to play all the content in order, or a random piece of it. But most use cases will want to know the titles/metadata of the tracks. If they know they have the metadata for part 34534, then having to map that to a Rodio ID could be annoying. So I would say let's have Rodio IDs if we also extract metadata, otherwise let's use the "real" IDs so any other metadata crate can easily be used with this.

If I understand you correctly, you're saying to either:

a. expose an interface to query "real" track IDs (which leads to exposing metadata, generally), then map those to Rodio IDs
b. have no Rodio IDs or metadata API, assuming the user knows the "real" track IDs

I'm struggling a bit to wrap my head around that.

First, I'm not sure if I follow the logic of point a. Why would it hurt to count 0..n as "tracks in the order they physically appear in the container, from start to end" without exposing further metadata?

Second, I feel like any additions we make on metadata are unbalanced between the Symphonia and other decoders. Not something I can really substantiate, just a feeling where it makes the other decoders real second-class citizens to the point that Rodio is starting to favor Symphonia.

If the server wants to stream audio without providing the parts of the file that describe sample-rate/channel count it better specify those and keep em constant.

...and that's precisely what you cannot count on with chained streams, and why Symphonia can emit ResetRequired not only on decoder level but even on demuxer level (format in current Rodio lingo).

Circling back to the start of where this piece of the discussion started: with chained streams, you cannot determine total_duration or saturate seeks. You cannot know until you've consumed the end of the stream.

Mar 09 '25 20:03 roderickvd

First, I'm not sure if I follow the logic of point a. Why would it hurt to count 0..n as "tracks in the order they physically appear in the container, from start to end" without exposing further metadata?

I thought it might get confusing if the metadata (queried via some other crate) does not appear in that same order. The user might then need to set up a lookup table to translate them.

Second, I feel like any additions we make on metadata are unbalanced between the Symphonia and other decoders. Not something I can really substantiate, just a feeling where it makes the other decoders real second-class citizens to the point that Rodio is starting to favor Symphonia.

Agreed, it is also out of the scope of Rodio.

Circling back to the start of where this piece of the discussion started: with chained streams, you cannot determine total_duration or saturate seeks. You cannot know until you've consumed the end of the stream.

Ahh, I misunderstood: I thought "no saturating seeks" meant no seeking at all. I think we are on the same page now 👍

Mar 09 '25 21:03 yara-blue

#786 lays all the important groundwork for this. It will play an audiobook or chained stream from A to Z, starting from the first track. Seeking is within the current active track. I think this is also the most intuitive (keeping the audiobook as mental image).

What we may still want to add is a way to choose the starting track, and an option to either continue to the next track in a multi-track container or stop playback. That way we evade difficult ergonomics about "how do we expose an API to skip back or forward". A user could simply add multiple decoders to a queue whichever way they want to.

Aug 24 '25 01:08 roderickvd