Media Reference Stream/Channel Specification support
Currently media references are focused on locating specific assets and identifying timing metadata for them.
When OTIO is used to drive a system that will concretely pull together media data, it will likely be necessary to identify a specific stream within a file, or even a specific channel within an audio stream.
Would this be a different media reference entirely? Or a property of the `media_reference.External` (in addition to the `target_url`)?
Also, how are those streams typically addressed? By an integer? By a string? Is there a convention?
I guess there is one question about whether only `media_reference.External` refers to a specific asset and the other media reference types are meant to more generically refer to a concept of a piece of media?
My inclination is that any `media_reference.External` should always be considered to refer to a concrete asset somewhere (like a QuickTime file or a set of OpenEXR frames). So adding some information about how to select within that asset seems within the scope of `media_reference.External`. It's worth considering that the proposed generators may be up for discussion as being multi-stream as well, for things like bars and tone. So there may be something like a multi-stream base type that `GeneratorReference` and `External` would derive from.
I've been working with a hypothetical use case for this to help me think about the design. Perhaps an editor uses synced dailies QuickTimes to create an edit. The audio tracks and video tracks would often share the same source asset on disk while actually using separate streams. A conform tool might want to take the exported OTIO, inspect the media references and replace them with references to the original EXR frames and source AIFF files on disk based on the combo of asset metadata and stream used. As it is now, I think this would have to be written in a way that makes assumptions based on track kind, but that doesn't feel correct to me.
In terms of addressing streams, ffmpeg uses zero-based integers; I can't speak to other formats. Will need to research.
Some other folks we talked to at SIGGRAPH were interested in this too. I'll look in my notes to see who it was.
My main concern is that code that is naive about streams or channels should still be able to reason about media references. For example, if you want a list of all the media used, or you want to validate that all the references point to valid files/urls, then you should be able to just consider the `target_url` without needing to know about the internals of the thing it points to.
If you look at AAF and FCP XML they both have this concept. We should use their models as guidance on how this works.
I guess another alternative is encoding this in the `target_url` string somehow. If there is a common schema for that, it could work, since either way it isn't something that OTIO itself really cares about; it is for whatever is consuming the media.
For reference, discussion thread #1009 touches on this issue.
I've been thinking about this problem a bit the past couple of days and have a couple of ideas. This one is based on how you can map audio channels in Resolve. Resolve has a very flexible GUI that lets you add, map, and reorder audio channels pretty much however you wish.
A media reference could contain a list of streams and a parameter that selects/maps which streams and channels to use, and in what order. The select/mapping parameter would be a list of stream names and channel indexes: `[("stream_name", channel_index), ("stream_name", channel_index), ...]`
For example, this would create a mapping of 2 channels: channel 0 from A3 and channel 2 from A1.
```
Audio Stream Mapping          File Streams

                              [{ "name" : "V1", "channels" : 3, "kind" : "video", "metadata" : {} },
                               { "name" : "V2", "channels" : 1, "kind" : "video", "metadata" : {} },
[('A3', 0), ('A1', 2)] ---->   { "name" : "A1", "channels" : 3, "kind" : "audio", "metadata" : {} },
                               { "name" : "A2", "channels" : 2, "kind" : "audio", "metadata" : {} },
                               { "name" : "A3", "channels" : 1, "kind" : "audio", "metadata" : {} }]
```
I've been mainly thinking of this for audio, but this could also work for video formats and for EXR/PSD/TIFF layers too. This system could allow you to select/reorder video channels. For example, this would add a channel from a second stream as a fourth channel:
```
Video Stream Mapping          File Streams

                              [{ "name" : "V1", "channels" : 3, "kind" : "video", "metadata" : {} },
                               { "name" : "V2", "channels" : 1, "kind" : "video", "metadata" : {} },
['V1', ('V2', 0)]      ---->   { "name" : "A1", "channels" : 3, "kind" : "audio", "metadata" : {} },
                               { "name" : "A2", "channels" : 2, "kind" : "audio", "metadata" : {} },
                               { "name" : "A3", "channels" : 1, "kind" : "audio", "metadata" : {} }]
```
Here 'V1' could be used as shorthand to mean "use all the channels from that stream". For formats like EXR, stream names could represent layer names. Most editors and adapter formats probably don't support this fancy video channel shuffling.
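As a sketch of how a consumer might normalize that shorthand, assuming this proposal's data shapes (the `expand_mapping` helper itself is hypothetical, not part of the proposal):

```python
# Hypothetical helper: expand a mapping entry into explicit
# (stream_name, channel_index) pairs, treating a bare stream name
# like "V1" as "all channels of that stream".
def expand_mapping(mapping, streams):
    by_name = {s["name"]: s for s in streams}
    expanded = []
    for entry in mapping:
        if isinstance(entry, str):  # shorthand: the whole stream
            expanded.extend(
                (entry, ch) for ch in range(by_name[entry]["channels"])
            )
        else:  # explicit (stream_name, channel_index) pair
            expanded.append(tuple(entry))
    return expanded

streams = [
    {"name": "V1", "channels": 3, "kind": "video"},
    {"name": "V2", "channels": 1, "kind": "video"},
]
print(expand_mapping(["V1", ("V2", 0)], streams))
# -> [('V1', 0), ('V1', 1), ('V1', 2), ('V2', 0)]
```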
The JSON could look something like this:
```json
{
    "OTIO_SCHEMA": "ExternalReference.1",
    "streams": [
        {"OTIO_SCHEMA": "StreamDescriptor.1", "name": "V1", "channels": 3, "kind": "video", "metadata": {}},
        {"OTIO_SCHEMA": "StreamDescriptor.1", "name": "V2", "channels": 1, "kind": "video", "metadata": {}},
        {"OTIO_SCHEMA": "StreamDescriptor.1", "name": "A1", "channels": 3, "kind": "audio", "metadata": {}},
        {"OTIO_SCHEMA": "StreamDescriptor.1", "name": "A2", "channels": 2, "kind": "audio", "metadata": {}},
        {"OTIO_SCHEMA": "StreamDescriptor.1", "name": "A3", "channels": 1, "kind": "audio", "metadata": {}}
    ],
    "stream_mapping": [["A3", 0], ["A1", 2]],
    "target_url": "file://path/to/file.mov"
}
```
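To illustrate how a consuming app might resolve such a mapping, here is a minimal Python sketch. Since `streams`/`stream_mapping` are only proposed fields, they are carried in the reference's `metadata` dict here purely for illustration:

```python
import opentimelineio as otio

# The "streams"/"stream_mapping" fields are only proposed; stash them
# in metadata to sketch the consumer-side resolution logic.
ref = otio.schema.ExternalReference(target_url="file://path/to/file.mov")
ref.metadata["streams"] = [
    {"name": "A1", "channels": 3, "kind": "audio"},
    {"name": "A3", "channels": 1, "kind": "audio"},
]
ref.metadata["stream_mapping"] = [["A3", 0], ["A1", 2]]

descriptors = {s["name"]: s for s in ref.metadata["streams"]}
for name, channel in ref.metadata["stream_mapping"]:
    stream = descriptors[name]
    assert channel < stream["channels"], "channel index out of range"
    print("use", stream["kind"], "stream", name, "channel", channel)
# use audio stream A3 channel 0
# use audio stream A1 channel 2
```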
Nothing here guarantees that the `target_url` file actually contains the streams. This also doesn't describe what the output channels actually represent; for example, is the `channel_layout` stereo or surround? The channel layout, along with how many audio channels need to be mapped, could be stored on a parent object like a `Track`. Checks could be in place to verify that children satisfy their parent's media kind and channel counts. These parameters could all be optional too.
Whether the names of the streams and their order matter most likely depends on the file format being referenced. This wouldn't allow mapping channels between multiple files (like combining stereo from 2 separate mono files), but that use case is probably out of scope.
How well this maps to the adapters would need some more research. For AAF I can see it solving a bunch of issues with referencing multitrack media. Perhaps this is a bit too complex for OTIO, but I figured I would suggest it anyway.
With the heavy caveat that this is a bit out of my area of expertise, I'd say this sounds pretty clean. I think I'd push the schema descriptor up a scope to the "streams" attribute:
```json
{
    "OTIO_SCHEMA": "ExternalReference.1",
    "stream_map": {
        "OTIO_SCHEMA": "StreamMapping.1",
        "active_mapping": [
            ["A3", 0],
            ["A1", 2]
        ],
        "streams": [
            {"name": "V1", "channels": 3, "kind": "video", "metadata": {}},
            {"name": "V2", "channels": 1, "kind": "video", "metadata": {}},
            {"name": "A1", "channels": 3, "kind": "audio", "metadata": {}},
            {"name": "A2", "channels": 2, "kind": "audio", "metadata": {}},
            {"name": "A3", "channels": 1, "kind": "audio", "metadata": {}}
        ]
    },
    "target_url": "file://path/to/file.mov"
}
```
Is there a sane default behavior we could define for cases when the stream mapping is null? (as would be the case for all existing OTIO files)?
I like moving the schema up better too! For OTIO files without stream mapping, if it is not already defined somewhere, I would propose the default behavior for an app being something like: an app should default to the first stream in the externally referenced media that matches the media kind defined by the nearest parent of the external reference object. If none of its parents define a kind, default to video.
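A minimal sketch of that proposed default in Python, using the existing `parent()` and `Track.kind` API (the `default_stream_kind` helper itself is hypothetical, not existing OTIO behavior):

```python
import opentimelineio as otio

def default_stream_kind(item):
    """Sketch of the proposed default: use the kind of the nearest
    parent Track; fall back to video if no parent defines a kind."""
    parent = item.parent()
    while parent is not None:
        if isinstance(parent, otio.schema.Track) and parent.kind:
            return parent.kind
        parent = parent.parent()
    return otio.schema.TrackKind.Video

track = otio.schema.Track(kind=otio.schema.TrackKind.Audio)
clip = otio.schema.Clip(name="c1")
track.append(clip)
print(default_stream_kind(clip))  # "Audio" -> pick the first audio stream
```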
> An app should default to the first stream in the externally referenced media that matches the media kind defined by the nearest parent of the external reference object. If none of its parents define a kind, default to video.
That sounds like a reasonable starting point!
It would be a good topic for a future TSC meeting. Generally, it could be a good WG subject if folks feel like it is an important addition to OTIO. I'd love to get additional vendor feedback on this, for example, and find some good domain experts to weigh in.
Dumb question: do stream specifications need any kind of additional temporal transformation information? I.e. does changing the active stream imply anything about the `available_range`?
A couple of months ago I commented on the associated discussion thread for this issue (https://github.com/AcademySoftwareFoundation/OpenTimelineIO/discussions/1009) and I'm happy to see some more activity on this topic. The company I work for (Vizrt) is interested in OTIO, and this issue is likely the main limiting factor in being able to adopt OTIO for timeline integration, since a large majority of content we deal with is MXF OP1A video with multiple audio tracks. We would happily assist in trying to drive this topic forward towards a solution, both myself and a few colleagues of mine.
- Is this in any way actively worked on and/or discussed beyond what is visible here on GitHub? I've seen it mentioned in a few places that this was planned to be brought up in a TSC meeting; has that been done, or is it confirmed that it will happen?
- Is the design settled on that it should be something on the ExternalMediaReference level, or is that still an open topic and the solution could end up on Clip or Item level (similar to `source_range`, which is a different type of "media selection")?
- In what ways can we help make this feature a reality? We can offer development assistance, participation in design discussions, testing, etc. if someone more involved in the project can guide us a bit.
From my perspective the important first step here is trying to reach a solid design for this that the OTIO project approves of. This would very much be a core feature in the OTIO data model so I definitely understand it isn't a design decision that's taken on a whim.
We (Vizrt) would happily contribute in any way we can towards reaching such an approved design. I personally have many years of experience from the media industry and have done integration work with several different NLE systems, and about 20 years overall of software development experience professionally.
@bergner these docs outline the process of developing a new schema: https://opentimelineio.readthedocs.io/en/latest/tutorials/developing-a-new-schema.html
Those were my thoughts too about this being an issue for an NLE. That is partly one of the reasons I put forward the current suggestion. :) No design has been settled and AFAIK, this hasn't been brought up in a TSC meeting yet. Having more feedback/proposals would be great, especially from more people in the domain. I would be interested to hear what your clip level proposal would look like, and why at that level?
I did try and start a prototype to see if the current idea will work for the AAF adapter AMA/Media linking use case, but haven't really spent that much time on it.
AFAIK this idea has been kicked around for a while, but never brought to TSC. A small working group to propose a schema, build some test .otio files with it and try it out would be great. I think the best person on the OTIO TSC would probably be @reinecke, but I'd love to see folks from the vendors weigh in too (@rogernelson or someone from one of the NLE vendors - adobe, etc).
Some further questions:
- For referenced things that don't have streams, what would the proper behavior be? Leave `stream_map` as `None`?
- Are there different kinds of streams that might not have "channels"? Or that might have other specific fields? In other words, would the overall map AND the stream specifications themselves need to be schema, or just the bigger map?
- What if you had an external reference to a USD file, for example?
> - for referenced things that don't have streams, what would the proper behavior be? Leave stream_map to be None?
> - are there different kinds of streams that might not have "channels"? Or that might have other specific fields? In other words, would the overall map AND the stream specification themselves need to be schema, or just the bigger map?
> - What if you had an external reference to a USD file?
Maybe it's best to limit this schema's scope to audio/visual media? Other media types could define their own mapping schema? Similar to how `ImageSequenceReference` is a more narrowly focused `MediaReference`?
Before trying to formulate some more concrete ideas/proposals here I'd like some clarifications on a few OTIO data model specifics, since I think understanding these are quite essential to coming up with a good design.
- Can an OTIO track contain more than one channel, e.g. a stereo mix? Premiere/Media Composer/FCP all have some notion of mono vs stereo vs some other audio track types, with some conversion mechanisms between some of them.
- If a track can contain more than one channel, should OTIO know about this on a data model level? E.g. how many channels the track contains and maybe some notion of "type" or "mix" on that track?
- Do the tracks in an OTIO timeline have to reflect the track structure of the underlying source media, or should you for example be able to reference channel X without knowing if the underlying source file has N tracks with 1 channel each, 1 track with N channels, or perhaps several stereo tracks? I believe both Premiere and FCP are sort of agnostic to this, whereas FFmpeg uses "track:channel" type references and hence requires knowledge of the track structure.
- From the OTIO docs: "The children of a Track can be any Composable object, which includes Clips, Gaps, Tracks, Stacks, and Transitions." Which of those Composable subclasses have a similar challenge to Clip + ExternalReference when it comes to selecting some track/channel subset (or implying some kind of mix to "fit" in the track)? I would guess at least a Stack within a Track falls in this category also, and if not, a clarification of the purpose and use case(s) of Stack on a Track would help. Potentially also Track within a Track, if for example the inner track contains a stereo mix and in the outer track you only want the first channel. This can in theory also apply to Clip + some other MediaReference type that provides a multi-track representation.
And as for my ideas about representing this at a higher level than ExternalReference (e.g. Clip/Item or thereabouts): that is mostly because I suspect other entities in the OTIO data model will face the same source track/channel selection problem, hence some of the above questions.
No responses here unfortunately but I'll proceed with more details anyway. This is my take on this problem, but it is notably based on an assumption that a Track in OTIO can currently only hold a single channel.
```
otio.core.SerializableObjectWithMetadata
 |-- otio.schema.Timeline
 |-- otio.core.MediaReference
 |    `-- otio.schema.ExternalReference
 `-- otio.core.Composable
      |-- otio.core.Transition
      `-- otio.core.Item
           |-- otio.core.Composition
           |    |-- otio.schema.Track
           |    `-- otio.schema.Stack
           |-- otio.schema.Clip
           `-- otio.schema.Gap
```
Clip contains a map of MediaReference, and an active media reference key.
Track contains a (horizontal) list of Composable(s) and a kind (Video or Audio).
Stack contains a (vertical) list of Track(s).
Timeline contains a Stack and currently has convenience methods audio_tracks() and video_tracks() that iterate over the children of its stack.
Code & data model observations
- If a Stack is specifically meant to contain Track(s), then the `video_tracks()` and `audio_tracks()` convenience methods should perhaps be on the Stack object, and Timeline's counterparts should just delegate to the Stack's methods.
- A Stack has the notion of an ordered list of tracks, and that, combined with the fact that a Stack can be placed on a Track, suggests that a Stack should perhaps expose an API for track selection, e.g. `audio_track(int index)` and `video_track(int index)`, or `get_track(const std::string &kind, int index)` (see the sketch after this list), and have a schema property that allows for a numeric channel/track reference. If unspecified it would mean "composite all video tracks" on Tracks of kind "video" and supposedly "first audio channel" on Tracks of kind "audio" (although the implicit audio case seems questionable to me if there is more than one audio channel in the nested stack).
- Clip + MediaReference/ExternalReference needs information about source track/channel selection, but so seemingly does a Stack on a Track. The common ancestor between Clip and Stack is the Item class. If both of them have the same fundamental source track selection concern, elevating the specification of the source track selection to the Item class seems like a reasonable approach to explore. This information would need to be propagated downwards to a media demuxer, but that's an external implementation concern; the more immediate goal here is to be able to express source track/channel selection in the various timeline formats where OTIO adapters exist. This is also in line with the current placement of `source_range` in the OTIO data model.
- If realistically ONLY Clip and Stack will face this problem, then it might be more suitable to just add the same property (e.g. `source_reference` or `source_selector`) to both Clip and Stack rather than Item. This would avoid polluting the schema for other subclasses of Item.
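For illustration, the proposed track-selection API could be sketched on top of the existing Python API like this (`get_track` is hypothetical, not an actual OTIO method):

```python
import opentimelineio as otio

def get_track(stack, kind, index):
    """Hypothetical Stack helper: the index-th child Track of a kind."""
    matches = [c for c in stack
               if isinstance(c, otio.schema.Track) and c.kind == kind]
    return matches[index]

stack = otio.schema.Stack()
stack.append(otio.schema.Track(name="V1", kind=otio.schema.TrackKind.Video))
stack.append(otio.schema.Track(name="A1", kind=otio.schema.TrackKind.Audio))
print(get_track(stack, otio.schema.TrackKind.Audio, 0).name)  # -> A1
```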
Documentation observations
- Documentation for Stack in a Track (https://opentimelineio.readthedocs.io/en/latest/tutorials/otio-timeline-structure.html#multiple-tracks) shows that all tracks in the stack are composited in last-to-first order (so the first track will render on top). Those examples contradict earlier Python code (prior to the C++ migration), which states that compositing order is first-to-last: https://github.com/AcademySoftwareFoundation/OpenTimelineIO/blob/cc78a7c5e9b10808d524edd00a09f8a11b21ce1a/opentimelineio/schema/stack.py#L25-L38
- If the last-to-first composition order is indeed the intended one, it is the reverse of what NLEs tend to do. In an NLE the first video track (V1) is the bottom-most track, but in the OTIO docs the first video track is the top-most track in the composition order. I think this should be more explicitly clarified in the documentation, since "adding another video track" usually means "add another video track on top of my existing tracks" in NLE speak, but in OTIO that operation involves shifting the track list and inserting a new first element, based on what the current documentation says.
- Having such automatic composition behavior for all video tracks of a Clip + ExternalReference, where the underlying file contains 2 or more video "output" tracks, seems undesirable. Multiple video tracks in the same file are much less common than multiple audio tracks in the same file, but at least in such a case I think the most common use case is "pick the video stream quality you want to use" rather than compositing both tracks on top of each other. If you really wanted to composite the two video tracks in the media file, you would need to create two OTIO tracks with a clip on each, each referencing one video track in the underlying file, or use a nested stack with that structure.
Initial conclusions
- Stack and Clip+MediaReference face very similar challenges here with the exception that Stack has different video track compositing behavior when dealing with multiple video tracks. Stack = composite all video tracks, Clip = select video track. For audio Stack and Clip are the same in that a channel would need to be selected in order to fit in the parent track.
- For audio we need channel selection, since an "audio track" in OTIO seems, afaict, to actually denote a "channel" and not a "potentially multi-channel track", because the Track schema has no notion at all of channel count or mix/type.
- Selecting a video track from a multi-track video file feels like the exact same case as selecting an audio channel from a multi-channel file or stack.
Basic design suggestion
- Add a `source_reference` or `source_selector` property to either Item or Clip+Stack.
- Define a SourceReference schema to allow future extensibility if needed.
- Initially the SourceReference can have a single numeric property: "channel". The type is implicit based on the Track's kind property. This is also usable on multi-track video to denote which video "channel" to select.
- Document usage of SourceReference and the limitations/semantics when SourceReference is not specified. E.g. it is equivalent to `"source_reference": { "OTIO_SCHEMA": "SourceReference.1", "channel": 0 }` (or 1 if we number from 1 to N).
Future extensibility scenarios
Track is extended to explicitly represent a multi-channel audio track that can contain multiple channels => it now becomes relevant to be able to select an underlying physical track from the file or potentially select multiple channels and group them in a multi-channel OTIO track. SourceReference with "channels" and/or "mix" perhaps.
Clip is extended to have a "group id" and maybe "group type" to allow multiple tracks to be grouped together in order to express that some channels from the same source belong together and can be operated on as a unit => setting such group parameters is independent of the selected channel and only serves to convey additional semantics about relation between some channels. No changes to SourceReference.
Allow track + channel based source references => select a channel based on physical track number and channel number within that track. We would need to add a "track" property to SourceReference, and if that property is present the "channel" value is interpreted as meaning the channel within the specified track. Note that such an extension imposes a restriction that all underlying media references must have the same physical track structure, which is why this is not proposed now.
Allow selecting multiple channels and mix them to a single channel => "channels" and "mix" attributes in SourceReference denoting how to combine them. The mix must result in something that "fits" in the parent track, i.e. if parent track denotes a single channel the result must be a single channel. This very much feels dependent on tracks having some degree of "multi-channel support" since mixing multiple channels down to a single channel is not really common.
```json
{
    "OTIO_SCHEMA": "Track.1",
    "kind": "audio",
    "children": [
        {
            "OTIO_SCHEMA": "Clip.1",
            ...
            "source_reference": {
                "OTIO_SCHEMA": "SourceReference.1",
                "channel": 2
            }
        },
        {
            "OTIO_SCHEMA": "Stack.1",
            ...
            "source_reference": {
                "OTIO_SCHEMA": "SourceReference.1",
                "channel": 0
            }
        }
    ]
}
```
Hi, Just checking in to see if multichannel audio is supported yet? I can't seem to get my python script to generate an MOV's stereo audio stream into a clip on an audio track. I'm not sure if this is either unsupported or I'm just not doing it properly. Cheers, Josh
I was playing around with my proposal and found a potential issue.
One very useful use case is using multi-references to represent the same media in different file formats. Each external reference has the desired data, but in different stream and channel locations. For example:
```json
{
    "OTIO_SCHEMA": "Clip.2",
    "name": "audio_clip_right_channel",
    "active_media_reference": "DEFAULT_MEDIA",
    "media_references": {
        "DEFAULT_MEDIA": {
            "OTIO_SCHEMA": "ExternalReference.2",
            "name": "mp4_container",
            "stream_map": {
                "OTIO_SCHEMA": "StreamMapping.1",
                "active_mapping": [["A1", 1]],
                "streams": [
                    {"name": "V1", "channels": 3, "kind": "video", "metadata": {}},
                    {"name": "A1", "channels": 2, "kind": "audio", "metadata": {}}
                ]
            },
            "target_url": "file://path/to/file.mp4"
        },
        "high_quality": {
            "OTIO_SCHEMA": "ExternalReference.2",
            "name": "audio_right_channel",
            "stream_map": {
                "OTIO_SCHEMA": "StreamMapping.1",
                "active_mapping": [["A0", 0]],
                "streams": [
                    {"name": "A0", "channels": 1, "kind": "audio", "metadata": {}}
                ]
            },
            "target_url": "file://path/to/file.wav"
        }
    }
}
```
This example should have enough data for an app to map the right audio channel correctly.
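For illustration, here is how a consumer might look up the active reference's proposed mapping using the real multi-reference Clip API, with the hypothetical `stream_map` payload carried in `metadata`:

```python
import opentimelineio as otio

# Clip-level multi-references are real OTIO API (Clip.2); the
# "stream_map" payload below is the hypothetical field from this
# thread, stored in metadata purely for illustration.
default = otio.schema.ExternalReference(target_url="file://path/to/file.mp4")
default.metadata["stream_map"] = {"active_mapping": [["A1", 1]]}
hq = otio.schema.ExternalReference(target_url="file://path/to/file.wav")
hq.metadata["stream_map"] = {"active_mapping": [["A0", 0]]}

clip = otio.schema.Clip(name="audio_clip_right_channel")
clip.set_media_references(
    {"DEFAULT_MEDIA": default, "high_quality": hq}, "DEFAULT_MEDIA"
)
active = clip.media_reference  # follows active_media_reference_key
print(active.metadata["stream_map"]["active_mapping"])  # -> [["A1", 1]]
```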
The issue is that if OTIO ever implements instancing, media references wouldn't be as reusable as they could be. The `active_mapping` being part of the `StreamMapping` object, and thus the `MediaReference` object, means every new mapping has to be defined as an independent MediaReference. Kind of ugly in my opinion, but maybe it's not as bad as it sounds.
@meshula and I had a conversation quite a while ago about this and came up with some ideas.
Some Thoughts
For the purpose of getting movement on this, maybe it's best to start from here:
- A given OTIO track can represent an edit on N channels - this means a track can be "stereo" or "surround" in the output.
- The number of channels and their semantic mapping (i.e. what speakers they route to) is consistent for the duration of the track (for now).
- The addressing scheme of track/stream may vary based on the underlying media container format. For instance, left and right audio may be represented in the container as a single track with two channels, as multiple tracks with a single channel each, or the container format may not even have a concept of a "track" and simply have a number of channels. This means semantic groupings of audio channels may not match how they're grouped in the container's storage. We should be flexible and extensible in how we handle this and let that be abstracted at the Media Reference level.
MediaReference
The goals at the MediaReference level should be:
- Identify which discrete sampled media is available within the referenced asset, in a way that a consuming application could address it.
- Provide a semantic abstraction so those media can be used within the composition independent of their encoding within the container.

The MediaReference will have an `available_channels` mapping which stores key/value pairs of channel label to data about how to address those channels in the container.
So, practically, a simple data model for identifying media in a QuickTime MOV:
```json
{
    "OTIO_SCHEMA": "ExternalReference.1",
    "available_channels": {
        "Video": {
            "mono": { "stream": 0 }
        },
        "Audio": {
            "left": { "stream": 1, "channel": 0 },
            "right": { "stream": 1, "channel": 1 }
        }
    }
    ... other MediaReference fields
}
```
In this context, we are mapping what MOV calls a "track" to something we're calling a "stream" - which is more in line with ffmpeg's abstraction. We'll likely need to design schemes for this on a per-container format basis.
This concept may be extensible to a MOV with stereoscopic video as two separate video tracks. It could look something like:
```json
{
    "OTIO_SCHEMA": "ExternalReference.1",
    "available_channels": {
        "Video": {
            "left": { "stream": 0 },
            "right": { "stream": 2 }
        },
        "Audio": {
            "left": { "stream": 1, "channel": 0 },
            "right": { "stream": 1, "channel": 1 }
        }
    }
    ... other MediaReference fields
}
```
The keys in this mapping provide a semantic name that can be used to refer to the source's channel from then on.
Track
A track represents what will be a logical source for the composition output. At the track level, just the output channel labels are defined, as a list where the order defines the channel layout; `track.kind` defines the media type.
A typical stereo track would have an entry like:
```json
{
    "OTIO_SCHEMA": "Track.1",
    "output_channels": ["left", "right"],
    "kind": "Audio"
    ... other Track fields
}
```
Clip
A clip is the usage of a given piece of media in the composition - therefore it can use a 2D matrix where each entry is the contribution level of a given source to a given output.
To define the positional correlation of entries in the row vector, a clip has a `source_channels` array. The positions in the column vector are defined by the track's `output_channels`.
A clip's model might look like this:
```json
{
    "OTIO_SCHEMA": "Clip.1",
    "source_channels": ["left", "right"],
    "audio_channel_matrix": [
        [1.0, 0.0],
        [0.0, 1.0]
    ]
    ... other Clip fields
}
```
The matrix above describes an identity mapping where the "left" and "right" audio from the source are mapped to left and right on the output, as illustrated below:
| output \ source | left | right |
|---|---|---|
| left | 1.0 | 0.0 |
| right | 0.0 | 1.0 |
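To make the row/column convention concrete, here is a small plain-Python sketch of applying such a matrix to one frame of source samples (the sample values are made up):

```python
# Rows follow the track's output_channels, columns the clip's
# source_channels; each output sample is a weighted sum of the sources.
source_channels = ["left", "right"]
output_channels = ["left", "right"]
matrix = [
    [1.0, 0.0],  # output "left"
    [0.0, 1.0],  # output "right"
]
source_samples = {"left": 0.25, "right": -0.5}  # one frame of audio

for out_name, row in zip(output_channels, matrix):
    value = sum(coeff * source_samples[src]
                for src, coeff in zip(source_channels, row))
    print(out_name, value)  # identity mapping: left 0.25, right -0.5
```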
If you wanted to pan the source left channel to the output right a little, you would make a matrix like this:
```json
{
    "OTIO_SCHEMA": "Clip.1",
    "source_channels": ["left", "right"],
    "audio_channel_matrix": [
        [0.8, 0.0],
        [0.2, 1.0]
    ]
    ... other Clip fields
}
```
MediaRef swapping
Let's suppose, however, that you switch the MediaReference on the clip to point to another source with 5.1 audio, but want to output to the same stereo track.
The MediaReference for that source might be:
```json
{
    "available_channels": {
        "Video": {
            "mono": { "stream": 0 }
        },
        "Audio": {
            "left_front": { "stream": 1, "channel": 0 },
            "right_front": { "stream": 1, "channel": 1 },
            "center": { "stream": 1, "channel": 2 },
            "left_rear": { "stream": 1, "channel": 3 },
            "right_rear": { "stream": 1, "channel": 4 },
            "low_frequency_effects": { "stream": 1, "channel": 5 }
        }
    }
}
```
This new media ref has no overlapping sources with our original matrix. However, we can append additional channels from this media reference to the clip's `source_channels` list and add matrix entries to downmix the new channel set:
```json
{
    "OTIO_SCHEMA": "Clip.1",
    "source_channels": [
        "left",
        "right",
        "left_front",
        "right_front",
        "center",
        "left_rear",
        "right_rear",
        "low_frequency_effects"
    ],
    "audio_channel_matrix": [
        [ 1.0, 0.0, 1.0, 0.0, 0.707, 0.5, 0.0, 0.0 ],
        [ 0.0, 1.0, 0.0, 1.0, 0.707, 0.0, 0.5, 0.0 ]
    ]
    ... other Clip fields
}
```
The above matrix describes a left-only/right-only (Lo/Ro) downmix from a 5.1 source, as described on page 95 of ATSC A/52:2012 (at least as best as I understand it). At playback time, if the matrix contains a source channel not in the current MediaReference, it's simply ignored. This allows MediaReferences to be swapped.
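A small sketch of that ignore-missing-channels rule: a player could drop matrix columns whose source channel isn't available in the current MediaReference (channel names taken from the examples above):

```python
# Columns whose source channel is absent from the current
# MediaReference's available audio channels simply drop out.
source_channels = ["left", "right", "left_front", "right_front",
                   "center", "left_rear", "right_rear",
                   "low_frequency_effects"]
matrix = [
    [1.0, 0.0, 1.0, 0.0, 0.707, 0.5, 0.0, 0.0],  # output left
    [0.0, 1.0, 0.0, 1.0, 0.707, 0.0, 0.5, 0.0],  # output right
]
# Channels advertised by the 5.1 media reference above:
available = {"left_front", "right_front", "center",
             "left_rear", "right_rear", "low_frequency_effects"}

keep = [i for i, name in enumerate(source_channels) if name in available]
effective = [[row[i] for i in keep] for row in matrix]
# The stereo "left"/"right" columns are ignored; the 5.1 downmix remains.
print(effective)
```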
Open Questions
- What are the right units to use in the matrix? Right now I'm assuming a coefficient.
- I haven't worked much with stereo video; I'd love to hear some examples of common ways it's represented in container formats to try out this modeling.
- This is designed around channel labels with semantic meaning. How does this work with production audio recording that might use character names? How do we handle files with no channel identification? How do we avoid unexpected conflicts?
- Will we have a set of "standardized" names like we do with `TrackKind`?
- I did not go through the exercise of mapping Dolby Atmos object-based audio into this. I'd love to hear thoughts from the community (maybe Blackmagic has some thoughts about that from the Fairlight perspective).
Curveball: What about data channels like timecodes?
@timlehr Love it - can you pitch some use cases of what you'd like to do? Subtitles/Captions seem like a candidate too maybe?
I'm liking it :)
Can there be more than one audio matrix? You demonstrate 7.1 to stereo, but I can easily imagine I might wish to state, if your output is 7.1 use this identity matrix, if 5.1 use this matrix, and otherwise here's a stereo fallback.
Re: Stereo Video. I come from the VFX world, so I look to OpenEXR for how they handle multiple views. It's worth noting there are certain use cases (VR/360 shoots) where there would be more than two views.
@markreidvfx briefly mentioned this in a previous comment above, but the ability of an EXR image sequence (or PSD / TIFF) to store multiple layers could work here for the picture side of things.
It's common for additional layers to be included in an EXR image sequence in the VFX world. Different lighting passes from a renderer, depth information, mattes, motion vectors, etc.
Nomenclature might get a bit confusing because audio has channels that belong to streams (left, center, right, left_surround, right_surround, lfe, etc.) and video will have channels that belong to layers (red, green, blue, alpha, custom, etc.). At least that's the naming Nuke uses which is what I'm familiar with.
Thanks everyone for the engagement! I have a few new thoughts/responses to our discussion.
Matrix Modeling in Clip
Naming
In the previous proposal, the field holding the mapping matrix is called `audio_channel_matrix`, which makes the field only applicable to audio. I think we should make the field more consistent with `available_channels` and convert it to a mapping called something like `channel_matrices`, where each key is a `MediaKind` string and the matrix is stored in the value.
I think this makes sense because, when I think about code, it may be common to have a function that creates a clip from metadata about a piece of media on disk; this allows users to have a self-contained function that does that and lets them place copies of that clip across both `Video` and `Audio` tracks.
More Explicit Matrix Labels
In my previous proposal, the matrices are expressed as an array of arrays. While this is the most compact form for this information, if someone changes either the track's `output_channels` or the clip's `source_channels`, then they need to be judicious about re-building all the impacted clip matrices in the data model. This could be error prone and near impossible to sort out if something is done wrong.
An alternate way this could be expressed on the clip would be to model the matrices as nested objects, where the matrix is expressed as a mapping of output channel to source contribution coefficients. We could then also omit the `source_channels` array from the clip. This makes it easier to detect and resolve inconsistencies, at the cost of making the files a bit more verbose.
With both the proposed changes, the Lo/Ro downmix example would look like this:
```json
{
    "OTIO_SCHEMA": "Clip.1",
    "channel_matrices": {
        "Audio": {
            "left": {
                "left": 1.0,
                "right": 0.0,
                "left_front": 1.0,
                "right_front": 0.0,
                "center": 0.707,
                "left_rear": 0.5,
                "right_rear": 0.0,
                "low_frequency_effects": 0.0
            },
            "right": {
                "left": 0.0,
                "right": 1.0,
                "left_front": 0.0,
                "right_front": 1.0,
                "center": 0.707,
                "left_rear": 0.0,
                "right_rear": 0.5,
                "low_frequency_effects": 0.0
            }
        }
    }
    ... other Clip fields
}
```
I think this solution also bolsters local reasoning within a clip; it's easier to decode what's expected from the mapping.
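For illustration, flattening the verbose form back into the compact row/column form is straightforward; here is a sketch (the values are abbreviated from the Lo/Ro matrix above):

```python
# Rows are ordered by the track's output_channels; columns by the
# source-channel names found in the nested mapping (abbreviated here).
channel_matrices = {
    "Audio": {
        "left":  {"left": 1.0, "right": 0.0, "center": 0.707},
        "right": {"left": 0.0, "right": 1.0, "center": 0.707},
    }
}
output_channels = ["left", "right"]  # from the parent track

audio = channel_matrices["Audio"]
source_channels = list(audio[output_channels[0]].keys())
matrix = [[audio[out].get(src, 0.0) for src in source_channels]
          for out in output_channels]
print(source_channels)  # ['left', 'right', 'center']
print(matrix)           # [[1.0, 0.0, 0.707], [0.0, 1.0, 0.707]]
```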
Specific Responses
First, I have some responses to my own questions based on conversations I've had with smarter people than me:
> What are the right units to use in the matrix? Right now I'm assuming a coefficient.
I think coefficients are the way to go. In addition to their being potentially media type agnostic, a colleague gave me this piece of feedback:
> I don't recommend dB because you would need to mute channels and be forced to represent negative infinity somehow.
One discussion to have is: if they are coefficients, what does that mean for video? Opacity? What does that mean for a timed text track? Perhaps the data type of the matrix values is media type dependent, and maybe video and timed text use `true`/`false`. There are some semantics we may want to consider developing code-level enforcement for.
> This is designed around channel labels with semantic meaning. How does this work with production audio recording that might use character names? How do we handle files with no channel identification? How do we avoid unexpected conflicts?
For audio at least, channel naming is often not intrinsic to the container formats; we often rely on inference to understand what those channels' roles are. I think we should establish a convention that each source channel must have a unique name; a "well-known" name is preferred where possible, but any appropriate unique name for the context is allowable.
> Will we have a set of "standardized" names like we do with TrackKind?
Yes, we should.
> I did not go through the exercise of mapping Dolby Atmos object-based audio into this.

After consultation with colleagues, Dolby Atmos is a system in itself that is beyond what we probably want to model. In that case we might do something like specify the mixdown, and the channel to select from that mixdown, in the source's `available_channels`.
@meshula asked:
> Can there be more than one audio matrix? You demonstrate 7.1 to stereo, but I can easily imagine I might wish to state, if your output is 7.1 use this identity matrix, if 5.1 use this matrix, and otherwise here's a stereo fallback.
It's an interesting thought. In the new proposal, we could nest `channel_matrices` one level deeper, with the top-level key being the channel mapping. However, I feel like the typical use case for that might be a bit more in the presentation domain than the editorial intent domain. Most editors are going to use something like a "Timeline Settings" dialog to set the presentation audio channel configuration. Various software may automatically mix that down for the hardware setup that's available, but the editor would likely still be authoring with intent for a specific channel configuration.
That said, I can envision that when switching between a 5.1 and a 7.1 media reference, for instance, the weights might need to be re-balanced. I'm reluctant to introduce the complexity because it feels like a lot to manage for what might be a more niche use case.
@camkerr mentions:
> Nomenclature might get a bit confusing because audio has channels that belong to streams (left, center, right, left_surround, right_surround, lfe, etc.) and video will have channels that belong to layers (red, green, blue, alpha, custom, etc.).
This implies we may want to namespace our "well-known" names, but that may be somewhat covered by the fact that all the mappings include media type namespacing. I think working through some test cases may be the best way to surface this kind of awkwardness.
@camkerr also mentioned:
> Re: Stereo Video. I come from the VFX world, so I look to OpenEXR for how they handle multiple views. It's worth noting there are certain use cases (VR/360 shoots) where there would be more than two views.
The `view` key from OpenEXR would be a great identifier to use within the `available_channels` objects; this feels like it might fit nicely into our proposed model.
I'm not opposed to the "verbose matrix" formulation, but I'm not sure I'm convinced by the "easier to reason about it" argument. I'm expecting that when one works with a bussing matrix, you've got tools that optimally help you through the user interface. With the compact form, and with the verbose form, in neither case would I expect someone to stare at the JSON and reason about how things might sound or whatever. Also, parsing the verbose form is ambiguous. What's the following?
"left": {
"left": 1.0,
"right": 0.0,
"center": 0.707,
"left_rear": 0.5,
},
Is it 5.1? 7.1? Would you require that 7.1 is exhaustively encoded, and then we'd validate that all seven channels are accounted for?
I see the value in having "known names" for the common set ups, as it means that a 5.1 set up maps to a 5.1 set up no matter where you take it. But I'm not liking the idea of needing to apply heuristics to deduce the intended mix.
That's not the only use case, though, of transporting a common setup from here to there. As originally conceived, I could author a game-oriented file, where the output channels for sounds meant to be player-specific might be named:
```
character voice
lightsaber
blaster
player fx
foley
headset
```
I might be marshalling audio for a bespoke audio stage, where I have speakers located equally spaced on pillars in the space, and the names might be:
```
pillar-east
pillar-west
pillar-north-upper
etc
```
So I prefer that, whether or not there are special names corresponding to standard broadcast setups, we also accommodate non-broadcast use cases.
Edit: [I'm confused by the assertion that you'd be forced to represent a value of negative infinity decibels; I'd like to know more about that, as it's not something I've personally ever needed in a composition.] My confusion is irrelevant, we're all agreeing about coefficients :)
The nice thing about the verbose matrix formulation is that we don't have to worry about de-indexing, so I've come around to that.