KHR_audio_emitter
This extension adds the ability to store audio data and represent both global and positional audio emitters in a scene.
This extension is intended to unify the OMI_audio_emitter and MSFT_audio_emitter extensions into a common audio extension.
Members from the Open Metaverse Interoperability Group, Microsoft, and other Khronos members met on 3/23/2022 to discuss how we might unify OMI's work on audio emitters and Microsoft's previous efforts. OMI_audio_emitter has a subset of the MSFT_audio_emitter extension's features, eliminating features that might be out of scope for a more general extension. KHR_audio_emitter also has some spec changes from the original OMI_audio_emitter requested by Khronos members. There are still some outstanding questions on audio formats, MP3 vs WAV, and what features within these formats should be supported.
We're looking forward to working together on this extension and bringing audio to glTF!
Love that this space is getting more attention. I am a little concerned about the elimination of a number of features that MSFT_audio_emitter has. Some limitations are understandable, such as the integration with animation; I expect the thinking is that the model and its audio are driven externally by the loading system, though this approach may give an artist less control, or at minimum a more complicated workflow. Other limitations I'm not sure I understand, such as limiting to one emitter per node. This would just lead to having a number of child nodes for scenarios that require it.
I'd like to remind folks of a demo video we produced for MSFT_audio_emitter that was completely data driven.
I wanted to chime in here on the randomized audio clips. I'm generally opposed to having specific fixed-function features in glTF where not necessary, because those features will have to be implemented by everyone and maintained, even in the future when extensions such as KHR_animation2 have been developed more.
That said, the demo you link is really cool. I can't remember the original document but I did see that randomizing environment audio adds a lot to immersion, compared to having a looping set of tracks that play one after the other.
So regarding randomized clips, I'm a bit torn here. I might also suggest a middle ground: rather than requiring a weighting/randomization system, allow multiple clips per audio emitter, but leave it up to the application or future extensions to implement the randomization / support selecting audio clips (and otherwise allow just playing the first clip).
As for multiple emitters per node, I would suggest this is not necessary: it would be very easy to add a child node (with no translation, rotation or scale) with another emitter on it. This is similar to how each node only has one mesh, but multiple meshes can be easily added as child nodes.
Updated the example C++ source code for KHR_audio here: https://github.com/ux3d/OMI/tree/KHR_audio
Yeah we talked about this extensively in the OMI glTF meeting yesterday. I'm personally on the side of making this extension as simple as possible. I also realize now that it is a bit odd that we allow multiple emitters on the scene, but not nodes.
Given this feedback my recommendations are:
- Maybe we want to expand the spec to add basic mixing support via supporting multiple audio sources on an audio emitter, each with their own gain value.
Here's a proposal with multiple inputs per emitter. There is a gain value on the emitter as well as each of the sources as you would usually see in a mixer. In this proposal the only inputs are audio sources, but you could imagine other audio processing nodes in there as well, similar to the WebAudio API. I think we want to make this spec as simple as possible without limiting future extensions to add more features and adding mixing to the core spec isn't a huge ask.
{
  "emitters": [
    {
      "name": "Positional Emitter",
      "type": "positional",
      "gain": 0.8,
      "inputs": [0, 1],
      "positional": {
        "coneInnerAngle": 6.283185307179586,
        "coneOuterAngle": 6.283185307179586,
        "coneOuterGain": 0.0,
        "distanceModel": "inverse",
        "maxDistance": 10.0,
        "refDistance": 1.0,
        "rolloffFactor": 0.8
      }
    }
  ],
  "sources": [
    {
      "name": "Clip 1",
      "gain": 0.6,
      "playing": true,
      "loop": true,
      "audio": 0
    },
    {
      "name": "Clip 2",
      "gain": 0.6,
      "playing": true,
      "loop": true,
      "audio": 1
    }
  ],
  "audio": [
    {
      "uri": "audio1.mp3"
    },
    {
      "bufferView": 0,
      "mimeType": "audio/mpeg"
    }
  ]
}
- We may want to only allow for one global emitter in light of what I'm proposing with mixing audio. You'd just add multiple inputs for the global emitter.
- For clip randomization I do think we should be introducing that through some other more generic event-based system. Having bespoke behavioral features in the base spec doesn't make a whole lot of sense to me. Clip randomization sounds very useful, and the demo shown above is awesome, but we could also use this feature for randomizing active idle animations or a material color.
- KHR_animation2 should be able to target some of the properties in this spec to play one-shot clips multiple times, or start looping a clip after a certain frame.
- We might want to add loopStart and loopEnd properties to audio sources so that a single audio file can be used and the source samples a section from that file (see the sketch after this list).
- We need to add language to the spec to clarify that positional audio emitters "MUST" use a mono audio source.
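To make the loopStart / loopEnd idea concrete, here is a minimal sketch (the property names and second-based units are only a suggestion, not part of the current draft): the source plays the file from the beginning, then cycles the region between loopStart and loopEnd until it is stopped.

{
  "sources": [
    {
      "name": "Engine Hum",
      "gain": 0.6,
      "playing": true,
      "loop": true,
      "loopStart": 4.0,
      "loopEnd": 12.0,
      "audio": 0
    }
  ]
}

Here the first four seconds would act as a one-shot lead-in, and playback would then repeat between seconds 4 and 12.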
Just want to voice a +1 to Robert's proposed changes above, particularly the bits around having multiple sources in the array. I've implemented the OMI audio spec in my WordPress plugin ( https://3ov.xyz ) and this change is early enough and makes enough sense that it will not impact my current users.
Also a +1 on having one global emitter and having inputs that feed into that global.
I don't feel strongly about loopStart and loopEnd but do see the benefits.
We got feedback on KHR_animation2. It has been renamed to KHR_animation_pointer and we are making good progress on it:
https://github.com/KhronosGroup/glTF/pull/2147
One area that I'd love to see discussed here is behaviour under root scale for AR, as this is something currently broken in both SceneViewer and QuickLook (since it was unspecified, I guess). Happy to provide sample files.
Consider the typical scenario of placing a model in AR with model-viewer, and the model has spatial audio. Should scaling the model down to 10% of the size
- assume that the model is now small, but the distance to the viewer is approximately the same?
- assume that the viewer is now large and the model hasn't changed scale, and thus the distance to the viewer is now much larger?
Unfortunately such considerations were omitted from lighting in glTF, and thus lighting in AR is also "broken everywhere" right now :)
To explain why that's not trivial, here's my take (and happy to take the lighting discussion elsewhere, here for context):
- audio volume should work as if the model became smaller (if I scale an AR radio down, I want to hear it at the same loudness)
- audio spatialization should work as if the viewer got bigger (if I have 3 spread out emitters in a space and scale that space down, I still want clear spatial separation between these sources)
- lighting effects should work as if the viewer got bigger (if I scale a lamp with a spot light down, I don't want the light to become brighter, it should look the same just smaller)
One area that I'd love to see discussed here is behaviour under root scale for AR, as this is something currently broken in both SceneViewer and QuickLook (since it was unspecified, I guess).
If needed, perhaps this could be specified under the model3d XMP Namespace in KHR_xmp_json_ld, alongside model3d:preferredSurfaces for AR? I'm nervous to bring much wider-scoped issues ("how to compose the glTF asset within a larger scene of unknown format for unknown purpose") into KHR_ extensions; we try to define what the content "means" rather than how various categories of applications (AR, HTML viewer, game, etc.) might use it.
I generally agree, but
we try to define what the content "means"
is exactly what I'm trying to say: I think an extension adding audio (or lights) should talk about how that audio works. AR is a strong use case for glTF, and the behaviour would simply be "undefined" (every viewer would do something different) if it is not explained in the extension that adds audio, in my opinion. It would have been better if such topics ("behaviour under different types of scales") had been part of the core spec, of course - though if they had, responsibility would probably have been forwarded to the extensions anyway...
I can just say: SceneViewer and QuickLook (glTF and USDZ, respectively) allow for the usage of audio and lights, and both haven't defined/thought about how it behaves under AR scale, so right now we can't use it for many projects where we'd like to. If it stays unspecified for KHR_audio, it's immediately the same issue.
I'll take a deeper look at KHR_xmp_json_ld! Could you let me know where I can find more about model3d:preferredSurfaces and other agreed-upon properties? For me that extension so far sounded like "structured extras", but it sounds like there's actually more to it?
(Edit: only hint I found was in the Khronos blog)
Could you let me know where I find more about model3d:preferredSurfaces and other agreed-upon properties?
Probably the place to start would be Recommendations for KHR_xmp_json_ld usage in glTF.
I think I'm worried that "different types of scale" as a concept has a lot to do with emerging ideas of an "AR viewer" vs. other application types, and that these definitions may change drastically on a timeline of 1-2 years or even months. Embedding these context-specific requirements into KHR_punctual_lights (for example) could cause the extension to become outdated far sooner than it might otherwise. With an XMP namespace there is a proper versioning system and more flexibility to evolve or adapt to specific contexts. I suspect the same applies to KHR_audio.
- We need to add language to the spec to clarify that positional audio emitters "MUST" use a mono audio source

Also, that global audio emitters "MUST" use a stereo source. Technically, a mono source would work as well, but this would probably just cause complaints from end users.
Technically, a mono source would work as well, but this would probably just cause complaints from end users.
Yup that makes sense. We could remove the stereo audio source requirement from the global emitter.
we try to define what the content "means" rather than how various categories of applications (AR, HTML viewer, game, etc.) might use it.
We spoke about this during the OMI glTF meeting and KHR_audio should be defined within the frame of reference of the glTF document. We agree that the document should define what the content "means", but the behavior of scaling the content should be up to your application.
However, if an animation in the document is controlling the scaling of the node we should define that behavior and maybe that should inform best practices in this AR viewer use-case.
So in the case of maxDistance, what should happen with it when the node is scaled down to 10% of its original scale?

maxDistance
The maximum distance between the emitter and listener, after which the volume will not be reduced any further. maximumDistance may only be applied when the distanceModel is set to linear. Otherwise, it should be ignored.

Should this respect node scale?
Note that in Unity, Audio Emitter scale is separated from the GameObject scale. But in the MSFT_audio_emitter spec, it does mention that audio attenuation should be calculated in the emitter space.
So in the case of maxDistance, what should happen with it when the node is scaled down to 10% of its original scale?
maxDistance
The maximum distance between the emitter and listener, after which the volume will not be reduced any further. maximumDistance may only be applied when the distanceModel is set to linear. Otherwise, it should be ignored.
Should this respect node scale?
I suggest that the behaviour is like in the KHR_lights_punctual extension:
https://github.com/KhronosGroup/glTF/tree/main/extensions/2.0/Khronos/KHR_lights_punctual
"The light's transform is affected by the node's world scale, but all properties of the light (such as range and intensity) are unaffected."
"Light properties are unaffected by node transforms — for example, range and intensity do not change with scale."
So, the final position of the audio is affected by scale, but not the properties of audio.
This is unfortunately exactly what breaks all existing viewers in Augmented Reality mode, where users can "make the content smaller or bigger" without any agreement on what that means – is it the viewer getting bigger or the content getting smaller or a mix of both. See my comment above for some more descriptive cases. "node's world scale" is well defined inside one glTF scene but isn't well defined when placing that scene in another context. There's an additional "scene scale" of sorts.
This is unfortunately exactly what breaks all existing viewers in Augmented Reality mode, where users can "make the content smaller or bigger" without any agreement on what that means – is it the viewer getting bigger or the content getting smaller or a mix of both. See my comment above for some more descriptive cases. "node's world scale" is well defined inside one glTF scene but isn't well defined when placing that scene in another context. There's an additional "scene scale" of sorts.
We should really define this consistently inside glTF. A pure glTF viewer should behave like I described. I get your point, and your use case is using glTF in another application. In that case it is not glTF that defines the behaviour, it is the AR viewer. And the next time, you import the glTF into a game engine, and there it is different again.
BTW, Blender behaves this way, e.g. you can try it out with a point light radius. It stays the same, independent of the scale.
Personally I think this says "we should only care about what happens inside the ivory tower"... I'm not sure what a "pure" glTF Viewer is, everything has a context.
I created a new issue to track this problem outside of just KHR_audio:
- https://github.com/KhronosGroup/glTF/issues/2162
See https://github.com/KhronosGroup/glTF/pull/2137#issuecomment-1129011018 — I think it is fine to try to define what you're asking for, but I do not think that requirement belongs in KHR_audio or KHR_lights_punctual. A specification that can properly deal with the ambiguities of novel and rapidly changing contexts is required to do what you suggest, and that will burden these extensions too much.
Hm, I don't think it belongs tucked away in some unspecified metadata either. Otherwise a case could be made that 95% of extensions would be better suited as metadata.
If you read through the issue I opened, I explicitly mention there that
- ideally the discussions and definitions around this belongs elsewhere (not into either KHR_audio or KHR_lights_punctual)
- extensions should be encouraged to call out how they intend to behave
- I don't believe this needs to be normative.
Hey everyone, we've been discussing this spec over the past couple months and we have a few changes that allow for some of the requested features above.
First is splitting audio data and audio sources. Audio Data is similar to an Image where it has the uri or bufferView and mimeType properties. Audio Sources are intended to be used for controlling playback state on one or more Audio Emitters. An Audio Emitter can also reference multiple Audio Sources, which allows you to mix multiple audio tracks through an emitter.
We've also changed playing back to autoPlay, which signifies that an audio source should play on load.
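As a rough sketch of how that split might look in JSON (the array and property names here follow the earlier example in this thread and are assumptions; the final spec text may differ, including how an emitter references its sources):

{
  "audio": [
    { "uri": "ambience.mp3" },
    { "bufferView": 0, "mimeType": "audio/mpeg" }
  ],
  "sources": [
    { "name": "Ambience", "gain": 0.6, "autoPlay": true, "loop": true, "audio": 0 }
  ],
  "emitters": [
    { "name": "Global Emitter", "type": "global", "gain": 1.0, "sources": [0] }
  ]
}

The audio array only carries the data (uri or bufferView plus mimeType), the source adds playback state (autoPlay, loop, gain), and the emitter mixes one or more sources.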
We'd love to get everyone's feedback on these changes!
Also, I believe we (The Matrix Foundation) have submitted our application to join Khronos. So I will hopefully be around to participate during the working group meetings in the near future. Hopefully that helps move this spec (and others) forward a little faster.
Hi, just been sent this thread in response to a message I put out on social media to see if it's possible to embed a wav into a glb model. I use Blender to make the models and want them to play audio when uploaded to Sketchfab or an NFT platform. Is there a simple way I can accomplish this? If not then I'm happy to pay someone to help!
Best, Dan
@DanAbel77 Sketchfab doesn't support embedded audio in glTF files, you need to upload it manually. Most NFT platforms use model-viewer or variants thereof which also doesn't support embedded audio in glTF files yet - my understanding is that it might at some point in the future once a) this extension is ratified and b) it's implemented in three.js. In the meantime, you'll most likely need custom solutions/custom viewers.
Ok thanks, that's saved me a lot of time searching for ways of doing it. Will keep my eyes on this thread for any future developments in this area and feel free to link me up to any other threads or rooms that you think might be worth keeping an eye on. Would be a huge step forward being able to bake wavs into gltfs!
I've added a PR for the first sample asset here: https://github.com/KhronosGroup/glTF-Sample-Models/pull/360/
I'm still working on open sourcing all our tooling for KHR_audio. We released the Third Room Unity Exporter yesterday which has support for exporting KHR_audio.
We also have an implementation of KHR_audio for glTF-transform. I'm still working on the PR to get this upstream into glTF-Transform.
And then Third Room itself has an implementation of loading KHR_audio assets: https://thirdroom.io
We also have a glTF viewer where you can drag and drop any models to test them out: https://thirdroom.io/viewer
And finally we have a hosted version of our glTF transform pipeline available here: https://thirdroom.io/pipeline
No pitch field for sources?
May I ask how the looping is supposed to be configured with regard to the run-up and run-down parts of a sound? Imagine a 5-second-long locomotive horn. It consists of 3 sections: run-up, retention, run-down. In this scenario only the retention part actually needs looping, depending on the desired total length.
- If the user pushes the button for a short enough time, then only the run-up part needs to be played, directly followed by the run-down part.
- If the user holds the button for a long time, then after playing the run-up part, the middle retention part needs to be looped for the desired length, and at the end the run-down part follows.
- The middle part may also need to be sub-sectioned, because if the user holds the control key for e.g. only 1 s, then there should be a way to cut the loop in the middle, preferably at points predefined by the author. Just cutting the sound at a random point and merging in the run-down part can cause an audible glitch.
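Assuming the loopStart / loopEnd properties proposed earlier in this thread (still hypothetical), one way to model this is to slice the horn into three sources and loop only the retention section; the sequencing logic itself (when to leave the loop and play the run-down) would have to live in the application or a future event/behavior extension rather than in this data:

{
  "audio": [
    { "uri": "horn_runup.wav" },
    { "uri": "horn_retention.wav" },
    { "uri": "horn_rundown.wav" }
  ],
  "sources": [
    { "name": "Run-up", "audio": 0, "loop": false },
    { "name": "Retention", "audio": 1, "loop": true, "loopStart": 0.0, "loopEnd": 2.0 },
    { "name": "Run-down", "audio": 2, "loop": false }
  ]
}

Author-defined cut points within the retention loop, as described in the last bullet, would still need something beyond a single loopStart/loopEnd pair.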
Would it be possible to define audio in a very compartmentalized manner? There would be the overall structure of audio that allows an audio source (a separate sub-extension for each media type) and several different emitters (also sub-extensions). There may even be the possibility of filters and mixers between the audio sources. This suggestion is somewhat along the lines of Web Audio (https://www.w3.org/TR/webaudio/).
It would mean that any implementation would need to implement at least one audio source and one audio emitter. A web-based implementation may be able to integrate with the WebAudio API.
I know this is much more complex than the original comments were discussing. Perhaps it would be good to have a special case audio extension (single source type, single non-spatialized emitter, no filtering) and a more general purpose one.
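Purely to illustrate the shape such a layered design could take (every sub-extension name below is hypothetical, invented for this sketch), the core extension might only define the wiring, with concrete sources and emitters supplied by sub-extensions:

{
  "extensions": {
    "KHR_audio_emitter": {
      "sources": [
        { "extensions": { "EXT_audio_source_mp3": { "uri": "music.mp3" } } }
      ],
      "emitters": [
        { "sources": [0], "extensions": { "EXT_audio_emitter_positional": { "distanceModel": "inverse" } } }
      ]
    }
  }
}

A simple viewer would then only need to implement one source type and one emitter type, which lines up with the "special case" vs. "general purpose" split suggested above.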
There are still some outstanding questions on audio formats, MP3 vs WAV, and what features within these formats should be supported.
My personal opinion on this subject:
Basically, you need at least one lossless codec (for any serious professional audio editing) and one lossy codec, mainly designed to minimize the size of the audio data. I suggest FLAC (WAV is a bit outdated) and MP3 (the other variants were more or less developed because of patents that have since expired). Other common formats can be optionally supported (e.g. as in @antpb's comment).
@capnm is FLAC supported in major engines? From my quick research, Unity only supports .aif, .wav, .mp3, and .ogg.
We would be specifying that all engines need to natively support FLAC. While I agree it is a superior file type in a lot of ways, the history of support for WAV across engines and browser implementations feels safer. Implementors get it as a freebie vs. having to get engines like Unity to support FLAC natively.
Some engines are at least already on the way to supporting it (slowly realizing the massive advantages of open standards like glTF ;). You can convert and cache it locally in any format you need with relative ease. I would specify that MP3 support is mandatory, and for the lossless option you can fall back to WAV until the recommended FLAC support is implemented...
@capnm is FLAC supported in major engines? From my quick research, Unity only supports .aif, .wav, .mp3, and .ogg.
We would be specifying that all engines need to natively support FLAC. While I agree it is a superior file type in a lot of ways, the history of support for WAV across engines and browser implementations feels safer. Implementors get it as a freebie vs. having to get engines like Unity to support FLAC natively.
Unity has supported FLAC since version 2020.1.0. It's in the release notes but not in the documentation, even though it does work.