
Consider hooking up sound source nodes in the API somehow

Open cwilso opened this issue 5 years ago • 30 comments

There have been requests to add sound to the scope of the WebXR API. There are two aspects to this. First, we should manage the audio inputs/outputs associated with an XR device; this is already covered by https://github.com/immersive-web/webxr/issues/98. Second, we should enable developers to easily position sound sources in the virtual space, and use an HRTF (Head-Related Transfer Function) or multi-speaker setup to properly "position" the sound.

It is relatively straightforward to use Web Audio's PannerNode to connect a posed sound source to the head pose - in fact, three.js does exactly this with its PositionalAudio object. However, the problem lies in keeping the head pose (and the sound source pose) updated at a high enough frequency - ideally, by letting the audio thread get head pose info directly somehow.
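
For illustration, here is a minimal sketch of that three.js pattern (camera, mesh, and 'sound.ogg' are placeholders for the app's own objects and assets):

// The listener follows the camera (head pose); the sound follows a mesh.
const listener = new THREE.AudioListener();
camera.add(listener); // listener pose now tracks the camera

const sound = new THREE.PositionalAudio(listener);
new THREE.AudioLoader().load('sound.ogg', (buffer) => {
	sound.setBuffer(buffer);
	sound.setRefDistance(1); // distance at which attenuation begins
	sound.play();
});
mesh.add(sound); // sound pose now tracks the mesh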

(Note that I don't consider this high priority today - Issue #98 is more important, and even that is a future enhancement - but I wanted to capture it.)

cwilso avatar Sep 06 '18 18:09 cwilso

Why wouldn't we use existing web APIs (getUserMedia, Web Audio)? How are they lacking, in a way that couldn't be solved by updating them?

As with accessing video, it seems like enhancing existing web APIs, or perhaps somehow creating a binding between them, would be preferable to creating a new, different API. For audio in particular, Web Audio seems pretty good, and if the issue is synchronizing the head pose for spatialization, that seems like something that could be solved with a small additional feature in Web Audio.

blairmacintyre avatar Sep 07 '18 20:09 blairmacintyre

The problem is that you have to manually send the pose data to Web Audio, which carries some overhead.

Ideally, Web Audio would have a mode where it could be told "the position of this panner node is to reflect the head pose," and it would then use realtime head pose data.

Manishearth avatar Sep 09 '18 07:09 Manishearth

@Manishearth hit the nail on the head. We would (presumably) use media streams and Web Audio. Web Audio even already has 3D positioning - and yes, the key missing piece is synchronizing the poses (it's not just the head pose - it's also the pose of each individually placed sound-producing object) - or more to the point, minimizing the latency of keeping those poses updated, and getting them updated in the audio thread on a regular basis.

It's possible this is just advice and best practice for Web Audio; it's possible we'd want a small feature tying Web Audio (PannerNode or a derivative) to an XRSession and some poses. It may not turn into an actual feature in XR - but that all needs to be explored, and this seems like the best place to track it to me.

cwilso avatar Sep 10 '18 01:09 cwilso

@cwilso @Manishearth yes, exactly the approach I was imagining. Getting someone to explore this would be great.

blairmacintyre avatar Sep 11 '18 10:09 blairmacintyre

@kearwood and I discussed this a bit, and he brought up that a nice API for this would be to allow attaching XRSpaces to the AudioListener and PannerNodes; implementors could then internally use the XRSpace reference from the render thread to quickly query (or request push updates for) positional information for the relevant objects.
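
As a purely hypothetical sketch (no xrSpace attribute exists in either spec today; this only illustrates the shape of the idea):

// Hypothetical "set and forget" linkage - neither WebXR nor Web Audio
// defines an xrSpace attribute today.
const viewerSpace = await xrSession.requestReferenceSpace('viewer');
audioCtx.listener.xrSpace = viewerSpace; // listener tracks the head pose

const panner = audioCtx.createPanner();
panner.xrSpace = someObjectSpace; // hypothetical: panner tracks a posed object
// From here on, the implementation would keep both poses updated on the
// audio thread without any per-frame JavaScript.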

Manishearth avatar Jan 30 '19 17:01 Manishearth

It seems that for a V1 of this integration, it may be reasonable to implement a function, called within each XRSession.requestAnimationFrame callback, that explicitly synchronizes pose information across to Web Audio.

This would have the benefit that there is no need to manually copy members across, but it would not be a "set and forget" API that updates positions continuously.

There may need to be some kind of smoothing or interpolation to avoid pops and clicks as tracking state is lost or regained, or when poses and audio are sampled at different rates.
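
As a rough illustration with today's Web Audio API (the 50 ms window is an arbitrary choice), such a synchronization function could ramp each listener AudioParam toward the new pose sample instead of snapping it:

// Ramp an AudioParam toward a new pose sample instead of snapping it,
// to avoid audible discontinuities when tracking jumps.
const RAMP_S = 0.05; // arbitrary smoothing window
function smoothSet(param, value) {
	const now = audioCtx.currentTime;
	param.cancelScheduledValues(now);
	param.setValueAtTime(param.value, now); // anchor the ramp at the current value
	param.linearRampToValueAtTime(value, now + RAMP_S);
}
// e.g. smoothSet(audioCtx.listener.positionX, pose.transform.position.x);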

The focus could be on supporting headphone-based 3D spatialization for the majority of cases, with support for things such as speaker arrays in a CAVE system added later.

Security implications of leaking poses are avoided by requiring state to be explicitly transferred during XRSession.requestAnimationFrame.

kearwood avatar Jan 30 '19 21:01 kearwood

As a page will not receive pose updates while blurred (e.g., while a system dialog is displaying a permission request), content would be required to explicitly duck and/or mute directional audio, which may otherwise be distracting or feel broken once pose updates stop.
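
For example, with today's APIs, content could watch the session's visibilitychange event and duck a shared gain node that all spatialized sources route through (spatialBus is just an illustrative name):

// Duck spatialized audio while the session is not fully visible,
// assuming all positional sources route through one shared gain node.
const spatialBus = audioCtx.createGain();
spatialBus.connect(audioCtx.destination);

xrSession.addEventListener('visibilitychange', () => {
	const ducked = xrSession.visibilityState !== 'visible';
	const now = audioCtx.currentTime;
	spatialBus.gain.cancelScheduledValues(now);
	spatialBus.gain.setValueAtTime(spatialBus.gain.value, now);
	spatialBus.gain.linearRampToValueAtTime(ducked ? 0.1 : 1.0, now + 0.2);
});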

kearwood avatar Jan 30 '19 21:01 kearwood

While it may be interesting to use XR device sensing and world awareness for features such as selecting an appropriate reverb impulse response to match the room shape, this would be a non-goal for v1, as the required security model has not been described.

kearwood avatar Jan 30 '19 21:01 kearwood

(Summing up lunchtime conversation between @kearwood, @Manishearth and myself)

I still think "v1" of audio-in-XR is what Kip described, and it is possible today: developers can implement code within their XRSession.requestAnimationFrame callback to explicitly synchronize the head pose and object poses (expressed in an XRReferenceSpace) to the Web Audio AudioListener and PannerNodes, respectively.

v2 of audio-in-XR is making this connection automatic - we can incubate this as partial interfaces off the Web Audio API's AudioListener and PannerNode interfaces, I expect. The implementation will raise more security and privacy concerns (e.g., Kip pointed out that the blurring and blocking of pose data that happens when prompts are on-screen would have to apply here too).

v3 is probably looking at more advanced world awareness (e.g. reverb based on room), which likely has even more concerning security and privacy implications.

cwilso avatar Jan 30 '19 22:01 cwilso

For V1, I would like to have some language in explainer.md under the "Viewer tracking" header that shows exactly how to connect the Web Audio API to WebXR today. I'm thinking there could be a level-4 heading for visual viewing, and another for auditory viewing. We need to convert WebXR's orientation quaternion into a direction vector in the frame. I think we need to use Rodrigues' rotation formula. I'm not sure if we just take the first item in the views list, and I'm not sure of the exact formula needed. Here is an example, with the location for the conversion code marked as a set of comments, because I'm not exactly sure what needs to be done:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
	// Do we have an active session?
	if (xrSession) {
		let listener = audioCtx.listener;

		let pose = xrFrame.getViewerPose(xrReferenceSpace);
		if (pose) {
			// Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
			scene.updateScene(timestamp, xrFrame);

			const view = pose.views[0];
			// Do something here to get the rotation and position in a direction vector from the view quaternion
			// set all the listener attributes to have a value of the vector.

		}
		// Request the next animation callback
		xrSession.requestAnimationFrame(onDrawFrame);
	}
}

frastlin avatar Nov 19 '19 18:11 frastlin

You don't need any special math for this: provided that you place your panner nodes appropriately based on your xrReferenceSpace, you can just use getViewerPose()'s transform position/orientation directly.

Ideally, though, we should have an API that allows for realtime linkage behind the scenes, where you "set and forget" an XRSpace on the listener node and the updates happen without going through JS.

Manishearth avatar Nov 19 '19 18:11 Manishearth

OK, what xrReferenceSpaces can translate directly to the vector in Web Audio? Also, what is the order of arguments? I would like to put an example in explainer.md that shows how to do this now, because any application with 3D/VR sound will need to use this algorithm.

frastlin avatar Nov 19 '19 19:11 frastlin

OK, what xrReferenceSpaces can translate directly to the vector in Web Audio?

It doesn't matter, as long as its origin is stationary (so, not viewer). Just use local or something. Everything is relative in Web Audio, so as long as all numbers are in the same coordinate space it should be fine. I don't know what you mean by the order of arguments; listener has setPosition and setOrientation methods. Just use those, and place the panner nodes appropriately. There's no algorithm here.

Manishearth avatar Nov 19 '19 19:11 Manishearth

If the values are the same, then the example would look something like this?

const view = pose.views[0];
[ listener.positionX.value, listener.positionY.value, listener.positionZ.value, listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value, listener.upX.value, listener.upY.value, listener.upZ.value ] = view

Will this work? The values in the view are a 4-by-4 matrix (16 values), and here we are looking for 9 values. This is what I mean by order of arguments.

frastlin avatar Nov 19 '19 20:11 frastlin

Just use pose.transform.position and pose.transform.orientation with listener.setPosition() and listener.setOrientation(). In setPosition() make sure to normalize by dividing x, y, and z by w.

Manishearth avatar Nov 19 '19 20:11 Manishearth

Perfect, thank you! So the example would be:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
	// Do we have an active session?
	if (xrSession) {
		let listener = audioCtx.listener;

		let pose = xrFrame.getViewerPose(xrReferenceSpace);
		if (pose) {
			// Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
			scene.updateScene(timestamp, xrFrame);

			// Set the audio listener to face where the XR view is facing
			[ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = pose.transform.orientation;
			// Set w to 1 as stated in the WebXR spec:
			const w = 1;
			// Set the audio listener to travel with the WebXR user position
			[ listener.positionX.value, listener.positionY.value, listener.positionZ.value ] = pose.transform.position.map(p=>p/w);

		}
		// Request the next animation callback
		xrSession.requestAnimationFrame(onDrawFrame);
	}
}

frastlin avatar Nov 19 '19 21:11 frastlin

Oh if w is always 1 you don't need to divide, then.

Manishearth avatar Nov 19 '19 21:11 Manishearth

But yeah, that's correct. You can use setPosition() and setOrientation() to do it atomically; different browsers handle checkpointing differently here.

Manishearth avatar Nov 19 '19 21:11 Manishearth

Those two functions are unfortunately deprecated

frastlin avatar Nov 19 '19 21:11 frastlin

[ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = pose.transform.orientation;

That looks wrong. pose.transform.orientation is a quaternion, which describes a 3D rotation; you can't just take its first three components and assign them to a direction vector. Instead, you'd need to take a forward vector, i.e. (0, 0, -1) assuming -z is forward, and apply the quaternion to it as a rotation operation.

Following https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation#Quaternion-derived_rotation_matrix , the result should be -1 (the Z component of the unrotated forward vector) times the third column of the rotation matrix.

fwd.x = -2 * (q.x * q.z + q.y * q.w);
fwd.y = -2 * (q.y * q.z - q.x * q.w);
fwd.z = 2 * (q.x * q.x + q.y * q.y) - 1;

This is untested and may be the wrong sign or transposed, but that's roughly how it should look, assuming the input quaternion is normalized. If you're using a JS framework, that should provide utility methods for such things.

klausw avatar Nov 19 '19 21:11 klausw

Oh, I didn't realize WebAudio orientations weren't quaternions, my bad

Manishearth avatar Nov 19 '19 21:11 Manishearth

If you don't want to deal with quaternions, using the matrix representation may be more useful. See https://immersive-web.github.io/webxr/#matrices for details.

The pose matrix's top left 3x3 elements provide unit column vectors in base space for the posed coordinate system's x/y/z axis directions, so you could use the negative of the third column directly as a forward vector corresponding to the -z direction:

let m = pose.transform.matrix;
let fwd = {x: -m[8], y: -m[9], z: -m[10]};

klausw avatar Nov 19 '19 21:11 klausw

So this would be the actual example:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
	// Do we have an active session?
	if (xrSession) {
		let listener = audioCtx.listener;

		let pose = xrFrame.getViewerPose(xrReferenceSpace);
		if (pose) {
			// Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
			scene.updateScene(timestamp, xrFrame);

			// Set the audio listener to face where the XR view is facing.
			// Convert the orientation to a forward vector: the top-left 3x3 of
			// pose.transform.matrix holds unit column vectors in base space for
			// the posed coordinate system's x/y/z axes, so the negative of the
			// third column is a forward vector for the -z direction.
			const m = pose.transform.matrix;
			[ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = [ -m[8], -m[9], -m[10] ];
			// Set the audio listener to travel with the WebXR user position
			const p = pose.transform.position;
			[ listener.positionX.value, listener.positionY.value, listener.positionZ.value ] = [ p.x, p.y, p.z ];

		}
		// Request the next animation callback
		xrSession.requestAnimationFrame(onDrawFrame);
	}
}

frastlin avatar Nov 19 '19 22:11 frastlin

I think you also need to set the listener "up" vector. Assuming you're using the usual convention that +Y is up, you can use the matrix's Y unit vector for that: (m[4], m[5], m[6])

Just for completeness, you could use (m[12], m[13], m[14]) for the position; it's the posed space's origin position in the base coordinate system. That should equal pose.transform.position, but it's an alternative if you don't want to mix matrix and decomposed values in a single snippet.

klausw avatar Nov 19 '19 22:11 klausw

OK, this looks as if it is pretty close to being an example we can put in explainer.md:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
	// Do we have an active session?
	if (xrSession) {
		let listener = audioCtx.listener;

		let pose = xrFrame.getViewerPose(xrReferenceSpace);
		if (pose) {
			// Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
			scene.updateScene(timestamp, xrFrame);

			// Set the audio listener to face where the XR view is facing.
			// Convert the orientation to a forward vector: the top-left 3x3 of
			// pose.transform.matrix holds unit column vectors in base space for
			// the posed coordinate system's x/y/z axes, so the negative of the
			// third column is a forward vector for the -z direction.
			// (pose.transform.orientation is a quaternion, not a forward
			// vector, so it is not used directly with Web Audio.)
			const m = pose.transform.matrix;
			// Set the forward-facing direction
			[ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = [ -m[8], -m[9], -m[10] ];
			// Set the "up" direction of the listener's head (the matrix's y-axis column)
			[ listener.upX.value, listener.upY.value, listener.upZ.value ] = [ m[4], m[5], m[6] ];
			// Set the audio listener to travel with the WebXR user position
			// (note that pose.transform.position equals [m[12], m[13], m[14]])
			[ listener.positionX.value, listener.positionY.value, listener.positionZ.value ] = [ m[12], m[13], m[14] ];

		}
		// Request the next animation callback
		xrSession.requestAnimationFrame(onDrawFrame);
	}
}

frastlin avatar Nov 19 '19 22:11 frastlin

OK, so the above example works for all the XRReferenceSpaces except for the basic "viewer". To make viewer work, we would just need to remove the position-setting code, as the position never moves with viewer. Will the pose matrix values be 0 in viewer? Or will the example need to check which XRReferenceSpace is being used? What caveats are there to setting the [0, 0, 0] listener pose to the native origin of WebXR?

frastlin avatar Nov 19 '19 23:11 frastlin

Why do you want to use the viewer reference space? The whole point is to use a reference space whose origin is stationary, which is roughly true for all of them except "viewer". If your reference space isn't stationary, you will have to keep updating the panner node coordinates to work in that space.
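
For example, placing a panner for a sound-emitting object in the same stationary space used for getViewerPose() looks roughly like this (source and objectPosition are placeholders for whatever the app tracks):

// Place a panner at a sound-emitting object's position, expressed in the
// same stationary reference space (e.g. 'local') used for getViewerPose().
const panner = audioCtx.createPanner();
panner.panningModel = 'HRTF';
source.connect(panner).connect(audioCtx.destination);

function updatePanner(objectPosition) { // {x, y, z} in the reference space
	panner.positionX.value = objectPosition.x;
	panner.positionY.value = objectPosition.y;
	panner.positionZ.value = objectPosition.z;
}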

Manishearth avatar Nov 19 '19 23:11 Manishearth

I'm wondering if there needs to be a separate example for the viewer mode, or if the above will work for viewer as well.

frastlin avatar Nov 20 '19 00:11 frastlin

What do you mean by "viewer mode"? Which reference space you pick is irrelevant as long as it's roughly stationary; in all of these cases the code will have the same result, provided you pick appropriate coordinates for all the panner nodes. In all of these cases the listener will be positioned where the viewer is, because you're using getViewerPose().

The "viewer" reference space isn't stationary - it follows the viewer - and getViewerPose(viewerSpace) usually returns constant values, making it useless for this.

Manishearth avatar Nov 20 '19 02:11 Manishearth

I submitted a PR with the example to explainer.md, please edit and comment: https://github.com/immersive-web/webxr/pull/930

frastlin avatar Nov 21 '19 17:11 frastlin