
XRCapture Module

Open alcooper91 opened this issue 2 years ago • 44 comments

I'd like to propose a module to allow for recording the currently rendered contents of WebXR sessions. I realize this has previously been discussed in Issue #36 , but it seems like secondary-views would not capture all of the scene, and further that it would omit capture of things such as DOM Overlay or XRLayers. (Please correct me if I'm wrong?)

I've drafted an explainer, which can be found here.

cc: @cabanier @toji

/agenda in the hopes of discussing this as well.

alcooper91 avatar Oct 08 '21 23:10 alcooper91

/tpac discuss XRCapture Module

cabanier avatar Oct 09 '21 02:10 cabanier

Great TPAC topic!

AdaRoseCannon avatar Oct 09 '21 11:10 AdaRoseCannon

I think such a thing would be very useful; akin to how Hololens allows an image capture to take place that synthesizes a view from the forward camera and graphics.

~~Big privacy implications, so it would need to be gated similarly to camera access.~~ (see @alcooper91's comment below)

blairmacintyre avatar Oct 13 '21 11:10 blairmacintyre

It's worth noting, from a privacy perspective, that the API I am proposing would capture an image/recording and save it to disk. The site would not get access to that image/recording unless the user then later chose to upload it to the site. This is similar to if the user just invoked the native runtime mechanisms to take that image, but these aren't always easily accessible when in a Session.

alcooper91 avatar Oct 13 '21 15:10 alcooper91

Discussion on this should continue in the next regular call, but may be worth having some discussions here in the meantime.

Summary of key opinions/statements from the TPAC call, apologies in advance if I forget or misrepresent something:

  • @nbutko had a preference for being able to get access to MediaStreams/MediaTracks for mixing the recording (e.g. adding watermarks, custom audio tracks, and encoding to non-webm format)
  • @AdaRoseCannon had slight opposition to the WebShare integration, instead preferring using an opaque id. (I'll note that personally, I feel that the WebShare API integration provides an ergonomics benefit to users/developers, but that given there are existing ways of doing this and other proposals that would smooth this route don't feel too strongly about the benefit)
  • @cabanier seemed to vote in opposition to working on this, had previously asked about whether this could provide hinting for summoning the system UI.

alcooper91 avatar Oct 14 '21 21:10 alcooper91

On both Hololens and Quest, users are already familiar with how they can record sessions. This introduces another way to do the same which might conflict.

I would be more in favor of an API that brings up an OS or UA dialog that gives users the ability to record their session.

cabanier avatar Oct 14 '21 21:10 cabanier

@nbutko had a preference for being able to get access to MediaStreams/MediaTracks for mixing the recording (e.g. adding watermarks, custom audio tracks, and encoding to non-webm format)

It's also worth mentioning that VR headsets would not require additional permissions due to the camera feed.

On both Hololens and Quest, users are already familiar with how they can record sessions.

There are also system level mechanisms for recording screens on iOS and Android. However, we have found that these lack ergonomics around discovery, convenience, and sharing and don't substitute for in-app recording flows. Additionally, watermarks, audio mixing and transcoding are compelling use cases for our current WebAR customers.

nbutko avatar Oct 14 '21 21:10 nbutko

On both Hololens and Quest, users are already familiar with how they can record sessions.

There are also system level mechanisms for recording screens on iOS and Android. However, we have found that these lack ergonomics around discovery, convenience, and sharing and don't substitute for in-app recording flows. Additionally, watermarks, audio mixing and transcoding are compelling use cases for our current WebAR customers.

I'm doubtful that we can run a WebXR session at acceptable performance if we do "watermarks, audio mixing and transcoding" at the same time. UAs can certainly bring up their own dialog on platforms where recording is unsupported or poorly supported, but otherwise the system mechanism should be preferred.

cabanier avatar Oct 14 '21 21:10 cabanier

One of the stated goals is also to be able to capture DOMOverlays and XRLayers; where DOM overlays could theoretically embed iFrames and thus there would still be some privacy restrictions. Given this, I think there still needs to be some opaque way of capturing a recording.

I believe that requiring encoding in a more portable format (mp4) should mitigate some of the need for transcoding; and it would seem that it should be possible to draw a watermark while the image is being recorded (if you know that such a recording is happening?)

Apart from ease of use/discoverability, I think that this mechanism can also provide the ability to initiate the captures without forcing the user almost out of the session (as would be required on handheld or fullscreen devices), as well as providing the page hints that the image capture/recording is done so that it can prompt if the user would like to upload/share the recording.

Absent an API like this, I don't know that UAs have a mechanism to bring up a dialog to do such a recording.

alcooper91 avatar Oct 14 '21 21:10 alcooper91

One of the stated goals is also to be able to capture DOMOverlays and XRLayers; where DOM overlays could theoretically embed iFrames and thus there would still be some privacy restrictions. Given this, I think there still needs to be some opaque way of capturing a recording.

Capturing the session with DOM overlay and layers by the OS should be safe, since no third parties can have access to them.

I believe that requiring encoding in a more portable format (mp4) should mitigate some of the need for transcoding; and it would seem that it should be possible to draw a watermark while the image is being recorded (if you know that such a recording is happening?)

Recording type and quality should be decided by the UA. Why would you want to watermark the output? Is it so the site can restrict distribution?

Apart from ease of use/discoverability, I think that this mechanism can also provide the ability to initiate the captures without forcing the user almost out of the session (as would be required on handheld or fullscreen devices), as well as providing the page hints that the image capture/recording is done so that it can prompt if the user would like to upload/share the recording.

Absent an API like this, I don't know that UAs have a mechanism to bring up a dialog to do such a recording.

If on Quest, you hit the oculus button during an immersive session, you get the option to record it without being thrown back to 2D. I assume Hololens has a similar mechanism; do you know @fordacious @RafaelCintron?

cabanier avatar Oct 14 '21 21:10 cabanier

One of the stated goals is also to be able to capture DOMOverlays and XRLayers; where DOM overlays could theoretically embed iFrames and thus there would still be some privacy restrictions. Given this, I think there still needs to be some opaque way of capturing a recording.

Capturing the session with DOM overlay and layers by the OS should be safe, since no third parties can have access to them.

To clarify, this was a point against MediaTracks/Streams

Apart from ease of use/discoverability, I think that this mechanism can also provide the ability to initiate the captures without forcing the user almost out of the session (as would be required on handheld or fullscreen devices), as well as providing the page hints that the image capture/recording is done so that it can prompt if the user would like to upload/share the recording. Absent an API like this, I don't know that UAs have a mechanism to bring up a dialog to do such a recording.

If on Quest, you hit the oculus button during an immersive session, you get the option to record it without being thrown back to 2D. I assume Hololens has a similar mechanism; do you know @fordacious @RafaelCintron?

Right, I'm thinking about mobile AR scenarios that don't have easy things like this, plus having a lower-friction surface to do so available (e.g. one or two taps, rather than summoning a menu). The native functionality that I'm comparing this with in SceneViewer allows the app to simply have a button that takes a picture/video with no prompt.

alcooper91 avatar Oct 14 '21 21:10 alcooper91

Given this, I think there still needs to be some opaque way of capturing a recording.

Canvas taint provides a good existing model for this.
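
For readers unfamiliar with the canvas model referenced here: a canvas starts out "origin-clean", drawing cross-origin content without CORS approval clears that flag, and pixel readback then fails. The following is a toy model of that idea only, not a browser API; the class name, method names, and the plain-string origin comparison are all assumptions made for illustration.

```typescript
// Toy model of canvas-style tainting. TaintTrackingSurface, draw, and
// readPixels are hypothetical names, not part of any browser or WebXR API.
class TaintTrackingSurface {
  private originClean = true;
  private readonly pageOrigin: string;

  constructor(pageOrigin: string) {
    this.pageOrigin = pageOrigin;
  }

  // Drawing content from another origin taints the surface permanently.
  draw(sourceOrigin: string): void {
    if (sourceOrigin !== this.pageOrigin) {
      this.originClean = false;
    }
  }

  // Readback is only allowed while the surface is still origin-clean.
  readPixels(): string {
    if (!this.originClean) {
      throw new Error("SecurityError: surface is tainted");
    }
    return "pixels";
  }
}
```

Applied to capture, the analogous behaviour would be that a recording containing cross-origin DOM content could still be saved by the UA, but never read back by the page.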

nbutko avatar Oct 14 '21 22:10 nbutko

If on Quest, you hit the oculus button during an immersive session, you get the option to record it without being thrown back to 2D. I assume Hololens has a similar mechanism; do you know @fordacious @RafaelCintron?

Right, I'm thinking about mobile AR scenarios that don't have easy things like this, plus having a lower-friction surface to do so available (e.g. one or two taps, rather than summoning a menu). The native functionality that I'm comparing this with in SceneViewer allows the app to simply have a button that takes a picture/video with no prompt.

We could have an API that on Quest/Hololens opens the system menu but on AR devices, it brings up a confirmation prompt or dialog rendered by the UA (or no dialog at all in case of a picture). It would be confusing to our users to have 2 separate ways to do screen recordings which would negate the lower-friction part.

cabanier avatar Oct 14 '21 22:10 cabanier

I don't think it would be confusing to users to have 2 separate ways to do screen recordings. I think there are plenty of examples of apps that do their own camera integration (e.g. Snapchat and even Facebook Messenger provide ways to change up your camera feed with a "take picture" button), and the capture API for HoloLens allows developers to write this custom kind of capture experience as well: https://docs.microsoft.com/en-us/windows/mixed-reality/develop/platform-capabilities-and-apis/mixed-reality-capture-for-developers#integrating-mrc-functionality-from-within-your-app

I think it would be more confusing to developers to have two different APIs to initiate capture, one which summons a system UI (if such a thing is even available/present), and one that would go through a UA prompt with the implementation being based on what the runtime supports. The UA can still choose to invoke their system UI when capture is requested, but I think there are some potential issues from a developer expectation POV if the developer only requested a screenshot and the user then changes to a video, and the developer doesn't have a way to stop the video.

alcooper91 avatar Oct 14 '21 22:10 alcooper91

We can't really control apps that do their own thing. They are free to record the screen and use it however they want.

I think it would be more confusing to developers to have two different APIs to initiate capture

I'm not proposing that there are 2 APIs. I want 1 API that invokes the system capabilities if they are available or that invokes a UA dialog (if needed) if there are none.

I think there are some potential issues from a developer expectation POV if the developer only requested a screenshot and the user then changes to a video, and the developer doesn't have a way to stop the video.

Why would it not be OK for the user to record a video if they choose to do so? Are you envisioning that the experience changes if it detects that it's being recorded?

the developer doesn't have a way to stop the video.

We could provide an API to stop recording if there's a reason for the experience to have control over that.

cabanier avatar Oct 15 '21 02:10 cabanier

We can't really control apps that do their own thing. They are free to record the screen and use it however they want.

I'm not sure I understand what you're saying here? I was pointing to those apps as examples of things that expose separate ways to invoke screenshots as cases where an API like this wouldn't be out of place in allowing pages to build their own recording experience.

I'm not opposed to the UA showing the system API if that's the UA's choice; but I don't want to change the expectation of the app on if it's taking a screenshot or a video out from under them.

Recording is inherently more expensive than a simple screenshot, so if the app knows it's being recorded, it may want to scale back the quality of some of the models or intentionally throttle its own frame rate; perhaps there are also animations that it would want to sync up to the start of a recording. Further, the app may want to, once it knows the capture has finished, switch/enable a "share" button (whether that invokes the file picker->Web Share API or a method on the capture object is a separate point of debate). Most significantly, I'd already proposed the app having control over the "stopRecording" function so that they could have their own UI around that, styled to be consistent with their experience; if an app is expecting (because it only ever requests) a screenshot, but then the user has started a recording, it may not have a stop recording button, and forcing such a requirement feels like it would make the API less attractive to developers, and add a further burden on UAs to add some form of "stop recording" button, which could potentially muddy up the UX by adding extra interaction points. (UAs are still free to programmatically terminate the recording if they feel a site is abusing the recording length.)

alcooper91 avatar Oct 15 '21 16:10 alcooper91

/agenda I believe part of the TPAC followup was to discuss this in a call.

alcooper91 avatar Nov 02 '21 16:11 alcooper91

We discussed this in today's WG meeting but I didn't feel like we got closer to a consensus. To clarify, these are the features that I believe the API should have:

  • Have a simple API to create a screenshot or start/stop a screen recording.
  • Allow for the UA to decide on quality, encoding, image type, etc.
  • Have well defined behavior if the system is already recording and account for other edge cases.

The API should NOT:

  • present a UA drawn configuration dialog box.
  • enforce that the recording is made by the UA and not the system
  • mandate that only UA drawn UI is recorded (ie no system notifications, guardian, etc).
  • provide a way for the session to detect that it is recorded.

I'm also a bit uneasy about your suggestion that the user can immediately share the recorded session. In case of Android and Hololens, that would show the user's environment and it seems that there should be some type of warning before that goes out.

cabanier avatar Nov 17 '21 05:11 cabanier

We are advocating for consistency of experience across Desktop 3D (canvas, no camera feed), Mobile WebAR (canvas, camera feed), and headset (WebXR). Particularly, we would like to take existing flows and allow for them in headset sessions.

In the current flow,

  • User grants permission for camera feed (they already do this to get WebAR)
  • User potentially grants permission for audio
  • User taps a dom-based button to capture a photo or holds the button to record.
  • The contents of the canvas are captured (no other DOM).
  • Content can be overlaid on top of the captured canvas data (watermarks, borders / frames, etc.), and other effects can be introduced, such as fading to an end card with a CTA embedded directly in the video.
  • The user can preview the image or video that was captured.
  • The user can share the image or video that was captured.

We do not require capturing the full DOM as a requirement. In fact, including the record button in the video is undesirable.

In a headset session, it's possible that this means only recording from one layer.

nbutko avatar Nov 17 '21 18:11 nbutko

I think we're mostly in agreement on the items you mentioned with two points that I'd like to clarify.

I don't think that

The API should NOT:

  • present a UA drawn configuration dialog box.

should be an explicit requirement (unless you meant mandate instead of present), as I think it should be a UA choice whether they invoke system UI or their own dialog box (e.g. in line with not enforcing whether the UA or the system does the recording, a given platform may not have a system recording mechanism, and so the implementation may be done by the UA).

I'm fine with the requirement that we:

Allow for the UA to decide on quality, encoding, image type, etc.

with the caveat I mentioned today, where I think the broader type of recording (e.g. photo vs video) shouldn't be changed out from under the page, as otherwise the page may not be showing proper UI to stop the recording. The UA/System likely does still need to provide a mechanism to stop the recording though, to prevent abuse by sites that could simply not call a "stop" button.

It does push a little bit of additional burden onto the developers and I'm not sure how I feel about it, but if needing to change the type was a critical path, we could modify the return type to indicate which type of recording was started so that the page could respond appropriately, and still provide a good user experience. (I think essentially this amounts to collapsing my proposed XRCapture/XRVideoCapture interfaces and adding an enum).
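
To make the "collapse the interfaces and add an enum" idea concrete, here is a minimal sketch of what a single capture handle tagged with its kind might look like. Every name below (`CaptureKind`, `CaptureHandle`, `startCapture`, `uiFor`) is hypothetical and does not come from the explainer or any spec; `startCapture` is only a stand-in for a UA that may start a different kind of capture than the page requested.

```typescript
// Hypothetical shape only; none of these names come from the explainer.
type CaptureKind = "photo" | "video";

interface CaptureHandle {
  readonly kind: CaptureKind; // the kind of capture that was actually started
  isActive(): boolean;        // true only while a video is still recording
  stop(): void;               // ends a video recording; no-op for photos
}

// Stand-in for the UA: if the user picked a different kind in system UI,
// the returned handle reports that kind instead of the requested one.
function startCapture(requested: CaptureKind, userChoice?: CaptureKind): CaptureHandle {
  const kind: CaptureKind = userChoice ?? requested;
  let active = kind === "video";
  return {
    kind,
    isActive: () => active,
    stop: () => {
      active = false;
    },
  };
}

// Page-side reaction: show UI that matches what was actually started.
function uiFor(handle: CaptureHandle): string {
  switch (handle.kind) {
    case "photo":
      return "show-share-button";
    case "video":
      return "show-stop-button";
  }
}
```

With a shape like this, a page that only ever requests photos would still need a code path for `"video"`, which is the extra developer burden mentioned above.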

As far as immediately sharing the recorded session, I think it's more accurate to say that it allows a kick-off of the WebShare API, which doesn't allow the site to influence the share targets, and is essentially the same as invoking the native "Share" functionality on an object. However, with that being said, that proposed integration is more for developer convenience than anything else, and is something I'd like to explore further once we have agreement that speccing an XRCapture API is something that we should move forward with.

For Nick, if you don't require capturing DOM, then your use case should be met once UAs are able to enable Raw Camera Access (albeit as Rik mentioned yesterday, this may not be the most performant); one of the key requirements that I've been targeting with this API is that there are other developers for whom capturing their DOM Overlay elements is a requirement.

alcooper91 avatar Nov 17 '21 18:11 alcooper91

In a headset session, it's possible that this means only recording from one layer.

Headsets are the only devices that implement layers at this point, but there's no reason to expect that they'll always be unavailable to mobile AR. True, they don't provide as many benefits in that environment, but normalizing the API across form factors where possible is a good goal.

toji avatar Nov 17 '21 18:11 toji

Headsets are the only devices that implement layers at this point, but there's no reason to expect that they'll always be unavailable to mobile AR. True, they don't provide as many benefits in that environment, but normalizing the API across form factors where possible is a good goal.

Perhaps I should have written: In sessions with multiple layers, perhaps this means only recording from one layer.

as Rik mentioned yesterday, this may not be the most performant

Performance is a key requirement here. If there is a performant way to capture a single layer, it would go a long way.

nbutko avatar Nov 17 '21 18:11 nbutko

As an example of what's currently supported and expected by developers on the web, this is the 8th Wall Media Recorder API, which is widely used by 8th Wall's developers:

https://www.8thwall.com/docs/web/#xr8mediarecorder

nbutko avatar Nov 17 '21 18:11 nbutko

Here's an example of that API in action: https://www.8thwall.com/alivenow/freefire

You can take a photo or record a video. Recorded videos include overlaid 2D UI elements and a custom end card.

Example recording:

https://user-images.githubusercontent.com/25936010/142264116-00fb9896-0c89-4b77-821c-ab780d886fbb.mp4

nbutko avatar Nov 17 '21 18:11 nbutko

There are other developers for whom capturing their DOM Overlay elements is a requirement.

I would try to assess what the true product requirement is here -- is it truly to represent all DOM elements including passwords, credit card numbers in stripe iframes and other sensitive fields? Or is it a mechanism for mindfully injecting specific 2D content on top of the video? I would expect the latter, since this is the common use case we see, and handle.

nbutko avatar Nov 17 '21 18:11 nbutko

In a headset session, it's possible that this means only recording from one layer.

Headsets are the only devices that implement layers at this point, but there's no reason to expect that they'll always be unavailable to mobile AR. True, they don't provide as many benefits in that environment, but normalizing the API across form factors where possible is a good goal.

Users would expect that a capture will show the entire scene. If the author used a media layer for video and an equirect or cube layer, it would be strange if those weren't recorded.

cabanier avatar Nov 18 '21 17:11 cabanier

The API should NOT:

  • present a UA drawn configuration dialog box.

should be an explicit requirement (unless you meant mandate instead of present), as I think it should be a UA choice whether they invoke system UI or their own dialog box.

Yes, the UA can choose to show a dialog when record or capture is called. What I meant was that the API shouldn't mandate a method that invokes a dialog and returns a list of options to the author that are used later for capturing.

Capturing must definitely show something to the user to make sure that their experience isn't recorded secretly.

I'm fine with the requirement that we:

Allow for the UA to decide on quality, encoding, image type, etc.

with the caveat I mentioned today, where I think the broader type of recording (e.g. photo vs video) shouldn't be changed out from under the page, as otherwise the page may not be showing proper UI to stop the recording. The UA/System likely does still need to provide a mechanism to stop the recording though, to prevent abuse by sites that could simply not call a "stop" button.

We also need to consider what should happen if the system was already making a recording. The site should not be able to turn that off or detect it.

It does push a little bit of additional burden onto the developers and I'm not sure how I feel about it, but if needing to change the type was a critical path, we could modify the return type to indicate which type of recording was started so that the page could respond appropriately, and still provide a good user experience. (I think essentially this amounts to collapsing my proposed XRCapture/XRVideoCapture interfaces and adding an enum).

That would cover the case if the UA asks the system to do a screen grab but the user backs out of that option and elects to record instead. I'm leaning towards marking such a thing as a failure instead of asking the page to react to it.

As far as immediately sharing the recorded session, I think it's more accurate to say that it allows a kick-off of the WebShare API, which doesn't allow the site to influence the share targets, and is essentially the same as invoking the native "Share" functionality on an object. However, with that being said, that proposed integration is more for developer convenience than anything else, and is something I'd like to explore further once we have agreement that speccing an XRCapture API is something that we should move forward with.

OK, that's reasonable.

cabanier avatar Nov 18 '21 19:11 cabanier

We also need to consider what should happen if the system was already making a recording. The site should not be able to turn that off or detect it.

Similar to the issue you later mention:

That would cover the case if the UA asks the system to do a screen grab but the user backs out of that option and elects to record instead. I'm leaning towards marking such a thing as a failure instead of asking the page to react to it.

The user could also back out of granting permission to take the capture or recording, so I think all of those cases (capture type changed, user declined permission when prompted, capture is ongoing) should be reported to the page as the same type of failure. The page will then know that a capture did not start, but not necessarily why it failed to start.
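
A rough sketch of that "one indistinguishable failure" idea, under the assumption that the UA tracks the real reason internally and deliberately discards it before anything reaches the page. `InternalReason`, `CaptureFailedError`, and `toPageError` are hypothetical names for illustration only.

```typescript
// Reasons the UA might track internally; hypothetical names.
type InternalReason = "type-changed" | "permission-denied" | "already-capturing";

// The only error the page ever sees. It carries no reason field on purpose.
class CaptureFailedError extends Error {
  constructor() {
    super("Capture did not start");
    this.name = "CaptureFailedError";
  }
}

// Collapses every internal reason into the same page-visible error, so the
// page cannot tell a declined prompt apart from an ongoing system recording.
function toPageError(_reason: InternalReason): CaptureFailedError {
  return new CaptureFailedError();
}
```

A real implementation would also need to avoid timing differences between the cases, which this sketch does not address.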

We are advocating for consistency of experience across Desktop 3D (canvas, no camera feed), Mobile WebAR (canvas, camera feed), and headset (WebXR).

Any such WebXR capture API would be available across the supported session types (albeit there's likely some runtime implementation delta), so I don't think that needs to be a concern here; unless you aren't intending to use inline sessions for the Desktop 3D case, which opens a whole different can of worms, as any such API to unify those recordings would be out of scope of the Immersive Web Group (likely falling under WebRTC), and similar APIs that give access to the streams that you can manipulate have, as I understand it, met with pushback from various browser vendors in that group.

I would try to assess what the true product requirement is here -- is it truly to represent all DOM elements including passwords, credit card numbers in stripe iframes and other sensitive fields? Or is it a mechanism for mindfully injecting specific 2D content on top of the video? I would expect the latter, since this is the common use case we see, and handle.

@elalish is the primary user I've spoken to that cares about capturing DOM elements; specifically (IIUC), he is interested in capturing the DOM elements that have been incorporated into the scene (which I believe often includes hotspots/annotations for models); but does not care if the site that is hosting the content actually ever gets access to them. I think an API to do this (which has parity with native features/capturers), is fundamentally different than the features that you want, since you want more access to manipulate the recording, but don't care about an increased user friction to do so.

As I've stated before, if your use case does not involve capturing the DOM, and requires accessing streams that allow you to manipulate the recording directly, all of that is (or will be once Raw Camera Access ships) available to you today, since the camera feed is the only content that you don't control. Accessing a stream that contains the DOM Content requires a much higher privacy bar; quite honestly, I haven't yet been able to devise suitable mitigations for Android to allow shipping getDisplayMedia (relevant chrome bug), and we'd need similar mitigations for this API.

@cabanier, is exposing a capture mode that would expose the raw streams (barring privacy mitigations), something that it would be technically feasible for you all to expose either? It sounds like your thoughts for implementing this API would be to hook directly into your system-level capturers, which would have similar privacy concerns since system UI and similar would be exposed, and you likely couldn't implement a mode that would strip out any DOM content?

If we set aside the potential privacy issues a moment and say that we are able to come up with a satisfactory solution that would grant you access to streams, I think such an enhancement would be possible to be plugged into the API shape I propose, but even still, given that a mechanism to do what you want exists today, we'd likely still want and prioritize a more privacy-preserving API that has less user and developer friction before implementing such an enhancement.

alcooper91 avatar Nov 19 '21 00:11 alcooper91

Indeed, I've been pushing for this feature because it's the last major gap between what WebXR can do and what SceneViewer (the native app) can for AR (try it here). One of their most-used features is the record button, which people seem to use a lot for making silly pictures/videos of people next to AR renders. Still, it's nice for commerce too (send your partner a picture of the sofa you're considering placed in your living room). However, it's a rather jarring experience if what you see is not what you get. Consider that we use DOM elements for things like showing dimensions; it's a surprising screen recording if it's not actually recording the whole screen. The smoothness of the flow is also key; WebXR permissions are a huge block already, so the last thing we need is to make the experience even bumpier.

elalish avatar Nov 19 '21 01:11 elalish

Currently recording with 8th Wall's recorder allows overlay of 2D content by drawing to a foreground canvas (usually with a 2D context) which gets composited over the 3d canvas. This provides a lot of flexibility and sounds like it could be used effectively to annotate hotspots, etc. without the inherent security risks of fully general dom capture.
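
To illustrate what "composited over" means at the pixel level, here is a small sketch of standard source-over blending of a foreground RGBA buffer onto a background RGBA buffer (straight, non-premultiplied alpha). This is not 8th Wall's implementation, and `compositeOver` is a made-up helper name; in a browser the same result would normally come from drawing one canvas onto another.

```typescript
// Source-over compositing of two same-sized RGBA buffers (straight alpha).
// compositeOver is a hypothetical helper for illustration only.
function compositeOver(fg: Uint8ClampedArray, bg: Uint8ClampedArray): Uint8ClampedArray {
  if (fg.length !== bg.length || fg.length % 4 !== 0) {
    throw new Error("buffers must be RGBA and the same size");
  }
  const out = new Uint8ClampedArray(bg.length);
  for (let i = 0; i < bg.length; i += 4) {
    const fa = fg[i + 3] / 255;    // foreground alpha in [0, 1]
    const ba = bg[i + 3] / 255;    // background alpha in [0, 1]
    const oa = fa + ba * (1 - fa); // resulting alpha
    for (let c = 0; c < 3; c++) {
      // Weighted blend of each color channel; fully transparent output stays 0.
      out[i + c] = oa === 0 ? 0 : (fg[i + c] * fa + bg[i + c] * ba * (1 - fa)) / oa;
    }
    out[i + 3] = oa * 255;
  }
  return out;
}
```

Because the page draws the foreground content itself, only pixels it already controls end up in the recording, which is the property being contrasted with general DOM capture.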

nbutko avatar Nov 19 '21 01:11 nbutko