
Detecting if an `XRInputSource` is an auxiliary or a primary input source

Open hybridherbst opened this issue 5 months ago • 50 comments

The spec just states the definitions of auxiliary and primary input sources:

An XR input source is a primary input source if it supports a primary action. An XR input source is an auxiliary input source if it does not support a primary action

but it does not provide a mechanism for applications to query if an XRInputSource does support a primary action.

Is there such a mechanism, and if not, what is the recommended approach
for applications to distinguish between auxiliary and primary input sources?

Use case description:

  • hand tracking on Quest OS does support select events, so hands are a "primary input source" there.
  • hand tracking on Vision OS does not support select events, so hands are an "auxiliary input source" there.
  • we can emit wrapped events on Vision OS based on thumb-index distance, but then we risk sending duplicated events on Quest OS (both the wrapped event and then the system event).

Potential workaround:

  • treat all sources as auxiliary; these potentially emit wrapper events
  • once a source has received a selectstart or squeezestart event, mark it as primary and stop emitting wrapper events. While this would mostly work, it still risks sending duplicate events the first time.
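For illustration, the workaround described above could be sketched as follows. This is a hypothetical app-side helper, not anything defined by the spec; `selectstart`/`squeezestart` are the standard WebXR session event names:

```javascript
// Hypothetical tracker for the workaround: treat every input source as
// auxiliary (and emit wrapped selection events for it) until it fires a
// real primary action, then mark it primary and stop wrapping.
class PrimaryActionTracker {
  constructor() {
    this.primarySources = new Set();
  }

  // Call this from the session's 'selectstart' / 'squeezestart' handlers
  // with event.inputSource.
  onPrimaryAction(inputSource) {
    this.primarySources.add(inputSource);
  }

  // Should the app still emit its own wrapped selection events
  // (e.g. based on thumb-index distance) for this source?
  shouldEmitWrappedEvents(inputSource) {
    return !this.primarySources.has(inputSource);
  }
}
```

Note the caveat from the thread still applies: the first real primary action can race the wrapped event, so one duplicate may slip through per source.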

hybridherbst avatar Feb 05 '24 09:02 hybridherbst

I believe the intent of that definition is as an internal spec convenience.

I'm not really convinced the use case of "we wish to wrap and supplement device events" is something designed to be supported in this regard, and using the notion of primary input devices to do so feels brittle. The proposal here solves the problem for these devices specifically, not in general.

I think wrapping until you know not to seems like an okay call to make.

I also think this can be solved in the Hands API via profiles: it does seem to make sense to expose "primary input capable hand" vs otherwise as a difference in the profiles string.

Unfortunately the current default hands profile is "generic-hands-select", which seems to imply a primary input action, not sure if we should change the default or do something else.
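If the profiles string did encode this distinction, an app-side heuristic might look like the sketch below. It assumes the profile names mentioned in this thread (`generic-hand-select` vs `generic-hand`) and treats the `-select` suffix as an indicator of a primary action, which is a naming convention, not specified behavior:

```javascript
// Sketch of a profiles-based heuristic: assume a hand input source that
// advertises a '*-hand-select' profile supports a primary (select) action.
// This is NOT specified behavior; it relies on the profile naming
// convention discussed in this thread.
function handLikelySupportsSelect(inputSource) {
  return inputSource.profiles.some(
    (p) => p === 'generic-hand-select' || p.endsWith('-hand-select')
  );
}
```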

Manishearth avatar Feb 05 '24 10:02 Manishearth

Thanks for the comment. With "wrapping" I don't mean "pretending this is a WebXR event" – I just mean: applications need to detect "hand selection" and that needs to work independent of whether the XRInputSource hand has a select event or not.

So to summarize:

  • there's no current mechanism to distinguish between these
    • maybe in the future with more diverse hand input profiles
  • we will have to live with the "double events"

If I were to add this to the spec, would this be valid wording:
"Input sources should be treated as auxiliary until the first primary action has happened, then they should be treated as primary."

hybridherbst avatar Feb 05 '24 10:02 hybridherbst

"Input sources should be treated as auxiliary until the first primary action has happened, then they should be treated as primary."

No, I don't think that's accurate. That is an engineering decision based on a specific use case and does not belong in the standard.

applications need to detect "hand selection" and that needs to work independent of whether the XRInputSource hand has a select event or not.

I guess part of my position is that platforms like Vision should expose a select event if that is part of the OS behavior around hands. It's not conformant of them to not have any primary input sources whatsoever: devices with input sources are required to have at least one primary one.

There's little point attempting to address nonconformance with more spec work.

There's a valid angle for devices that have a primary input but also support hands (I do not believe that is the case here). In general this API is designed under the principle of matching device norms so if a device doesn't typically consider hand input a selection then apps shouldn't either, and apps wishing to do so can expect some manual tracking. That's a discussion that can happen when there is actually a device with these characteristics.

Manishearth avatar Feb 05 '24 10:02 Manishearth

That is an engineering decision based on a specific use case

I disagree – the spec notes what auxiliary and primary input sources are but does not note how to distinguish between them. That makes it ambiguous and impossible to detect what is what.

It's not conformant of them to not have any primary input sources whatsoever

I agree and believe this is a bug in VisionOS; however, their choice may be to expose a transient pointer (with eye tracking) later (which would be the primary input source) and people still want to use their hands to select stuff.
In that case there could even be multiple input sources active at the same time – the transient one and the hand – and there would still need to be a mechanism to detect which of these is a "primary" source and which not.

hybridherbst avatar Feb 05 '24 10:02 hybridherbst

I disagree – the spec notes what auxiliary and primary input sources are but does not note how to distinguish between them

The spec is allowed to have internal affordances to make spec writing easier. A term being defined has zero implication on whether it ought to be exposed. Were "it's defined in the spec" a reason in and of itself to expose things in the API then a bunch of the internal privacy-relevant concepts could be exposed too.

The discussion here is "should the fact that a hand input can trigger selections be exposed by the API". If tomorrow we remove or redefine the term from the spec, which we are allowed to do, that wouldn't and shouldn't change the nature of this discussion, which is about functionality, not a specific spec term.

however, their choice may be to expose a transient pointer (with eye tracking) later (which would be the primary input source) and people still want to use their hands to select stuff

I addressed that in an edit to my comment above: in that case the WebXR API defaults to matching device behavior, and expects apps to do the same. There's a valid argument to be made about making it easier for apps to diverge, but I don't think it can be made until there is an actual device with this behavior, and it is against the spirit of this standard so still something that's not a slam dunk.

Manishearth avatar Feb 05 '24 11:02 Manishearth

Unfortunately the current default hands profile is "generic-hands-select", which seems to imply a primary input action, not sure if we should change the default or do something else.

In visionOS WebXR the profiles for the hand is ["generic-hand"] because it does not fire a select event.

AdaRoseCannon avatar Feb 06 '24 21:02 AdaRoseCannon

@AdaRoseCannon should we update the spec to include that and allow it as an option?

Manishearth avatar Feb 06 '24 22:02 Manishearth

That might be sensible. It's odd because generic-hand is already included in the WebXR input profiles repo.

AdaRoseCannon avatar Feb 06 '24 22:02 AdaRoseCannon

https://github.com/immersive-web/webxr-hand-input/pull/121

Manishearth avatar Feb 06 '24 22:02 Manishearth

@AdaRoseCannon thanks for clarifying! The spec notes that

The device MUST support at least one primary input source.

but it seems that hands are the only input source on visionOS WebXR, and it's not a primary input source. Am I missing something?

hybridherbst avatar Feb 07 '24 09:02 hybridherbst

I actually think that line should probably be changed. Not all devices have input sources in the first place, and that's otherwise spec conformant.

I think it should instead be "for devices with input sources, at least one of them SHOULD be a primary input source"

Manishearth avatar Feb 07 '24 16:02 Manishearth

I don't think we need to change the primary input source requirement, simply because it should be valid to have the primary input source be transient. (This is the case for handheld AR devices, IIRC). It's somewhat unique for a device like the Vision Pro to expose persistent auxiliary inputs and a transient primary input, but I don't think that's problematic from a spec perspective. It may break assumptions that some apps have made.

I remember discussing the reasons why the hands weren't considered the source of the select events with Ada in the past and being satisfied with the reasoning, I just don't recall it at the moment.

toji avatar Feb 07 '24 19:02 toji

Looking at our code, we emit "oculus-hand", "generic-hand" and "generic-hand-select". Does VSP just emit "generic-hand"? Is Quest browser still allowed to emit "generic-hand"?

cabanier avatar Feb 07 '24 19:02 cabanier

@cabanier continuing that discussion on the PR

Manishearth avatar Feb 07 '24 20:02 Manishearth

@cabanier Yes, I can confirm that AVP only returns "generic-hand".

@toji the AVP currently, to the best of my understanding, does not have "persistent auxiliary inputs and a transient primary input". There is no primary input as far as I'm aware. It breaks the assumption that there is at least one primary input source (a MUST as per the spec, at least right now).

hybridherbst avatar Feb 07 '24 22:02 hybridherbst

@Manishearth's new PR allows for both profiles to be exposed. This matches both implementations, so I'm good with that change. This will allow you to disambiguate between VSP and other browsers.

cabanier avatar Feb 07 '24 22:02 cabanier

It breaks the assumption that there is at least one primary input source (a MUST as per the spec, at least right now).

That conflicts with my understanding of the input model from prior conversations with @AdaRoseCannon. That said, I haven't used the AVP yet and it may have been that our discussion centered around future plans that have not yet been implemented. Perhaps Ada can help clarify?

toji avatar Feb 08 '24 19:02 toji

In the initial release of visionOS there was no primary input source, visionOS 1.1 beta (now available) has transient-pointer inputs which are primary input sources.

AdaRoseCannon avatar Feb 08 '24 19:02 AdaRoseCannon

In the initial release of visionOS there was no primary input source, visionOS 1.1 beta (now available) has transient-pointer inputs which are primary input sources.

Interesting! We have some devices here that we'll update to visionOS 1.1 beta. Do you have any sample sites that work well with transient-pointer? We have it as an experimental feature and if it works well, we will enable it by default so our behavior will match.

cabanier avatar Feb 08 '24 21:02 cabanier

A THREE.js demo which works well is: https://threejs.org/examples/?q=drag#webxr_xr_dragging but don't enable hand-tracking since THREE.js demos typically only look at the first two inputs and ignore events from other inputs.

Brandon's dinosaur demo also works well, although similar caveat.

AdaRoseCannon avatar Feb 08 '24 21:02 AdaRoseCannon

I just tried it and created a recording: https://github.com/immersive-web/webxr/assets/1513308/e1247e4b-1985-4a0e-a562-51d6aeb65f06

I will see if it matches Vision Pro.

THREE.js demos typically only look at the first two inputs and ignore events from other inputs.

Are you planning on exposing more than 2 input sources? I've been thinking about doing the same since we can now track hands and controllers at the same time. I assumed this would need a new feature, or a new secondaryInputSources attribute.

cabanier avatar Feb 08 '24 21:02 cabanier

This is getting a little off topic for the thread, but would you want to expose hands and controllers as separate inputs? A single XRInputSource can have both a hand and a gamepad.

(EDIT: I guess the input profiles start to get messy if you combine them, but it still wouldn't be out-of-spec)

toji avatar Feb 08 '24 21:02 toji

I believe so, because if you expose hands and a transient input source, it would be weird if the ray space of the hand suddenly jumped and became a transient input source.

cabanier avatar Feb 08 '24 21:02 cabanier

I just tried it and created a recording

Looks correct to me.

Are you planning on exposing more than 2 input sources? I've been thinking about doing the same since we can now track hands and controllers at the same time. I assumed this would need a new feature, or a new secondaryInputSources attribute.

In visionOS 1.1, if you enable hand-tracking then the transient inputs appear after the hand inputs, as elements 2 and 3 of the inputSources array.

I assumed this would need a new feature, or a new secondaryInputSources attribute.

We have events for new inputs being added, which can be used to detect the new inputs. I personally don't believe we need another way to inform developers to expect more than two inputs.
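A minimal sketch of that approach, tracking the live set of sources via the standard `inputsourceschange` session event rather than assuming a fixed count of two (the registry helper itself is hypothetical):

```javascript
// Sketch: keep a live list of input sources by reacting to the standard
// 'inputsourceschange' event, which carries 'added' and 'removed' arrays,
// instead of assuming there are always exactly two sources.
function makeInputSourceRegistry() {
  const sources = new Set();
  return {
    handleChange(event) {
      for (const s of event.added) sources.add(s);
      for (const s of event.removed) sources.delete(s);
    },
    get all() {
      return [...sources];
    },
  };
}

// Typical wiring (session is an XRSession):
//   const registry = makeInputSourceRegistry();
//   session.addEventListener('inputsourceschange', (e) => registry.handleChange(e));
```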

AdaRoseCannon avatar Feb 08 '24 21:02 AdaRoseCannon

I assumed this would need a new feature, or a new secondaryInputSources attribute.

We have events for new inputs being added, which can be used to detect the new inputs. I personally don't believe we need another way to inform developers to expect more than two inputs.

I was mostly concerned about broken experiences. I assume you didn't find issues in your testing?

cabanier avatar Feb 08 '24 21:02 cabanier

@AdaRoseCannon Are there any experiences that work correctly with hands and transient input? @toji Should we move this to a different issue?

cabanier avatar Feb 08 '24 21:02 cabanier

I worry that adding inputsources is confusing for authors and might break certain experiences.

Since every site needs to be updated anyway, maybe we can introduce a new attribute (secondaryInputSources?) that contains all the input sources that don't generate input events.

/agenda should we move secondary input sources to their own attribute?

cabanier avatar Feb 12 '24 15:02 cabanier

I think there are a few cases where it won't be clear which thing is "secondary" and it highly depends on the application.

Example: if Quest had a mode where both hands and controllers are tracked at the same time, there could be up to 6 active input sources:

  • 2 hands (with or without select events)
  • 2 controllers (with select events)
  • 2 transient pointers
  • of which e.g. 4 could be active simultaneously, which would still be allowed according to spec if I'm not mistaken.

I think instead of a way to see which input sources may be designated "primary" or "secondary" by the OS, it may be better to have a way to identify which input events are caused by the same physical action (e.g. "physical left hand has caused this transient pointer and that hand selection") so that application developers can decide if they want to e.g. only allow one event from the same physical source.
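There is no such correlation mechanism in the spec today, but the idea could be approximated app-side with a sketch like the following, keying on `handedness` plus an arbitrary time window (both the helper and the 150 ms window are illustrative assumptions, not anything specified):

```javascript
// Hypothetical de-duplication sketch for the idea above: if two select
// events attributed to the same handedness (e.g. a transient pointer and
// a hand) arrive within a short window, treat the second one as a
// duplicate of the same physical action. The 150 ms window is an
// arbitrary illustrative choice; the spec defines no such correlation.
function makeSelectDeduper(windowMs = 150) {
  const lastSelect = new Map(); // handedness -> timestamp of last select
  return function isDuplicate(handedness, now) {
    const prev = lastSelect.get(handedness);
    lastSelect.set(handedness, now);
    return prev !== undefined && now - prev < windowMs;
  };
}
```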

hybridherbst avatar Feb 12 '24 18:02 hybridherbst

I don't think it's enough to disambiguate the events. For instance, if a headset could track controllers and hands at the same time, what is the primary input?

If the user is holding the controllers, the controllers are primary and hands are secondary. However, if they put the controllers down, hands become primary and controllers are now secondary.

WebXR allows you to inspect the gamepad or look at finger distance so we need to find a way to let authors know what the input state is. Just surfacing everything will be confusing.

cabanier avatar Feb 12 '24 19:02 cabanier

Since every site needs to be updated anyway,

Hold on, does it? I don't think we're requiring any major changes here.

Manishearth avatar Feb 12 '24 19:02 Manishearth