mediacapture-main
Avoid circular definition of muted.
This definition is backwards: "If live samples are not made available to the MediaStreamTrack it is muted".
Mute causes lack of frames, not the other way around: If a MediaStreamTrack is muted, no live samples are made available to it.
All subsequent language and examples align with muted being an intentional, User Agent-initiated change:
Crucially, the "change" of state (not just the event) is initiated by the User Agent.
This has caused confusion in implementations. E.g. @youennf replied in https://github.com/w3c/mediacapture-extensions/issues/39#issuecomment-1824119912:
Thanks @guidou, this is really helpful info.
For camera tracks, Chrome just checks if frames have not been received for some time (25 expected frame intervals in most cases), regardless of the underlying reason. This maps well to the spec text that states "If live samples are not made available to the MediaStreamTrack it is muted".
The spec allows it. I wonder though whether this model is actually helping web developers. For instance, is it better to have a black video or a frozen video when the track is sent via WebRTC?
In general, the value of an "event" is its intent, that something external happened. Therefore, synthesizing events reactively from symptoms seems a mistake. For example: crbug 941740 implements mute on remote tracks reactively based on (lack of) input, violating the WebRTC spec and causing web compat issues. Doing the same on capture tracks seems like a bug, and should be a violation of this spec, but is attributed to the aforementioned line in the spec.
The stats API that @henbos is working on could be more appropriate for web developers.
FWIW, in Safari, if we do not receive video/audio frames after a given amount of time, we fail the capture. We assume that something is wrong and that the web application had better restart capture, maybe with a different device. Some web applications do this, though sadly not all of them.
These browser differences are making developers' lives difficult. I wonder whether this space is now mature enough that we could get browsers to share a more consistent model around muted and capture suspension/failure. @jan-ivar, how is Firefox using muted these days for capture tracks? Is Firefox sometimes failing capture?
Firefox fires mute as explained in the OP of https://github.com/w3c/mediacapture-extensions/issues/39#issue-1037935336 (behind a pref) but never reactively from symptoms.
Proposal:
Replace the confusing sentence with "If a MediaStreamTrack is muted, no live samples are made available to it."
The problem with "fixing" these spec definitions that have been in place for years to try to better solve today's problems is that it is extremely difficult to update implementations that have followed the old definition for years. Practically every time Chromium has tried to do that with other similar spec changes, the changes had to be reverted because the new behavior broke existing applications. We have had much better results introducing new and better APIs and removing the old one after applications migrate to the new one (srcObject and the Plan B/Unified Plan transitions are good examples of this). This type of redefinition is also very problematic for applications that need to support new and old browser versions simultaneously (common in enterprise environments).
I would oppose any spec changes to the muted attribute and the corresponding events.
I'm OK with defining a new attribute or method with the new definition (e.g., call it isMuted) and removing the old one from the spec.
This gives time to applications to migrate to the new version over time without causing abrupt compatibility problems.
If we do that, then we can probably move the discussion to requestUnmute(). After all, what we were proposing was adding a bool to better understand whether the cause of the mute was the one we want in the new definition or one of the causes allowed by the old definition but not the new one.
@guidou, I understand the concerns. Before diving into those concerns, I understand that there is a desire from Chrome to try moving towards this specific muted definition.
About the concerns: in this particular case, the change is about no longer firing mute events in odd cases. How do you expect it to break existing websites? I would think that some UI might not be updated with the capture-does-not-work-properly state, which is not great but not too bad either. And these websites would need to be updated anyway.
As for a new attribute, would it mean new event listeners? If so, this would take a very high toll on all browsers and all websites; this seems very complex.
Given that the audio/video stats API will allow simulating these odd-case mute events, would it not be possible to advertise JS polyfills for applications that would like to keep receiving these events? That way, shipping the audio/video stats API and muted-event migration guidelines could be sufficient.
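A minimal sketch of such a polyfill, assuming a video element sink and requestVideoFrameCallback; the 25-expected-frame-intervals threshold mirrors the Chrome heuristic quoted above, and all function names here are illustrative, not spec'd:

```javascript
// Pure helper: has the source stalled, i.e. have more than
// `thresholdIntervals` expected frame intervals elapsed since the last
// delivered frame? (The 25-interval default mirrors the Chrome heuristic
// quoted above; the exact threshold is an assumption.)
function hasStalled(lastFrameTimeMs, nowMs, frameIntervalMs, thresholdIntervals = 25) {
  return nowMs - lastFrameTimeMs > frameIntervalMs * thresholdIntervals;
}

// Browser-only wiring: synthesize mute/unmute callbacks from the absence
// of frames, instead of relying on the UA's mute events.
function watchVideoTrack(track, videoElement, { onSyntheticMute, onSyntheticUnmute }) {
  const frameIntervalMs = 1000 / (track.getSettings().frameRate || 30);
  let lastFrameTimeMs = performance.now();
  let muted = false;

  const onFrame = () => {
    lastFrameTimeMs = performance.now();
    if (muted) {
      muted = false;
      onSyntheticUnmute(track);
    }
    videoElement.requestVideoFrameCallback(onFrame);
  };
  videoElement.requestVideoFrameCallback(onFrame);

  return setInterval(() => {
    if (!muted && hasStalled(lastFrameTimeMs, performance.now(), frameIntervalMs)) {
      muted = true;
      onSyntheticMute(track);
    }
  }, frameIntervalMs);
}
```

This keeps the old reactive behavior entirely in application code, where each application can tune the threshold itself.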
I'd like to avoid introducing a boolean whose definition would, from the start, mention that this is for legacy applications and that we plan to obsolete it.
Overall, I like this proposal. I wonder whether we should mention that UAs MAY end capture because of lack of samples for unknown reasons. If we go in that direction, we probably want to update image capture to not fire onmute/onunmute events.
How do you expect it to break existing websites? I would think that some UI might not be updated with the capture-does-not-work-properly state, which is not great but not too bad either. And these websites would need to be updated anyway.
Good question. Any apps treating mute as fatal would already fail to interoperate.
In general, the value of an "event" is its intent, that something external happened. Therefore, synthesizing events reactively from symptoms seems a mistake.
What can the user agent do on platforms where it gets no advance knowledge that frames will not be forthcoming?
I see great value in giving the application as much clarity as the user agent can muster. We live in a world of open source operating systems and browsers. Hundreds of millions of people use video-conferencing tools every day. The vendors of these VC applications have large engineering teams. Having clear metrics on where various issues lie, allows these engineers to set out and fix issues in codebases beyond their own - to everyone's benefit.
Specifying that user agents should not mute when they are not sure the issue is an explicit mute would be a step in the wrong direction. Having more fine-grained MuteReasons - as I had proposed elsewhere - would be a step in the right direction.
In general, the value of an "event" is its intent, that something external happened. Therefore, synthesizing events reactively from symptoms seems a mistake.
What can the user agent do on platforms where it gets no advance knowledge that frames will not be forthcoming?
We define APIs based on developer needs, not user agent needs.
If the OS mutes, the user agent owns the problem of detecting that and conveying that as an "event" that happened. E.g. if the user agent has reason to believe the lack of frames is instead due to an error, then ending the track may be more appropriate.
If the user agent cannot tell whether the OS muted it or whether there was an error, that is its problem to solve. Punting hard questions like this to the webapp doesn't seem reasonable to me.
The spec already defines muted and ended as separate events for this reason. Agreeing on these definitions is what we've committed to in order to have browsers interoperate.
the change is about no longer firing mute events in odd cases. How do you expect it to break existing websites?
An answer to this question would be helpful.
We define APIs based on developer needs, not user agent needs.
Developers need to be able to debug their users' issues, even if those issues extend beyond the JS application. For a developer of a VC application, the user agent, the operating system, even the hardware - everything is in scope.
If the OS mutes, the user agent owns the problem of detecting that and conveying that as an "event" that happened.
As mentioned in my previous message, the user agent might not be able to understand why it is not receiving new frames. Issuing a mute event in such a case is both spec-compliant and useful (to developers).
If the user agent has reason to believe lack of frames is instead due to an error, then ending the track may be more appropriate.
Why frames are not arriving would not always be known.
If the user agent cannot tell whether the OS muted it or whether there was an error, that is its problem to solve.
Developers cannot afford to sit on their hands and pray that others will solve their problems. We live in a competitive world. Whoever solves their users' problems promptly is rewarded by retaining those users. Let's empower developers in their quest to serve our mutual users. ("Our mutual users" - shared by the browser and the Web app.)
If the user agent has reason to believe lack of frames is instead due to an error, then ending the track may be more appropriate. [...] The spec already defines muted and ended as separate events for this reason.
You gave an example where you believe ending is better than muting. Even if I agreed, for the sake of argument, that this was correct - what about all other cases? Allow me to quote my colleague Guido: "We need to solve all use cases that arise in practice, not just the simplest one."
This issue was mentioned in the WebRTC WG meeting of 12 December 2023 (Solve user agent camera/microphone double-mute (mediacapture-extensions)).
This definition is backwards: "If live samples are not made available to the MediaStreamTrack it is muted".
Mute causes lack of frames, not the other way around: If a MediaStreamTrack is muted, no live samples are made available to it.
All subsequent language and examples align with muted being an intentional, User Agent-initiated change:
Not all subsequent language aligns with muted being an intentional, User Agent-initiated change. In fact, I would argue that no language at all aligns with this. The word "intentional" does not appear anywhere in the spec. The list you refer to is presented as situations that "can be" reasons to mute a track. Nowhere is it stated that any element in that list is actually a reason to mute, or even that it SHOULD cause mute. At best, it can be interpreted as a MAY. More importantly, Section 4.3.1.1 of the spec says:
The muted/unmuted state of a track reflects whether the source provides any media at this moment.
A MediaStreamTrack is muted when the source is temporarily unable to provide the track with data
And Section 8 says the mute event is fired when "The MediaStreamTrack object's source is temporarily unable to provide data", and the unmute event is fired when "The MediaStreamTrack object's source is once again able to provide data".
This makes it clear that the model is that muted means no media from the source to the track, and disabled means no data from the track to its consumers.
In general, the value of an "event" is its intent, that something external happened. Therefore, synthesizing events reactively from symptoms seems a mistake.
Maybe it was a mistake that the spec defined the muted attribute and the corresponding events the way it did years ago. But, mistaken or not, that's how it was defined.
For example: crbug 941740 implements mute on remote tracks reactively based on (lack of) input, violating the WebRTC spec and causing web compat issues.
In this case, Chromium is just applying the model defined in the main spec to remote tracks. The WebRTC spec indicates some cases in which the muted attribute should be set/unset, but AFAICT it does not say anywhere that this overrides the model defined in the original MediaStreamTrack specification. It also does not state a new definition of muted specific to WebRTC tracks, and does not even list the muted/unmuted events in its Event Summary section.
Shouldn't specs that override/redefine concepts inherited from other specs explicitly state it? Until we make this more explicit in the WebRTC spec, my position is that https://crbug.com/941740 is not a spec-compliance bug in Chromium. If anything, it looks more like a spec bug in the WebRTC spec.
Doing the same on capture tracks seems like a bug,
Are you saying it seems like a spec bug or a Chromium bug? It is pretty clear to me that Chromium's behavior is spec compliant.
and should be a violation of this spec,
Are you saying Chromium's behavior is in violation of the spec, or that the spec should be rewritten such that Chromium's behavior becomes a violation of the spec?
but is attributed to the aforementioned line in the spec.
Not only that line. As I showed, the concept of muted meaning no data from source to track appears in many places in the spec, and is the only way muted is defined.
Firefox fires mute as explained in the OP of w3c/mediacapture-extensions#39 (comment) (behind a pref) but never reactively from symptoms.
Maybe Firefox's behavior is the one in violation of the spec?
Proposal:
Replace the confusing sentence with "If a MediaStreamTrack is muted, no live samples are made available to it."
Can you clarify what this sentence means? Is it a description of something that happens when a track is muted? In that case it's not a definition, and it's not that different from the original, except that it is no longer a definition. Basically, it replaces "A is defined to be B" with "A implies B".
Or is it a statement that if the UA detects a condition that should mute the track, then it should make sure the track does not receive any media?
Either way, the change is not enough, since the concept of muted meaning no data from source to track appears in many other places in the spec.
Finally, I am opposed to an incompatible redefinition of the meaning of muted because experience shows that this type of change is difficult to deploy in practice and can lead to more interoperability issues.
I am not opposed to a redefinition that provides a path for existing applications to use a newer, more useful definition, without making it impossible for applications to continue relying on the old definition.
@guidou, I understand the concerns. Before diving into those concerns, I understand that there is a desire from Chrome to try moving towards this specific muted definition.
Yes, we are interested in introducing a new definition that can solve the multiple-mute problem (and even the single mute one), but in a way that doesn't break existing applications or that at least provides a path for existing applications to be easily updated to continue working.
About the concerns: in this particular case, the change is about no longer firing mute events in odd cases. How do you expect it to break existing websites? I would think that some UI might not be updated with the capture-does-not-work-properly state, which is not great but not too bad either. And these websites would need to be updated anyway.
In our experience, applications that break are the ones that are hard to think about in advance. We normally find out after rolling out the change. For example, when we implemented the requirement to wait for focus in getUserMedia() we thought nothing would break, and shortly after we started rolling out the change we received reports from some kiosk-like environments that broke because focus was impossible to obtain for those applications. We had to roll back the change.
As for a new attribute, would it mean new event listeners? If so, this would take a very high toll on all browsers and all websites; this seems very complex.
It depends on how we define the new attribute. If we go with the muteReason proposal or a similar one, we don't need new event listeners. Applications might need to be updated to look at the mute reason to decide how to proceed, but they would have a path to migrate to the new API without causing permanent breakage.
Given that the audio/video stats API will allow simulating these odd-case mute events, would it not be possible to advertise JS polyfills for applications that would like to keep receiving these events? That way, shipping the audio/video stats API and muted-event migration guidelines could be sufficient.
Maybe that can be a solution. Support the old definition via stats and the new definition with muted. I'm not sure the stats spec in its current form supports this, but it's a valid possibility.
I'd like to avoid introducing a boolean which definition would, from the start, mention that this is for legacy applications and that we plan to obsolete it.
That wouldn't be ideal. It doesn't have to be the case here, though. If we are able to provide a good migration path via stats, that might work. Adding a muteReason or some other API for the new definition would also work.
In general, the value of an "event" is its intent, that something external happened. Therefore, synthesizing events reactively from symptoms seems a mistake.
What can the user agent do on platforms where it gets no advance knowledge that frames will not be forthcoming?
We define APIs based on developer needs, not user agent needs.
I don't think user agents have needs other than the ones of their users (including developers).
If the OS mutes, the user agent owns the problem of detecting that and conveying that as an "event" that happened. E.g. if the user agent has reason to believe the lack of frames is instead due to an error, then ending the track may be more appropriate.
To me all this sounds a lot like synthesizing events reactively from symptoms.
The spec already defines muted and ended as separate events for this reason. Agreeing on these definitions is what we've committed to in order to have browsers interoperate.
Yes. Chromium implements both according to the spec. What we're discussing here is how to change the spec to solve new problems (e.g., multiple mute) in a way that doesn't introduce insurmountable compatibility problems for existing applications.
the change is about no longer firing mute events in odd cases. How do you expect it to break existing websites?
An answer to this question would be helpful.
Already answered in a previous message.
We define APIs so that developers can satisfy user needs for applications running on a specific UA. The UA has no needs; it exists to satisfy the user - in the case of JS apps, to let the app developers satisfy the users.
The UA and the OS are not friends. And the user has a direct relationship to both.
When an OS-level mute is applied and can only be rectified using the user's relationship with the OS, the user needs to know that they have to act in relation to the OS.
If the OS offers an API to the UA so that the UA can let the app developer satisfy the user's need (in this case: to unmute), the user's needs will be simpler to satisfy.
The difference between muted and ended in our specs is that one is reversible, the other isn't. So anything that is not based on a clear signal that the source is gone and won't come back should be "muted", not "ended". "Reason to believe" sounds like "probable cause", not "clear signal".
If we are able to provide a good migration path via stats, that might work.
For video, deliveredFrames can be used with a timer-based approach to shim existing Chromium mute events for video tracks. Alternatively, the already-shipped rvfc (requestVideoFrameCallback) can be used to detect that frames are not flowing. This probably makes video the easier one to migrate first.
For audio, deliveredFrames can be used for microphone tracks; an AudioWorklet would most probably observe zeroes in case of missing audio frames.
This approach does not require creating new APIs and allows web applications to fine-tune their own detection heuristics.
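As a sketch of this migration path, assuming the proposed MediaStreamTrack stats surface exposes a monotonically increasing deliveredFrames counter on track.stats (the exact shape is still under discussion, so treat the attribute name as an assumption):

```javascript
// Pure helper: did any frames flow between two samples of the counter?
function framesFlowed(prevDelivered, currDelivered) {
  return currDelivered > prevDelivered;
}

// Poll the assumed track.stats.deliveredFrames counter and synthesize the
// "odd case" mute/unmute notifications a legacy application expects.
function shimReactiveMute(track, { pollMs = 1000, onMute, onUnmute }) {
  let prev = track.stats.deliveredFrames; // assumed stats attribute
  let muted = false;
  return setInterval(() => {
    const curr = track.stats.deliveredFrames;
    if (!muted && !framesFlowed(prev, curr)) {
      muted = true;
      onMute(track);
    } else if (muted && framesFlowed(prev, curr)) {
      muted = false;
      onUnmute(track);
    }
    prev = curr;
  }, pollMs);
}
```

The one-poll-interval stall threshold is deliberately simplistic; a real shim would scale the polling interval with the track's frame rate or sample rate.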
@guidou, do you think this migration path would work?
That's not a migration path, that's a redefinition. It proves that there's nothing preventing other browsers from emulating Chrome's behavior, even if you want to do it in a shim. What possible advantage would there be to Chrome in departing from the existing behavior, which is consistent with the current definition?
For example: crbug 941740 implements mute on remote tracks reactively based on (lack of) input, violating the WebRTC spec and causing web compat issues.
In this case, Chromium is just applying the model defined in the main spec to remote tracks. The WebRTC spec indicates some cases in which the muted attribute should be set/unset, but AFAICT it does not say anywhere that this overrides the model defined in the original MediaStreamTrack specification. It also does not state a new definition of muted specific to WebRTC tracks, and does not even list the muted/unmuted events in its Event Summary section.
This seems wrong. I've filed https://github.com/w3c/webrtc-pc/issues/2915 on this. Let's discuss that there.
I think I see now how we came to have this vague language. MediaCapture-main is trying both to establish a model for all sources and to specify camera and microphone sources explicitly. I think it needs to do a better job of separating when it's doing one or the other.
At its core, I think most people consider muting to be a conscious action based on intent. A reason, not a reaction.
At its core, I think most people consider muting to be a conscious action based on intent. A reason, not a reaction.
Correct me if I am wrong, but at the time that mute was specified, I believe no user agent allowed users to mute the mic/camera, nor did any OS. What conscious action were Web apps intended to discover? By whom? How was this actionable to such Web apps?
It's in the OP: "There can be several reasons for a MediaStreamTrack to be muted: the user pushing a physical mute button on the microphone, the user closing a laptop lid with an embedded camera, the user toggling a control in the operating system, the user clicking a mute button in the User Agent chrome, the User Agent (on behalf of the user) mutes, etc."
The "etc." refers to other "reasons" ... "the User Agent initiates such a change", including "access may get stolen ... in case of an incoming phone call on mobile OS".
I dunno when Safari implemented its pause, but I think it was fairly early? But I don't understand why it matters since it's common and desirable for specs to exist before implementations. Specs define implementations.
When I said "most people" I meant outside of WebRTC. Muting is a verb, a function.
An answer to this question would be helpful.
Already answered in a previous message.
Could you link to it please? This issue is getting long. Please give an example of an application relying on Chrome's behavior and what action it takes. E.g. is it showing the user a message that "things are broken and no-one can hear you, please wait, maybe"?
An answer to this question would be helpful.
Already answered in a previous message.
Could you link to it please? This issue is getting long. Please give an example of an application relying on Chrome's behavior and what action it takes. E.g. is it showing the user a message that "things are broken and no-one can hear you, please wait, maybe"?
The answer is that, in our experience, applications that break with this type of change are the ones that are hard to think about in advance. We normally find out after rolling out the change. For example, when we implemented the requirement to wait for focus in getUserMedia() we thought nothing would break, and shortly after we started rolling out the change we received reports from some kiosk-like environments that broke because focus was impossible to obtain for those applications.
IMO, the bar for changing a definition that has been in place for years both in spec and implementations should be very high, even if the proposed change is obviously better.
IMO, the bar for changing a definition that has been in place for years both in spec and implementations should be very high, even if the proposed change is obviously better.
I agree. And even if we could come to an agreement, it does not appear that it would come quickly or easily. Now, @jan-ivar has recently posted something I wholeheartedly agree with:
Web developers should not suffer while vendors reach agreement.
In the spirit of these wise words, I propose we now proceed with one of the backwards-compatible proposals currently under discussion, such as MuteReason or MediaSession. (Full disclosure - I have a strong preference for the former.)
even if the proposed change is obviously better.
It seems we all agree this definition would be better. It would make sense to work towards getting all implementations aligned on that definition.
A path forward has been described, via a shim of current Chrome behaviour. This seems a practical approach to me. If not, I'd like to understand why.
such as MuteReason
This would solidify a model of muted being open-ended and loosely defined. A dedicated event-based API for each cause where mute might be useful would lead to better interop and more convenience for web developers.
MediaSession
We need to make MediaSession and MediaStreamTrack consistent, let's do that whatever we decide here.
This would solidify a model of muted being open-ended and loosely defined.
Even with the proposal here, "mute" would still cover both OS-based and UA-based muting. Letting the Web app know which it is does not make it open-ended or loosely defined. Carving out an "unspecified" reason for hardware issues, or anything else we might not be thinking of, does not solidify the model; later migration would be equally challenging then as it is now.
A dedicated event-based API for each cause where mute might be useful would lead to better interop and more convenience for web developers.
I am not opposed to dedicated events, but they seem to be a less elegant solution, given the possibility of multiple concurrent mutes. Conversely, a single mute state with multiple reasons allows observing the transition from empty set to non-empty set, which is great for apps that only care about that.
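To illustrate the reason-set model (muteReasons and the reason strings below are hypothetical, not part of any spec): muted is derived from a set of reasons, and an app that only cares about the boolean just observes the empty/non-empty transitions:

```javascript
// Hypothetical sketch: a mute state derived from a set of concurrent
// reasons. Only the transitions between the empty and non-empty set
// fire the coarse muted/unmuted callbacks.
class MuteState {
  constructor({ onMuted, onUnmuted }) {
    this.reasons = new Set();
    this.onMuted = onMuted;
    this.onUnmuted = onUnmuted;
  }
  get muted() {
    return this.reasons.size > 0;
  }
  add(reason) {
    const wasMuted = this.muted;
    this.reasons.add(reason);
    if (!wasMuted && this.muted) this.onMuted([...this.reasons]);
  }
  remove(reason) {
    const wasMuted = this.muted;
    this.reasons.delete(reason);
    if (wasMuted && !this.muted) this.onUnmuted();
  }
}
```

Adding a hypothetical 'os-mute' while 'ua-mute' is already present changes the reason set but not the boolean, so an app that only cares about muted sees a single mute and a single unmute across the whole episode.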
It seems to me that even with the greatest selection of mute reasons imaginable, there is likely to be the case of "this source is producing silence and I don't know why". I think that's a reasonable description of the cases where Chrome currently mutes and other browsers have not chosen to mute.
Note: I'm unclear about whether Chrome fires mute events on "no signal" in audio. If we do, I think the signal Chrome is reacting to on audio is digital silence (all zeroes), which is different from "no speech detected" - there's always some noise in real audio.
"this source is producing silence and I don't know why"
It is hard to make progress without precisely knowing how/when Chrome is firing mute events on capture tracks. I understand that Chrome's intent is to currently use mute to notify web applications that a capture track is potentially malfunctioning. Is that correct?
That seems valuable information to provide to the web page. AIUI, this is one of the MediaStreamTrack stats API's goals, though a dedicated API might make web developers' lives easier.
For video, MediaStreamTrack stats is hopefully sufficient to detect these malfunctioning cases. For audio, it is unclear whether MediaStreamTrack stats is enough; maybe this should get fixed.
For video, MediaStreamTrack stats is hopefully sufficient to detect these malfunctioning cases. For audio, it is unclear whether MediaStreamTrack stats is enough; maybe this should get fixed.
When the mute event is fired and the app observes it and turns to handle it, what stats are available to it that would definitively, non-heuristically inform it that the track is muted due to an upstream entity such as the OS or UA?
For video, MediaStreamTrack stats is hopefully sufficient to detect these malfunctioning cases. For audio, it is unclear whether MediaStreamTrack stats is enough, maybe this should get fixed.
When the mute event is fired and the app observes it and turns to handle it, what stats are available to it that would definitively, non-heuristically inform it that the track is muted due to an upstream entity such as the OS or UA?
What is "an upstream entity such as the OS or UA" distinct from, when all muting is "UA" by definition? This seems to be the definition problem we're having.
Turning the question around:
When the mute event is fired in Chrome and the app observes it and turns to handle it, what stats are available to it that would inform it that the track is malfunctioning?
In other browsers, apps could detect this (e.g. using stats once implemented):
- lack of frames in stats + !track.muted = malfunction
In Chromium, apps cannot, because Chromium circularly masks the symptom, making malfunction indistinguishable from "OS or UA" mute.
This problem seems unique to Chromium, as does the need for a new mute-reason API to resolve it.
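The detection described above reduces to a tiny classifier. The framesDelta input is assumed to come from a frame counter such as the proposed stats API, polled at a fixed interval:

```javascript
// Classify a capture track's state from two observable signals: the
// spec'd muted attribute and whether any frames arrived since the last
// poll. This only distinguishes malfunction from mute if the UA does
// not mask malfunctions as mute, which is the point made above.
function classifyTrackState(muted, framesDeltaSinceLastPoll) {
  if (muted) return 'muted';                                 // intentional OS/UA mute
  if (framesDeltaSinceLastPoll === 0) return 'malfunction';  // enabled, but no frames
  return 'flowing';                                          // healthy capture
}
```

In a UA that reactively mutes on missing frames, the second branch is unreachable: the symptom is folded into the first.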
To make progress, I think we should leave the UA vs. OS muting discussion out of this particular issue. This can be resolved orthogonally to this discussion.
The proposal is something like:
1. UAs refrain from firing mute events in malfunction cases.
2. We design a shim based on MediaStreamTrack stats that emulates mute events in malfunction cases.
3. If MediaStreamTrack stats is not sufficient for 2, we augment the API surface (in MediaStreamTrack stats or elsewhere, e.g. a new event).
4. If the track is UA-muted, there are no stats, but there is also no need to know whether the track is malfunctioning.
Other than requiring changes in UAs, I do not see any drawback. Am I missing something?
To make progress, I think we should leave the UA vs. OS muting discussion out of this particular issue. This can be resolved orthogonally to this discussion.
The proposal is something like:
1. UAs refrain from firing mute events in malfunction cases.
I don't see the justification for this. It draws a distinction between "malfunction" and "non-malfunction" that seems unwarranted and unenforceable (if a user unplugs the camera, it's a mute event; if the cat bites off the camera cable, it's a malfunction????)
2. We design a shim based on MediaStreamTrack stats that emulates mute events in malfunction cases.
Since 1 is unjustified, 2 is unreasonable. Also, shims don't belong in the spec. If Firefox or Safari wish to emulate Chrome's behavior, they're free to incorporate a shim of that nature, but I don't see a point in changing Chrome's behavior.
3. If MediaStreamTrack stats is not sufficient for 2, we augment the API surface (in MediaStreamTrack stats or elsewhere, e.g. a new event).
4. If the track is UA-muted, there are no stats, but there is also no need to know whether the track is malfunctioning.
Other than requiring changes in UAs, I do not see any drawback. Am I missing something?
Since I don't see the point of the change, I don't see any advantage in making it.
Muted is outside the control of web applications, but can be observed [... reasons why mute can happen]. The User Agent SHOULD provide this information to the web app through muted and its associated events.
Whenever the User Agent initiated such a change, [...]
When the referenced text says the UA "initiates such a change", I believe it is referring to the steps to mute the MediaStreamTrack JS object, which only the UA can modify, i.e. the steps to make the muting visible to the web app. Do read the previous sentence, which says the UA should expose this information to the app. Also read all the examples; they're full of things that happened that were not "initiated by the UA" (laptop lid closing, incoming phone call, etc). The only thing initiated by the UA is firing the event; it is reactive, not proactive.
Replace the confusing sentence with "If a MediaStreamTrack is muted, no live samples are made available to it."
This does not make it less confusing. It begs the question: why is it muted? Even under this definition, my reading is still that the UA should detect mute on a higher layer - including reasons of malfunction, the "etc" is really a catch-all - and then initiate the exposure of the mute event. My understanding is Chromium is spec-compliant both with and without this sentence changed.
In other words, today mute means "I'm not getting any frames despite the track being enabled". This makes sense to know whether or not you care about the reason. And because we haven't exposed the reason yet, people haven't been allowed to care about why yet. So from a web developer POV, the use case this solves is still valid and it is backwards compatible not to change it.
If we add the reason, then apps that do care about why have enough information to make the distinction, solving both the use case of caring and the use case of not caring, without causing backwards compat issues.
Finally, let's ask ourselves: what value does it bring to developers to pretend a malfunctioning track is not muted?