Proposal: Element 'sensitive' attribute
As Artificial Intelligence (AI) becomes more prevalent in web interactions, there's a growing need to safeguard user privacy. One area of concern is the potential for AI to ingest sensitive data from HTML content. Today, each entity that consumes HTML content in an AI context needs its own set of heuristics to determine whether the information is appropriate for AI consumption.
As a layer of protection in addition to such heuristics, we propose introducing a new element attribute, tentatively called sensitive, which for now would be a boolean value. When true, it indicates that the content of this element's subtree is sensitive or confidential information that should not be consumed by an AI model.
The idea would be that sites can annotate their pages with this attribute to "hide" information from automated processing. Sites that present banking or medical information, for example, would be expected to use it.
It is also interesting to explore whether new values for the sensitive attribute could mean that the information can be consumed with the user's explicit permission, for cases where the user wants the AI model to process sensitive information to arrive at some result. For now, however, the proposal is to treat sensitive as an indication that the content should not be ingested by AI.
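To make the shape of the proposal concrete, here is a hypothetical sketch of how a banking page might use the attribute; the attribute name, its placement, and the boolean-presence semantics are all illustrative and still tentative:

```html
<!-- Hypothetical markup; the attribute name and semantics are tentative. -->
<section>
  <h2>Account overview</h2>

  <!-- Public, non-personal content: fine for automated processing. -->
  <p>Welcome back! See our latest savings rates below.</p>

  <!-- The balance and transactions are confidential; the attribute marks
       this whole subtree as off-limits for AI ingestion. -->
  <div sensitive>
    <p>Current balance: $4,210.77</p>
    <ul>
      <li>2024-03-02 Pharmacy refill -$35.00</li>
      <li>2024-03-01 Salary deposit +$3,000.00</li>
    </ul>
  </div>
</section>
```

Because the attribute would apply to an element's subtree, authors could mark a container once rather than annotating every individual field.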
How would we avoid this becoming as useless as autocomplete=off?
I think you're asking what prevents AI features that consume web content from simply ignoring this attribute, did I understand correctly?
This is in a similar vein as robots.txt in that it can be ignored, but for a well meaning implementation that tries to do "the right thing", it's a useful hint about the sensitivity of the contents.
When an AI model consumes web contents and then communicates with some other, unrelated entity, for example, the resulting leak of sensitive information from the web contents to that entity is frequently unintended by the AI model/tool author. This attribute would provide a strong hint that the contents should be avoided.
I admit that this is a bit of a cart-before-the-horse in that the prevalence of tools that consume web contents in an AI context is currently low. It is also not necessarily something that existing browser implementations need to worry about. However, if AI features are going to be added to browsers, then this may become more important.
My concern is more about web developers slapping this on everything and thus eventually causing it to be ignored by tooling as it's noise (that's roughly what happened with autocomplete=off).
This is in a similar vein as robots.txt in that it can be ignored, but for a well meaning implementation that tries to do "the right thing", it's a useful hint about the sensitivity of the contents.
the field has already demonstrated they largely do not care for the intentions of web authors
I assume bots would be made that specifically gather content that is marked as sensitive so that it can be more efficiently exploited.
As for telling AI not to consume data ... I would appreciate, if at least benevolent players would agree to respect a simple, general value such as noai (suggested by DeviantArt) in the <meta name='robots' content='...'> tag and X-Robots-Tag HTTP header. Is there any hope for this to happen?
I assume bots would be made that specifically gather content that is marked as sensitive so that it can be more efficiently exploited.
That's a fair point. I'm mostly worried about sensitive data in the sense of logged-in, personally identifiable information, not data that is public yet sensitive, which is what bots would be able to access.
As for telling AI not to consume data ... I would appreciate, if at least benevolent players would agree to respect a simple, general value such as noai (suggested by DeviantArt) in the <meta name='robots' content='...'> tag and X-Robots-Tag HTTP header. Is there any hope for this to happen?
I suspect that, naming aside, the attribute proposed here is a more granular version of the noai header. That is, there seem to be reasonable cases where annotating certain sections as noai is preferable to marking the whole page as noai.
I want to mention a distinction between cases where data is used to train a model -- where noai should avoid the whole page -- and cases where data is used by a trained model to answer some user prompt. In the latter, marking some data as sensitive but not the whole page seems more appropriate (and is also the reason why the initial proposal was sensitive and not noai).
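To illustrate the two granularities, here is a hypothetical comparison; note that neither the noai token nor the sensitive attribute is an implemented standard today:

```html
<!-- Option A (page-wide): opt the entire document out of AI use. -->
<meta name="robots" content="noai">

<!-- Option B (element-level): opt out only the confidential subtree,
     leaving the rest of the page available to a consuming tool. -->
<article>
  <p>Public product description that may be processed normally.</p>
  <div sensitive>
    <p>Customer-specific order history and shipping address.</p>
  </div>
</article>
```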
In general, there's a problem of incentivizing authors to use, but not overuse, a property like this, which is hard. There's also a need for authors to somehow help "benevolent" AI players make the right decisions. This proposal is meant to at least start that conversation. More concretely, do we take the stance that since this is a hint that can be overused/abused, we shouldn't attempt to standardize anything here? I sympathize with the autocomplete=off example, and I'm wondering if there is a solution here that would work better.
FWIW, during TPAC the Screen Capture Community Group also explored the idea of adding a "sensitive" attribute for the purpose of capture warning/prevention to protect user privacy:
The idea is similar: if the page/element contains sensitive information (a banking account, a medical record), then the user may want to think twice before capturing and sharing the page.
As long as there is no way for the end user to ascertain whether the website is applying this attribute correctly, I don't think this is solvable in this way.
It seems like this embedded signal should be coordinated with the page-wide and site-wide signals being contemplated after https://www.ietf.org/archive/id/draft-iab-ai-control-report-00.html.
Edit: there's now an IETF working group chartered to develop these signals at https://datatracker.ietf.org/wg/aipref/about/.
The argument that sites may misuse this API is, I think, the same argument for any powerful API: notifications, fullscreen, storage, autoplaying audio, even things we don't mediate well today like focusing an input field from within an iframe and having unrestricted access to network bandwidth or the CPU. I think we've gone through this pattern enough times now that we know how to manage it better than simply refusing to provide an API that lets developers state their preferences. It's much more nuanced than a simple binary "should developers have this power or not".
It's the user agent's job to take as input both the developer's desires (play audio, show a notification) and the user's desires (don't annoy me, let me accomplish my tasks) and mediate effectively in the interest of the user. Some users really do want some sites to play audio when they load even if they don't want it on most sites. So IMHO developers should have some standardized way to express their desire, and user agents should do their job in intersecting that with the user's desire however they find is best (the details are outside the scope of standards since it may be a constant arms race and an area of competition between browsers). For an example of how Chrome has successfully managed this tension, see its autoplay policy.
I don't think it's directly comparable at all.
It's quite clear what the scope of granting location access to a website is.
But here you are using something else with the website as an input (or potentially a lot of websites as an input). I suppose you might be able to argue that the end user trusts the websites, but that alone is not sufficient. They need to trust the website and trust that the website has guarded their privacy adequately in the face of this new utility. The website could maybe signal this with an I-Have-Done-The-Right-Thing header so the utility also knows they have done the right thing. But then all websites would have to opt-in to this brave new world and that's probably not what the utility maker actually wants. So instead the utility maker should probably find a way to preserve end user privacy regardless of the trust in the website.
The argument that sites may misuse this API is, I think, the same argument for any powerful API: notifications, fullscreen, storage, autoplaying audio, even things we don't mediate well today like focusing an input field from within an iframe and having unrestricted access to network bandwidth or the CPU. I think we've gone through this pattern enough times now that we know how to manage it better than simply refusing to provide an API that lets developers state their preferences. It's much more nuanced than a simple binary "should developers have this power or not".
The issue I see in your examples is whether a website that a user visits "abuses" its power against the user's interest and how to deal with it. That is not the issue I see with the suggested sensitive attribute. The issues I see with it are:
- Will enough AI applications actually respect the attribute? The attribute is useless if they don't. As an analogy, the Do Not Track header doesn't seem to have been widely respected and has since been deprecated. The newer Global Privacy Control is trying again; it seeks to become legally binding in more jurisdictions (the claim is that it is binding in Colorado and California), so that users can sue when their preference is ignored. Without legal backing, GPC would fail for the same reasons as DNT.
- Might the attribute be used to gather sensitive data more efficiently? In that case it would not just be useless, but harmful. As a mild analogy, DNT headers add another bit of information that trackers can use for browser fingerprinting, i.e. to track users more effectively.
Websites "abusing" this attribute by putting it everywhere isn't a threat for users in itself. It's not a case of too much power for developers. It's just something that would be a motivating factor for AI applications to ignore the attribute (i.e. to it being useless) and to apply their own heuristics instead, if they care about the protection of sensitive data at all.
the utility maker should probably find a way to preserve end user privacy regardless of the trust in the website.
IMHO sensitive isn't a mechanism to guard the user from the AI provider. We can't assume all websites will correctly annotate all possibly sensitive data; users will have to have some level of trust in their AI provider.
However, some sites serve data that comes with contractual/legal/compliance/etc. obligations about how that data is handled. In those cases, there should be a way to signal that this data requires explicit user consent before it is shared with a third party (an AI provider, a translation service, a backup service, etc.). The providers of this data are already incentivized to take reasonable steps to protect it.
For example, lots of jurisdictions have rules around health records. If a user is viewing their health data, they can copy/paste that data to send in an e-mail or do a web search with it. But if they ask an AI agent about terms on the page, it's not clear whether they actually intend to send the data to a third party. If the data is marked sensitive, the tool can ask for explicit user permission before sharing it.
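A hypothetical sketch of how a health portal might separate these two kinds of content; the markup, element names, and values are purely illustrative:

```html
<article>
  <!-- General reference material: terms the user might ask an agent to
       explain, with no personal data involved. -->
  <section id="glossary">
    <h2>Glossary</h2>
    <p><dfn>HbA1c</dfn>: a blood test that reflects average glucose levels.</p>
  </section>

  <!-- The user's own results: the attribute signals that an agent should
       obtain explicit permission before sending this text to a third party. -->
  <section id="my-results" sensitive>
    <h2>Your results</h2>
    <p>HbA1c: 6.1% (measured 2024-02-14)</p>
  </section>
</article>
```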
Will enough AI applications actually respect the attribute? The attribute is useless if they don't.
There will be good actors and bad actors. We should make it possible/easy for people who want to do the right thing. IMHO, the incentives on the tool makers here are very different from those around DNT.
And to avoid miscommunication: we're still very much in the exploration phase. We don't currently have a concrete proposal nor are we convinced any of the above is the right/correct solution.
My understanding is similar to what @bokand listed above. This attribute gives the user agent a strong hint that user permission should be obtained before sharing the data externally, e.g. with an AI service. This also aligns with the capture prevention use case discussed in https://github.com/whatwg/html/issues/10519#issuecomment-2405502296.