
Public Suffix API

Open eligrey opened this issue 4 years ago • 14 comments

I propose that eTLD+1 should be exposed to web content through location.site and new URL('…').site.

Webapps shouldn't have to embed the Public Suffix List (currently 222KB!) to look up an eTLD+1 when our browsers already have a copy.

API examples:

location.href === 'https://github.com/whatwg/url/issues/528'
location.site === 'github.com'

new URL('https://subdomain.google.co.uk').site === 'google.co.uk'

eligrey avatar Jun 24 '20 22:06 eligrey

I agree, Webapps shouldn’t embed it today. https://github.com/sleevi/psl-problems lists several reasons. In the context of apps, they have zero guarantees that the copy they embed matches how the browser is processing things, and there’s zero guarantee across browsers.

Why are they embedding today?

sleevi avatar Jun 25 '20 00:06 sleevi

We certainly have the public suffix list and use it for a growing number of things inside WebKit. I do think we should understand the intended use case more.

achristensen07 avatar Jun 25 '20 02:06 achristensen07

UI visibility/accessibility use-case

One obvious use-case is already present in browser address bars: highlighting the eTLD+1 of a domain. Pages could embolden the eTLD+1 of shown domains so that users can more easily see the important part of the domain.
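As a minimal sketch of that use-case: given the eTLD+1 from the proposed accessor (here supplied by the caller as `site`, since no such API exists today), a page could split a hostname into a de-emphasized prefix and the registrable domain to embolden:

```javascript
// Hypothetical sketch: split a hostname around its eTLD+1 for display.
// "site" stands in for the value the proposed location.site / URL#site
// accessor would return; in this sketch the caller supplies it.
function splitForDisplay(hostname, site) {
  if (hostname === site) return { prefix: '', site };
  if (!hostname.endsWith('.' + site)) {
    throw new Error('site is not a suffix of hostname');
  }
  return { prefix: hostname.slice(0, hostname.length - site.length), site };
}

// A page could then render the result as `${prefix}<b>${site}</b>`.
const parts = splitForDisplay('subdomain.google.co.uk', 'google.co.uk');
// parts.prefix === 'subdomain.', parts.site === 'google.co.uk'
```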

Hostname matching use-case

Code that is eTLD+1-aware can determine whether two hostnames belong to the same site with a single comparison, instead of iterating over suffix candidates.
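To illustrate what that simplification looks like, here is a sketch. The `siteOf` helper uses a tiny hardcoded suffix set purely for demonstration; with the proposed API it would collapse to `new URL(href).site`:

```javascript
// Illustrative only: a toy suffix set, NOT the real Public Suffix List.
const DEMO_SUFFIXES = new Set(['com', 'uk', 'co.uk']);

// Return the eTLD+1: one label more than the longest matching suffix.
function siteOf(hostname) {
  const labels = hostname.split('.');
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (DEMO_SUFFIXES.has(suffix)) return labels.slice(i - 1).join('.');
  }
  return hostname;
}

// With the eTLD+1 in hand, same-site matching is one string comparison.
const sameSite = (a, b) => siteOf(a) === siteOf(b);

sameSite('a.github.com', 'b.github.com'); // → true
sameSite('a.google.co.uk', 'evil.co.uk'); // → false
```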

CSP generation optimization use-case

I'm working on a synchronous, blocking JavaScript library that regulates client-side network emissions and generates dynamic Content Security Policies based on user tracking consent. The library is currently not eTLD+1-aware because of the size of the PSL; if it were, I could generate more concise CSPs.
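For example (a sketch, not the library's actual code): collapsing observed hosts to their eTLD+1 lets many subdomain entries become a single wildcard source. `getSite` stands in for the proposed API, and the `demoSite` helper is a deliberately naive placeholder:

```javascript
// Hypothetical sketch: shorten a generated CSP directive by collapsing
// observed hosts to eTLD+1 wildcards. getSite stands in for the proposed
// new URL(href).site accessor.
function connectSrc(hosts, getSite) {
  const sites = new Set(hosts.map(getSite));
  return 'connect-src ' + [...sites].map((s) => `https://*.${s}`).join(' ');
}

// Naive stand-in; wrong for multi-label suffixes like co.uk.
const demoSite = (h) => h.split('.').slice(-2).join('.');

connectSrc(['a.tracker.com', 'b.tracker.com', 'cdn.ads.net'], demoSite);
// → 'connect-src https://*.tracker.com https://*.ads.net'
```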

eligrey avatar Jun 25 '20 05:06 eligrey

Could you explain more what is meant by iterative hostname matching?

sleevi avatar Jun 25 '20 05:06 sleevi

I edited the hostname matching use case to make more sense. The UI visibility use-case seems like the most compelling reason for this to be exposed to web content in my opinion.

eligrey avatar Jun 25 '20 05:06 eligrey

I’m still not sure where or why the hostname matching is relevant. I understand the intent, but it’s not clear to me the use case or why it would be relevant or necessary for the user agent to expose.

The UI case is, oddly, the least compelling I think. That’s an area where things are still very fluid and there isn’t a well-paved path, so carrying things in JS seems reasonable.

It might be useful if you could point to some of the existing well-lit paths in libraries and how/why they expose this? In general, code shouldn’t be relying on the PSL or eTLD+1, as mentioned previously, so it’s useful to better understand the use cases to find designs that work.

sleevi avatar Jun 25 '20 05:06 sleevi

code shouldn’t be relying on the PSL or eTLD+1

In a world where we're using SiteForCookies for numerous things (like SameSite cookies), and for things as obscure as whether credential prompts may show for cross-site images (https://textslashplain.com/2020/08/17/seamless-single-sign-on/), it would be super useful to be able to reliably determine the browser's understanding of what constitutes the registrable domain portion of an origin.

For instance, today, given https://deeply.nested.fuzzle.bunnies.io/, how would you as a browser expert go about determining what the current Chrome Beta build thinks is the registrable domain? Even using the Devtools, I have no idea how to do it short of creating convoluted test cases that probe for whether cookies can be created and whatnot.
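Roughly, the probing approach being described looks like this sketch: walk the suffixes from shortest to longest and take the first one for which a domain-scoped cookie would stick. The `canSetCookieFor` predicate is hypothetical; in a real page it would attempt document.cookie = 'probe=1; domain=' + candidate and check whether the cookie was accepted:

```javascript
// Sketch of probing for the registrable domain via cookie-settability.
// canSetCookieFor is a stand-in predicate; in a browser it would try to
// set (and then delete) a cookie with domain=<candidate> and report
// whether the cookie was accepted.
function registrableDomain(hostname, canSetCookieFor) {
  const labels = hostname.split('.');
  for (let i = labels.length - 1; i > 0; i--) {
    const candidate = labels.slice(i).join('.');
    if (canSetCookieFor(candidate)) return candidate; // first settable suffix
  }
  return hostname;
}

// Stub mimicking a browser that treats 'io' as a public suffix.
const stub = (d) => d !== 'io';
registrableDomain('deeply.nested.fuzzle.bunnies.io', stub); // → 'bunnies.io'
```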

ericlaw1979 avatar Aug 18 '20 21:08 ericlaw1979

That's also increasingly the case for privacy features.

cc @englehardt @johnwilander

annevk avatar Aug 19 '20 02:08 annevk

This would likely help with scoping quotas for captured tracker data in a client-side privacy & security framework that blocks and captures tracker requests for later replay with consent.

Specifically, I want to implement a general-purpose quota system keyed by eTLD+1 for such a client-side framework. The framework needs to load synchronously to provide its benefits, so I wouldn't want to include a >222KB list that the browser already has in memory.
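A sketch of that quota shape, assuming some `siteOf` source for the eTLD+1 (the one below is a naive stand-in, not the proposed API):

```javascript
// Hypothetical sketch: a capture quota keyed by eTLD+1, so all subdomains
// of a tracker draw from one bucket. siteOf stands in for the proposed API.
class SiteQuota {
  constructor(limitBytes, siteOf) {
    this.limit = limitBytes;
    this.siteOf = siteOf;
    this.used = new Map(); // site -> bytes captured so far
  }
  tryCapture(hostname, bytes) {
    const site = this.siteOf(hostname);
    const next = (this.used.get(site) ?? 0) + bytes;
    if (next > this.limit) return false; // bucket exhausted for this site
    this.used.set(site, next);
    return true;
  }
}

// Naive siteOf stand-in; wrong for multi-label suffixes like co.uk.
const quota = new SiteQuota(100, (h) => h.split('.').slice(-2).join('.'));
quota.tryCapture('a.tracker.com', 60); // → true
quota.tryCapture('b.tracker.com', 60); // → false (shared bucket is full)
```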

eligrey avatar Aug 19 '20 03:08 eligrey

For instance, today, given https://deeply.nested.fuzzle.bunnies.io/, how would you as a browser expert go about determining what the current Chrome Beta build thinks is the registrable domain? Even using the Devtools, I have no idea how to do it short of creating convoluted test cases that probe for whether cookies can be created and whatnot.

Why would you need to, though? Beyond the academic interest, it seems useful to work out the use cases that explain the why.

For example, the fact that you’ll get different answers from different user agents and versions means that, by necessity, you need to design a path for the case where what the browser tells you isn’t what you “want” (e.g. you’re not on the list).

My concern is with “specifying” a feature that, by necessity, returns different answers for different versions and different browsers. At best, it makes developers’ lives hell; we should, as browser engineers, try to avoid that by understanding the use cases. In the worst case, it begins ossifying everything by creating compat issues from changing the PSL and/or discouraging its use.

sleevi avatar Aug 19 '20 03:08 sleevi

Why would you need to, though?

Well, my personal immediate need was responding to an enterprise who asked "Hey, we have a page at deeply.nested.fuzzle.bunnies.io and we'd like to understand where we have to host a subresource such that it's not considered Same Site to us. Help." To answer that question, I could manually look through the current public suffix list and then, for completeness, go through its changelog to see whether it might possibly have changed between the browser build they're using and the current version. That's super-cumbersome.

Another possible use case (which I suspect you might not be happy about) is the best practice documented in Guidelines for URL Display, which calls for ensuring that the registrable domain portion of a URL is always visible. Implementing that best practice in a web application requires knowing the registrable domain.

ericlaw1979 avatar Aug 19 '20 15:08 ericlaw1979

To answer that question, I could manually look through the current public suffix list and then, for completeness, go through its changelog to see whether it might possibly have changed between the browser build they're using and the current version. That's super-cumbersome.

But exposing a Web API here doesn't change that, which is the point. The fact that there are significantly different cadences and release cycles means that any answer to that question is contingent upon browser and version and may change.

That is, if you had an API, you'd still have to test in every browser you care about to see if you got the answer you wanted.

Worse, if the website isn't happy with the answer and wants to change the PSL, it still has to deal with all of that. And none of it may affect other users and clients (e.g. system web views, command-line clients and tools, libraries that interact with the browser cookie store on mobile devices, etc.).

That's why I think it's a bit of a false equivalence: a Web API could make this marginally easier, but that margin is within the noise threshold of an already overly complex space, so the saving doesn't amount to much while making it even harder to rein things in.

Another possible use case (which I suspect you might not be happy about) is the best practice documented in Guidelines for URL Display, which calls for ensuring that the registrable domain portion of a URL is always visible. Implementing that best practice in a web application requires knowing the registrable domain.

Yes, you're right that I think implementing something like Google Search's AMP URL Bar would be actively detrimental to the Web Platform ;) I'm not necessarily sure that primitives and design space specific to a browser trusted UI surface necessarily generalize onto other surfaces, especially web content controlled. This is where a more complete use case would be useful to evaluate, because I don't know that we can or should try to generalize those principles to all display surfaces.

Further, that display doesn't require a Web API, because it's not fundamentally necessary to keep it in lock-step with a given UA. There are already viable alternatives: if you want to do it client-side, you can ship a public suffix list you control, or you can do it server-side.

I realize these are highlighting existing alternatives, and that's perhaps dissatisfying, but this is why it's important to figure out why it matters what a particular UA is using. At best, this creates another area for API surfaces to drift between UAs and provide different experiences. While such drift is fundamental to the use of a PSL, it's part of why moving away from the PSL is so important. Hence my concern about adding new dependencies, especially dependencies that, once Web-exposed, become part of a mostly unbreakable API contract, for something we know is fundamentally flawed and irredeemable.

sleevi avatar Aug 19 '20 15:08 sleevi

This is something we could use at Surfly. Here's our use case:

At Surfly, we are building an application-level proxy. Part of the sandbox system is URL encoding: we have an encoding scheme that maps arbitrary URLs to URLs within our own domain. At the same time, we want to make use of the same-origin policy and replicate cookie behaviour. Since cookie scopes depend heavily on the site (that is, the eTLD+1), we need to know whether a specific cookie will be processed by the browser or ignored.

For example, if an app does document.cookie = 'ignore=this; domain=co.uk', the cookie will not be set. When translating this into the sandbox domain, we need to detect that.
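A sketch of the check we'd like to perform, assuming the proposed eTLD+1 accessor (injected here as `siteOf`, since no such API exists today):

```javascript
// Hypothetical sketch: would the browser honor this cookie Domain attribute?
// A Domain attribute must be a suffix of the host and must not be broader
// than the host's eTLD+1 (i.e. it must not be a public suffix).
// siteOf stands in for the proposed API.
function cookieDomainAllowed(host, cookieDomain, siteOf) {
  const d = cookieDomain.replace(/^\./, ''); // leading dot is ignored per RFC 6265
  if (host !== d && !host.endsWith('.' + d)) return false; // not a suffix of host
  return d.length >= siteOf(host).length; // at least as specific as the eTLD+1
}

const demoSiteOf = () => 'google.co.uk'; // hypothetical eTLD+1 for this host
cookieDomainAllowed('app.google.co.uk', 'co.uk', demoSiteOf);        // → false
cookieDomainAllowed('app.google.co.uk', 'google.co.uk', demoSiteOf); // → true
```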

muodov avatar Oct 13 '21 19:10 muodov

At Discord we are interested in following best practices when displaying URLs to users to ensure they can understand the implications of a navigation and decide with confidence. eTLD+1 falls into these best practices, but it is not practical to ship the entire public suffix list in our web app.

Also, I'm not entirely sure that an API on URL is exactly what we would want in this case. A separate API specifically designed for displaying URLs may be more suitable (maybe something like a URL segmenter?). Either way, eTLD+1 functionality would be great to have.

devsnek avatar Mar 31 '23 16:03 devsnek