site-kit-wp
site-kit-wp copied to clipboard
International Domain Names (IDNs) are not presented/handled consistently across the plugin
Bug Description
Site Kit should allow users to set up Site Kit when they have an IDN, eg. a domain with Unicode/non-ASCII characters.
We support this in theory, but it seems like our implementation (along—potentially—with some of WordPress's handling of IDNs) is inconsistent, leading to minor bugs like failing to redirect properly (#5868) and possibly failing to create an Analytics account via Site Kit.
IDNs can be presented in two different ways: either in their "original"/"raw" format, eg. türkish.com, or in Punycode eg. xn--trkish-3ya.com. While WordPress, the Site Kit service/proxy, and parts of our plugin can handle either, it seems we aren't entirely consistent about how we present them to users or how we submit the domain name/site URL to Google Services like Analytics.
For instance, here's a site with the siteurl türkish.com, connecting an Analytics account. Note the domain detected/presented is Punycode and that this fails to set up an Analytics account:
https://user-images.githubusercontent.com/90871/213430033-5c8adabb-b044-471d-9e4b-9c05f4694ea9.mp4
Here's the same site after encountering that failed set up, when the domain is manually substituted for the Unicode one:
https://user-images.githubusercontent.com/90871/213430312-24acd253-3f7e-403a-b8a1-f33a1a86df19.mp4
If we set the siteurl to a Punycode string instead of a Unicode one, note that the Punycode Analytics account creation works just fine:
https://user-images.githubusercontent.com/90871/213432130-dbec3705-9f29-4cd4-b421-ee649679a286.mp4
It's worth noting that because we treat the Unicode and Punycode domains as different, there's also this odd error users can encounter when changing between the Unicode and non-Unicode siteurl:
Arguably that isn't really true, and we should probably consistently treat all IDNs as either Unicode or Punycode in our codebase and always present them as Unicode in the UI.
Do not alter or remove anything below. The following sections will be managed by moderators only.
Acceptance criteria
- The existing
URLclass should be extended to allow creation ofURLinstances which hold a WordPress/site URL (or any other URL) created from a string of either ASCII, punycode, or Unicode characters.- This class should accept a string as (part of) its constructor, and convert the string internally to an ASCII domain. Eg. If the string is already plain ASCII, like
google.com, the string would be stored as-is. If the string is Unicode (türkish.com), it should be converted to a punycode string (xn--trkish-3ya.com). - The class should offer two methods to output the URL: "user-facing" and "internal". The user-facing one should always be the "pretty", Unicode version, regardless of whether the input string was punycode or Unicode. The "internal" version should always be the punycode version.
- This class should accept a string as (part of) its constructor, and convert the string internally to an ASCII domain. Eg. If the string is already plain ASCII, like
- Going forward, Site Kit should handle all URL strings by converting them into a
URLinstance and using the correct output method based on the requirements. In PHP code, most usage is internal and not part of user-facing UI, so this will almost always be the$url->internal()format. - In JS, we should use the punycode URLs whenever dealing with non-user-facing URLs. Because some Google APIs could return Unicode domains/URLs in their responses, we may also need to convert URLs received from APIs with a JS
normalizeURL( url )type utility that normalises the URL into a punycode version. - In JS, we should always convert URLs from punycode to Unicode format when displaying them in UIs, like showing the connected Search Console property URL:
This change of "always using punycode internally" should not pose issues for users who "registered Site Kit" when the internal URL was Unicode and not punycode, because the Site ID is the relevant link between the plugin and Site Kit service.
However, comparing the "internal", punycode domain with existing Analytics properties (or other service's properties) may cause issues where an existing Analytics property exists but for a Unicode version of our "internal" punycode domain.
- This means when comparing existing property URLs with our WordPress URL, we should compare both punycode and Unicode versions to find a match.
Implementation Brief
Test Coverage
QA Brief
Changelog entry
@eugene-manuilov and @aaemnnosttv: I wrote up some ACs here that I think cover the approach we talked about on last week's call. Can either/both of you review it and let me know if that covers everything?
Thanks @tofumatt – I think this is a great start but there may be a few more things to consider
A few thoughts:
- Some of this might need handling on the service as well (wouldn't be defined here of course but we need to consider this somewhere)
- Which encoding would we use for redirects? I assume the internal form but maybe not? What was needed in the issue we had before?
- It would be useful to expose something like a
URL->equals()method which could take a string or another URL instance to perform the equivalence check internally - I feel like we need to do some discovery about services to see if there are instances where we need to pass one form or the other (e.g. when creating a property/datastream, etc) – the video you shared was using the legacy account provisioning endpoint so maybe it wouldn't be a problem now since we always create GA4 entities the same way. We should check others like GTM though.
- For reporting, we already include both forms of the site's URLs in the filters for data that we query all http(s), www/no-www, and puny/Unicode combinations, so this part should already be handled for services that support it
@eugene-manuilov Are you still planning to review this soon? If not, can you please unassign yourself so someone else can pick up? Thank you!
I have nothing to add to Evan's feedback. Assigning back to @tofumatt to address it.