serenity
AK+LibUnicode+Ladybird+Browser: Handle converting domains from Unicode to ASCII
This set of commits implements Punycode conversion and the UTS #46 Unicode processing for domain names, and hooks both up to the user-facing URL inputs of Browser and Ladybird (command line and address bar).
The parsing of these URLs is made opt-in through Unicode::create_unicode_url because it requires linking with LibUnicode.
Ideally the normal URL parser would handle them directly, but that would require a major overhaul, most likely involving moving URL out of AK.
This, while somewhat of a hack, seems to be the least invasive solution for now.
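For readers unfamiliar with Punycode, the encoding step can be sketched as below. This is a minimal standalone illustration of the RFC 3492 encoder using standard-library types, not the code from this PR; the `punycode_sketch` namespace and every function name in it are hypothetical. It encodes a single label and leaves out the `xn--` prefix that IDNA adds in front of an encoded label.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>

namespace punycode_sketch {

// Parameters from RFC 3492, section 5.
constexpr uint32_t base = 36, t_min = 1, t_max = 26, skew = 38, damp = 700;
constexpr uint32_t initial_bias = 72, initial_n = 128;

// Map a digit value 0..35 to 'a'..'z', '0'..'9'.
inline char encode_digit(uint32_t d)
{
    return d < 26 ? static_cast<char>('a' + d) : static_cast<char>('0' + (d - 26));
}

// Bias adaptation, RFC 3492 section 6.1.
inline uint32_t adapt(uint32_t delta, uint32_t num_points, bool first_time)
{
    delta = first_time ? delta / damp : delta / 2;
    delta += delta / num_points;
    uint32_t k = 0;
    while (delta > ((base - t_min) * t_max) / 2) {
        delta /= base - t_min;
        k += base;
    }
    return k + (((base - t_min + 1) * delta) / (delta + skew));
}

// Encode one label of code points; returns nullopt on arithmetic overflow.
inline std::optional<std::string> encode(std::u32string const& input)
{
    constexpr uint32_t max_int = std::numeric_limits<uint32_t>::max();

    // Copy the basic (ASCII) code points first, then a delimiter if there were any.
    std::string output;
    for (char32_t c : input)
        if (c < initial_n)
            output.push_back(static_cast<char>(c));
    uint32_t const b = static_cast<uint32_t>(output.size());
    uint32_t h = b;
    if (b > 0)
        output.push_back('-');

    uint32_t n = initial_n, delta = 0, bias = initial_bias;
    while (h < input.size()) {
        // Find the smallest code point >= n that has not been handled yet.
        uint32_t m = max_int;
        for (char32_t c : input)
            if (c >= n && c < m)
                m = c;
        if (m - n > (max_int - delta) / (h + 1))
            return std::nullopt;
        delta += (m - n) * (h + 1);
        n = m;

        for (char32_t c : input) {
            if (c < n && ++delta == 0)
                return std::nullopt;
            if (c != n)
                continue;
            // Emit delta as a generalized variable-length integer.
            uint32_t q = delta;
            for (uint32_t k = base;; k += base) {
                uint32_t t = k <= bias ? t_min : (k >= bias + t_max ? t_max : k - bias);
                if (q < t)
                    break;
                output.push_back(encode_digit(t + (q - t) % (base - t)));
                q = (q - t) / (base - t);
            }
            output.push_back(encode_digit(q));
            bias = adapt(delta, h + 1, h == b);
            delta = 0;
            ++h;
        }
        ++delta;
        ++n;
    }
    return output;
}

}
```

With this sketch, the label `bücher` encodes to `bcher-kva`, which IDNA would then write as `xn--bcher-kva`.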
Addressed most of the comments and left comments on the rest. Also rebased on master for the SourceGenerator changes.
Fixed errors when compiling with ENABLE_UNICODE_DATABASE_DOWNLOAD=off.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!
This has a couple of very minor conflicts.
Fixed the new conflicts again. It looks like CI breaks on linting Meta/generate-libwasm-spec-test.py, which I didn't touch.
Resolved conflicts and made create_unicode_url return ErrorOr<URL> instead of setting the URL to invalid.
git range-diff 8e03206..ba5c6b3 ebb822d..b99ab77:
1: 9a1050a407 = 1: d8b9a7c0b8 LibUnicode: Add Punycode::decode
2: b7c19a21ed = 2: 39dc038565 LibUnicode: Add Punycode::encode
3: 4e515bc915 = 3: 625cf51787 LibUnicode: Download and parse IDNA data
4: 372a5d2b37 ! 4: 56c546e613 LibUnicode: Add IDNA::to_ascii
@@ Userland/Libraries/LibUnicode/IDNA.cpp (new)
+{
+ // 1.The label must be in Unicode Normalization Form NFC.
+ auto normalized = normalize(label, NormalizationForm::NFC);
-+ if (normalized.is_error() || normalized.release_value() != label)
++ if (normalized != label)
+ return false;
+
+ size_t position = 0;
@@ Userland/Libraries/LibUnicode/IDNA.cpp (new)
+ }
+
+ // 2. Normalize. Normalize the domain_name string to Unicode Normalization Form C.
-+ auto normalized = TRY(normalize(mapped.string_view(), NormalizationForm::NFC));
++ auto normalized = normalize(mapped.string_view(), NormalizationForm::NFC);
+
+ // 3. Break. Break the string into labels at U+002E ( . ) FULL STOP.
+ auto labels = TRY(normalized.split('.', SplitBehavior::KeepEmpty));
5: 69e3461708 ! 5: 148a79213c AK+LibUnicode: Add Unicode::create_unicode_url
@@ Commit message
This is a workaround for the fact that AK::URLParser can't call into
LibUnicode directly.
- ## AK/URL.h ##
-@@ AK/URL.h: public:
- m_paths.append("");
- }
-
-+ void set_invalid() { m_valid = false; }
-+
- DeprecatedString serialize_path() const;
- DeprecatedString serialize(ExcludeFragment = ExcludeFragment::No) const;
- DeprecatedString serialize_for_display() const;
-
## AK/URLParser.cpp ##
@@ AK/URLParser.cpp: static Optional<URL::Host> parse_host(StringView input, bool is_not_special = fa
// FIXME: 4. Let domain be the result of running UTF-8 decode without BOM on the percent-decoding of input.
@@ Userland/Libraries/LibUnicode/URL.cpp (new)
+}
+
+// https://url.spec.whatwg.org/#concept-host-parser
-+URL create_unicode_url(String const& url_string)
++ErrorOr<URL> create_unicode_url(String const& url_string)
+{
+ // NOTE: 1.-4. are implemented in URLParser::parse_host
+
@@ Userland/Libraries/LibUnicode/URL.cpp (new)
+ return url;
+
+ // 5. Let asciiDomain be the result of running domain to ASCII with domain and false.
-+ auto ascii_domain = domain_to_ascii(domain.bytes_as_string_view(), false);
+ // 6. If asciiDomain is failure, then return failure.
-+ if (ascii_domain.is_error())
-+ url.set_invalid();
-+ else
-+ url.set_host(ascii_domain.release_value());
++ auto ascii_domain = TRY(domain_to_ascii(domain.bytes_as_string_view(), false));
+
+ // FIXME: Reimplement 7. or call into URLParser::parse_host using ascii_domain (8. & 9. do not apply)
++ url.set_host(ascii_domain);
+ return url;
+}
+
@@ Userland/Libraries/LibUnicode/URL.h (new)
+
+namespace Unicode {
+
-+URL create_unicode_url(String const&);
++ErrorOr<URL> create_unicode_url(String const&);
+
+}
6: 81dd985ee9 = 6: 2c8fed8446 Ladybird: Fix ak_string_from_qstring truncation for non-ASCII strings
7: 019298ab5b < -: ---------- Ladybird: Handle navigating to Unicode domains
8: ba5c6b3133 < -: ---------- Browser: Handle navigating to Unicode domains
-: ---------- > 7: 3adf5c454c Ladybird: Handle navigating to Unicode domains
-: ---------- > 8: b99ab77e29 Browser: Handle navigating to Unicode domains
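The main change in this range-diff is that a failed domain conversion is now reported through the return value instead of by flipping the URL to invalid. A rough standalone sketch of that pattern follows, with `std::optional` standing in for `ErrorOr`. Everything in the `url_sketch` namespace is hypothetical, and its `domain_to_ascii` is only a placeholder that lowercases ASCII and rejects anything else. It does not perform the real UTS #46 processing.

```cpp
#include <cassert>
#include <optional>
#include <string>

namespace url_sketch {

// Hypothetical, simplified URL type for illustration only.
struct Url {
    std::string scheme;
    std::string host;
};

// Placeholder for the real conversion: lowercases ASCII letters and fails on
// any non-ASCII byte. The real function would map, normalize and Punycode-encode.
inline std::optional<std::string> domain_to_ascii(std::string const& domain)
{
    std::string result;
    for (char c : domain) {
        if (static_cast<unsigned char>(c) >= 0x80)
            return std::nullopt;
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
        result.push_back(c);
    }
    return result;
}

// Failure is propagated to the caller through the return type, so there is
// no half-initialized URL object that callers must remember to check.
inline std::optional<Url> create_url(std::string const& scheme, std::string const& domain)
{
    auto ascii_domain = domain_to_ascii(domain);
    if (!ascii_domain)
        return std::nullopt;
    return Url { scheme, *ascii_domain };
}

}
```

A caller then has to handle the failure case explicitly, for example by checking `has_value()` before navigating.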
Rebased on master and dropped all the per-chrome commits in favor of 26a6974.