AK+LibUnicode+Ladybird+Browser: Handle converting domains from Unicode to ASCII

Open skyrising opened this issue 2 years ago • 4 comments

This set of commits implements Punycode conversion and the UTS 46 Unicode processing for domain names, and hooks both up to the user-facing URL inputs of Browser and Ladybird (command line and address bar).
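For readers unfamiliar with Punycode, here is a minimal standalone sketch of the RFC 3492 encoding procedure in standard C++. It is not the code from this PR: the function name `punycode_encode`, the use of `std::u32string` and `std::optional`, and the helper names are assumptions made for illustration, and the sketch does not validate that the input consists of Unicode scalar values.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <string>

// Parameter values fixed by RFC 3492, section 5.
constexpr uint32_t BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
constexpr uint32_t INITIAL_BIAS = 72, INITIAL_N = 128;

// Digits 0..25 map to 'a'..'z', digits 26..35 map to '0'..'9'.
static char encode_digit(uint32_t d)
{
    return d < 26 ? static_cast<char>('a' + d) : static_cast<char>('0' + (d - 26));
}

// Bias adaptation function (RFC 3492, section 6.1).
static uint32_t adapt(uint32_t delta, uint32_t num_points, bool first_time)
{
    delta = first_time ? delta / DAMP : delta / 2;
    delta += delta / num_points;
    uint32_t k = 0;
    while (delta > ((BASE - TMIN) * TMAX) / 2) {
        delta /= BASE - TMIN;
        k += BASE;
    }
    return k + ((BASE - TMIN + 1) * delta) / (delta + SKEW);
}

// Hypothetical helper: encodes one label, returns std::nullopt on overflow.
std::optional<std::string> punycode_encode(std::u32string const& input)
{
    constexpr uint32_t max = std::numeric_limits<uint32_t>::max();
    std::string output;

    // Copy the basic (ASCII) code points first, in order.
    for (char32_t c : input)
        if (c < 0x80)
            output.push_back(static_cast<char>(c));

    uint32_t const basic_count = static_cast<uint32_t>(output.size());
    uint32_t handled = basic_count;
    if (basic_count > 0)
        output.push_back('-'); // delimiter between basic part and deltas

    uint32_t n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
    while (handled < input.size()) {
        // Find the smallest code point >= n that has not been handled yet.
        uint32_t m = max;
        for (char32_t c : input)
            if (c >= n && c < m)
                m = c;

        if (m - n > (max - delta) / (handled + 1))
            return std::nullopt; // delta would overflow
        delta += (m - n) * (handled + 1);
        n = m;

        for (char32_t c : input) {
            if (c < n && ++delta == 0)
                return std::nullopt; // delta overflowed
            if (c == n) {
                // Emit delta as a generalized variable-length integer.
                uint32_t q = delta;
                for (uint32_t k = BASE;; k += BASE) {
                    uint32_t t = k <= bias ? TMIN : (k >= bias + TMAX ? TMAX : k - bias);
                    if (q < t)
                        break;
                    output.push_back(encode_digit(t + (q - t) % (BASE - t)));
                    q = (q - t) / (BASE - t);
                }
                output.push_back(encode_digit(q));
                bias = adapt(delta, handled + 1, handled == basic_count);
                delta = 0;
                ++handled;
            }
        }
        ++delta;
        ++n;
    }
    return output;
}
```

Calling `punycode_encode(U"b\u00FCcher")` yields `bcher-kva`. In a domain name, IDNA prefixes such an encoded label with `xn--`, giving `xn--bcher-kva`.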

The parsing of these URLs is made opt-in through Unicode::create_unicode_url because it requires linking with LibUnicode. Ideally the normal URL parser would handle them directly, but that would require a major overhaul, most likely involving moving URL out of AK. This, while somewhat of a hack, seems to be the least invasive solution for now.

skyrising avatar Jun 15 '23 23:06 skyrising

Addressed most of the comments and left comments on the rest. Also rebased on master for the SourceGenerator changes.

skyrising avatar Jun 19 '23 18:06 skyrising

Fixed errors when compiling with ENABLE_UNICODE_DATABASE_DOWNLOAD=off

skyrising avatar Jun 19 '23 19:06 skyrising

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

stale[bot] avatar Jul 10 '23 22:07 stale[bot]

This has a couple of very minor conflicts.

AtkinsSJ avatar Jul 11 '23 13:07 AtkinsSJ

Fixed the new conflicts again. It looks like CI breaks on linting Meta/generate-libwasm-spec-test.py, which I didn't touch.

skyrising avatar Jul 29 '23 19:07 skyrising

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

stale[bot] avatar Sep 21 '23 21:09 stale[bot]

Resolved conflicts and made create_unicode_url return ErrorOr<URL> instead of setting the URL to invalid.

git range-diff 8e03206..ba5c6b3 ebb822d..b99ab77:

1:  9a1050a407 = 1:  d8b9a7c0b8 LibUnicode: Add Punycode::decode
2:  b7c19a21ed = 2:  39dc038565 LibUnicode: Add Punycode::encode
3:  4e515bc915 = 3:  625cf51787 LibUnicode: Download and parse IDNA data
4:  372a5d2b37 ! 4:  56c546e613 LibUnicode: Add IDNA::to_ascii
    @@ Userland/Libraries/LibUnicode/IDNA.cpp (new)
     +{
     +    // 1.The label must be in Unicode Normalization Form NFC.
     +    auto normalized = normalize(label, NormalizationForm::NFC);
    -+    if (normalized.is_error() || normalized.release_value() != label)
    ++    if (normalized != label)
     +        return false;
     +
     +    size_t position = 0;
    @@ Userland/Libraries/LibUnicode/IDNA.cpp (new)
     +    }
     +
     +    // 2. Normalize. Normalize the domain_name string to Unicode Normalization Form C.
    -+    auto normalized = TRY(normalize(mapped.string_view(), NormalizationForm::NFC));
    ++    auto normalized = normalize(mapped.string_view(), NormalizationForm::NFC);
     +
     +    // 3. Break. Break the string into labels at U+002E ( . ) FULL STOP.
     +    auto labels = TRY(normalized.split('.', SplitBehavior::KeepEmpty));
5:  69e3461708 ! 5:  148a79213c AK+LibUnicode: Add Unicode::create_unicode_url
    @@ Commit message
         This is a workaround for the fact that AK::URLParser can't call into
         LibUnicode directly.
     
    - ## AK/URL.h ##
    -@@ AK/URL.h: public:
    -         m_paths.append("");
    -     }
    - 
    -+    void set_invalid() { m_valid = false; }
    -+
    -     DeprecatedString serialize_path() const;
    -     DeprecatedString serialize(ExcludeFragment = ExcludeFragment::No) const;
    -     DeprecatedString serialize_for_display() const;
    -
      ## AK/URLParser.cpp ##
     @@ AK/URLParser.cpp: static Optional<URL::Host> parse_host(StringView input, bool is_not_special = fa
          // FIXME: 4. Let domain be the result of running UTF-8 decode without BOM on the percent-decoding of input.
    @@ Userland/Libraries/LibUnicode/URL.cpp (new)
     +}
     +
     +// https://url.spec.whatwg.org/#concept-host-parser
    -+URL create_unicode_url(String const& url_string)
    ++ErrorOr<URL> create_unicode_url(String const& url_string)
     +{
     +    // NOTE: 1.-4. are implemented in URLParser::parse_host
     +
    @@ Userland/Libraries/LibUnicode/URL.cpp (new)
     +        return url;
     +
     +    // 5. Let asciiDomain be the result of running domain to ASCII with domain and false.
    -+    auto ascii_domain = domain_to_ascii(domain.bytes_as_string_view(), false);
     +    // 6. If asciiDomain is failure, then return failure.
    -+    if (ascii_domain.is_error())
    -+        url.set_invalid();
    -+    else
    -+        url.set_host(ascii_domain.release_value());
    ++    auto ascii_domain = TRY(domain_to_ascii(domain.bytes_as_string_view(), false));
     +
     +    // FIXME: Reimplement 7. or call into URLParser::parse_host using ascii_domain (8. & 9. do not apply)
    ++    url.set_host(ascii_domain);
     +    return url;
     +}
     +
    @@ Userland/Libraries/LibUnicode/URL.h (new)
     +
     +namespace Unicode {
     +
    -+URL create_unicode_url(String const&);
    ++ErrorOr<URL> create_unicode_url(String const&);
     +
     +}
6:  81dd985ee9 = 6:  2c8fed8446 Ladybird: Fix ak_string_from_qstring truncation for non-ASCII strings
7:  019298ab5b < -:  ---------- Ladybird: Handle navigating to Unicode domains
8:  ba5c6b3133 < -:  ---------- Browser: Handle navigating to Unicode domains
-:  ---------- > 7:  3adf5c454c Ladybird: Handle navigating to Unicode domains
-:  ---------- > 8:  b99ab77e29 Browser: Handle navigating to Unicode domains
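
The change in commit 5 replaces "mark the URL invalid" with "return the error to the caller". As a rough analogy in standard C++ (not AK code), the same shape can be sketched with std::optional. All names below (`Url`, `to_ascii_stub`, `create_url_checked`) are hypothetical, and the stub only rejects empty input or input containing a space; it does not perform any IDNA processing.

```cpp
#include <optional>
#include <string>

// Hypothetical minimal URL type for this sketch only.
struct Url {
    std::string scheme;
    std::string host;
};

// Stand-in for a fallible domain conversion. Assumption: it fails on
// empty input or input containing a space, and otherwise passes the
// string through unchanged.
std::optional<std::string> to_ascii_stub(std::string const& domain)
{
    if (domain.empty() || domain.find(' ') != std::string::npos)
        return std::nullopt;
    return domain;
}

// Failure is part of the return type, so the caller has to handle it
// explicitly instead of checking a validity flag on the result afterwards.
std::optional<Url> create_url_checked(std::string const& scheme, std::string const& domain)
{
    auto ascii_domain = to_ascii_stub(domain);
    if (!ascii_domain)
        return std::nullopt; // early return, comparable to what TRY() does in AK
    return Url { scheme, *ascii_domain };
}
```

With this shape, `create_url_checked("https", "")` returns an empty optional, while `create_url_checked("https", "example.org")` returns a `Url` whose `host` is `example.org`.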

skyrising avatar Sep 27 '23 22:09 skyrising

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

stale[bot] avatar Oct 19 '23 05:10 stale[bot]

Rebased on master and dropped all the per-chrome commits in favor of 26a6974.

skyrising avatar Oct 21 '23 19:10 skyrising

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

stale[bot] avatar Dec 09 '23 06:12 stale[bot]