html Improve handling of malformed unicode bidi control characters

What is the issue with the HTML Standard?

Discussed this with @fantasai

TL;DR - Could <bdo>, <bdi>, and possibly dir be updated to deal with unbalanced/malformed Unicode bidi control characters to avoid them affecting their surrounding text?

When accepting user input, users will often paste text from other programs that sometimes contains malformed (unclosed, unopened) unicode bidi control characters. When you use this user input in HTML, wrapping them with <bdi>, <bdo>, and/or dir is not enough to avoid it breaking out of the supposed 'isolation', and it can wreak havoc on the surrounding text.

<!-- &#x202E; = 'RIGHT TO LEFT OVERRIDE' (unclosed) -->
<!-- Maybe pasted from another program (or maybe just malicious) -->
<input value="&#x202E;Jamie">

In the simplest case, the text is appears as expected:

<p>Hey <bdi>&#x202E;Jamie</bdi>, welcome!</p>
<!-- Rendered: Hey eimaJ, welcome! -->

But if we pop the directional isolate, you can break out of the <bdi>

<!-- &#x2069; = 'POP DIRECTIONAL ISOLATE' (unopened) -->
<p>Hey <bdi>&#x2069;&#x202E;Jamie</bdi>, welcome!</p>
<!-- Rendered: Hey !emoclew ,eimaJ -->

You could of course wrap it with another <bdo> and it will be fixed again:

<p>Hey <bdi><bdi>&#x2069;&#x202E;Jamie</bdi></bdi>, welcome!</p>
<!-- Rendered: Hey eimaJ, welcome! -->

What's missing is a balancing of these control characters. In order to combat this we've written a function to close fix-up these strings before they end up in the HTML.

// Wraps with 'FIRST STRONG ISOLATE' + text + 'POP_DIRECTIONAL_ISOLATE'
bidiIsolate('hello') // '\u2068hello\u2069'

// Closes unclosed isolates/overrides/formatting:
bidiIsolate('\u202Ahello') // '\u2068\u202Ahello\u202C\u2069'

// Strips out unopened 'POP' control chars:
bidiIsolate('\u2069hello') // '\u2068hello\u2069'

Source: https://github.com/signalapp/Signal-Desktop/blob/ce0fb220411b97722e1e080c14faa65d23165784/ts/util/unicodeBidi.ts#L124

See Code

// Copyright 2024 Signal Messenger, LLC
// SPDX-License-Identifier: AGPL-3.0-only

export function _bidiIsolate(text: string): string {
  // Wrap with with first strong isolate so directional characters appear
  // correctly.
  return (
    FIRST_STRONG_ISOLATE +
    balanceUnicodeDirControlChars(text) +
    POP_DIRECTIONAL_ISOLATE
  );
}

function balanceUnicodeDirControlChars(input: string): string {
  // This gets called by i18n code on many strings, so we want to avoid
  // as much work as possible
  if (!hasAnyUnicodeDirControlChars(input)) {
    return input;
  }

  let result = '';
  let formattingDepth = 0;
  let isolateDepth = 0;

  // We need to scan the entire input string and drop some characters as we
  // go in case they are closing something that was never opened.
  for (let index = 0; index < input.length; index += 1) {
    const char = input[index];
    switch (char) {
      case LTR_EMBEDDING:
      case RTL_EMBEDDING:
      case LTR_OVERRIDE:
      case RTL_OVERRIDE:
        formattingDepth += 1;
        result += char;
        break;
      case POP_DIRECTIONAL_FORMATTING:
        formattingDepth -= 1;
        // skip if its closing formatting that was never opened
        if (formattingDepth >= 0) {
          result += char;
        }
        break;
      case LTR_ISOLATE:
      case RTL_ISOLATE:
      case FIRST_STRONG_ISOLATE:
        isolateDepth += 1;
        result += char;
        break;
      case POP_DIRECTIONAL_ISOLATE:
        isolateDepth -= 1;
        // skip if its closing an isolate that was never opened
        if (isolateDepth >= 0) {
          result += char;
        }
        break;
      default:
        result += char;
        break;
    }
  }

  // Ensure everything is closed
  let suffix = '';
  if (formattingDepth > 0) {
    suffix += POP_DIRECTIONAL_FORMATTING.repeat(formattingDepth);
  }
  if (isolateDepth > 0) {
    suffix += POP_DIRECTIONAL_ISOLATE.repeat(isolateDepth);
  }

  return result + suffix;
}

By running text through this function before it ends up in HTML, we can at least prevent the malformed text from affecting its surrounding text which better aligns with the stated goal of <bdi>:

From [4.5.24]

The bdi element represents a span of text that is to be isolated from its surroundings for the purposes of bidirectional text formatting. [BIDI]

@fantasai suggested that this could possibly grafted onto <bdi>, <bdo>, and possibly all other uses of dir as well

Apr 05 '24 01:04 jamiebuilds

cc @whatwg/i18n

Apr 05 '24 15:04 annevk

I18N discussed this in our call of 2024-04-18 and I drew an action to respond.

The initial reaction I18N has was that this sounds like a problem and is worth solving. It feels like the idea is that element-level isolation is independent of text-level isolation.

I would probably not link it just to <bdi>/<bdo>, since we went to some effort to make the dir attribute isolating on all elements by default. Shouldn't this behavior be tied to the CSS style unicode-bidi: isolate? The intention is to isolate the textual contents.

The code looks like it might be effective.

Apr 18 '24 15:04 aphillips

Shouldn't this behavior be tied to the CSS style unicode-bidi: isolate? The intention is to isolate the textual contents.

@aphillips I'm not sure if that question was directed at me, but that makes enough sense to me. I don't have any strong opinions about specifically how this behavior is added, just a general problem statement.

Aug 19 '24 22:08 jamiebuilds

cc @whatwg/css @fantasai

Aug 20 '24 05:08 annevk

I wrote up some thoughts on this on Twitter, but I figure I'd write it up properly for this issue:

The thing that probably should be fixed is this text:

Following Unicode Bidirectional Algorithm clause HL3 [UAX9], values other than normal effectively insert the corresponding Unicode bidi control codes into the text stream at the start and end of the inline element before passing the paragraph to the Unicode bidirectional algorithm for reordering. (See § 2.4.2 CSS–Unicode Bidi Control Translation, Text Reordering.)

What you basically want this to do is for this to have the effect of inserting these control codes such that the corresponding push/pop directives are correctly retained in the bidi algorithm.

The bidi algorithm matches things up in BD8-BD11. A bidi implementation could potentially allow the caller to specify a priority as to how things get matched. The spec doesn't have any affordances for this, but it could (we could expand on HL3, for example). However I don't really see implementations moving to such a model easily; since the API for such a model is a mess.

Another way of fixing this is just sanitizing the content: remove any unmatched override/embedding/isolate initiators or pop directives. Not too complicated. Kind of annoying.

I think the best thing to do from the CSS spec standpoint is to spec these as basically creating a new bidi context such that their content is processed in a separate bidi algorithm run than their surroundings. In this case, both override and isolate would run the bidi algorithm on their content with strong directionality as the directionality of the override/isolate element, and their surrounding content treats them as opaque embedded content (similar to how inline images / other inline-block elements get handled today), with the override/embedding ones being treated as LRM/RLM in the surrounding content, and the isolates being treated exactly like other inline-block elements (as the spec says, use U+FFFC).

I believe isolate-override would be handled as an "auto" algorithm on the inner content and a neutral character (U+FFFC) on the surrounding content.

It's unclear to me exactly which spec (CSS2 visual formatting?) handles how inline-block elements in bidi text are handled so I'm not quite clear on how this is best written in spec text, but I think that's the best model to follow.

cc @eggrobin

Sep 23 '24 04:09 Manishearth

How is any other improperly paired element handled? Pasting "foo this is bold" or "foo non-italic"?

I don't see this as different. Other that this is potentially more "visually disruptive" than most other tags?

Well, some of the tags can also be disruptive than a simple bold. For example a  with a class or style that makes it hidden. Or with an id that is the same with another id that already exists in the target document?

And what about tags that are not explicitly included in the text I copy, but as a user I still expect the result to look the same:

<p style="color:red">Some text <b>from here on it is bold</b> and normal again.</p>
                     ^^^^^^^

If I copy-paste the marked section of text, when I paste it I expect it to be bold and red.

Are these already solved? If yes, can the bidi issue be solved the same way?

Sep 25 '24 15:09 mihnita

How is any other improperly paired element handled? Pasting "foo this is bold" or "foo non-italic"?

This is handled in the (X|XHT|HT)ML parser: closing tags are inserted automatically where necessary. (Unpaired closing tags are just ignored).

Unfortunately, handling bidi controls in the parser would be major scope creep.

It should absolutely be possible to represent unpaired bidi controls in XML, and it even should be possible to represent that in HTML, so sanitizing in the parser seems right out. Here, we want the application to be able to control where they actually "escape".

So this cannot be handled the same way, or at least cannot be handled in the same place.

The algorithm used could be the same: above I mention that one solution is to sanitize content by "remov[ing] any unmatched override/embedding/isolate initiators or pop directives", but a different way to do it would be to insert matching pop directives for unmatched control characters and remove lone pop directives.

It's not clear to me that sanitization is the best path forward. The main three paths I see are:

Sanitize the input text
- This can be done by removing all unpaired control characters, or
- This can be done by inserting pop characters where necessary and removing unmatched pops
Isolate the bidi algorithm run
Tweak the bidi algorithm to allow for higher-level control over pairing of bidi controls. I'm not in favor of this because I don't think bidi algorithm implementations will actually be able to provide a convenient API for this.

Sep 25 '24 18:09 Manishearth

html html copied to clipboard

Improve handling of malformed unicode bidi control characters

What is the issue with the HTML Standard?

html
html copied to clipboard