rust-unic

API to override char property values for Private-Use chars

Open behnam opened this issue 8 years ago • 3 comments

From http://www.unicode.org/faq/private_use.html:

Private-use characters are code points whose interpretation is not specified by a character encoding standard and whose use and interpretation may be determined by private agreement among cooperating users. Private-use characters are sometimes also referred to as user-defined characters (UDC) or vendor-defined characters (VDC).

One should not expect the rest of an operating system to override the character properties for private-use characters, since private use characters can have different meanings, depending on how they originated. In terms of line breaking, case conversions, and other textual processes, private-use characters will typically be treated by the operating system as otherwise undistinguished letters (or ideographs) with no uppercase/lowercase distinctions.

Basically, a system can assign its own internal meaning to PUA characters, and with meaning come character properties. UNIC should allow overriding property values for PUA characters.

How to do that in Rust while maintaining Cargo package boundaries could be tricky and needs some pondering.

What assumptions can we make?

  • I think it's safe to assume that any override would affect all instances of UNIC libraries in existence, even those used only internally by some dependency.

  • The above comes with the assumption that none of the dependent libraries assigns a meaning to any PUA char.

  • And, since it's logical to have libraries that assign PUA chars for use by other libraries, we need to make sure parallel assignments do not conflict in any way: either the code points don't overlap, or, if they do, all the overridden char property values are exactly the same.
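To make the last assumption concrete, here is a minimal sketch of what "parallel assignments do not conflict" could mean in code. Everything in it (`is_private_use`, `AssignError`, `merge_assignments`) is a hypothetical name for illustration, not an existing UNIC API. The only hard fact it relies on is the three Private Use ranges defined by Unicode.

```rust
use std::collections::BTreeMap;

/// True for code points in the three Unicode Private Use ranges.
pub fn is_private_use(c: char) -> bool {
    matches!(
        c,
        '\u{E000}'..='\u{F8FF}' | '\u{F0000}'..='\u{FFFFD}' | '\u{100000}'..='\u{10FFFD}'
    )
}

/// Why a batch of assignments was rejected (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub enum AssignError {
    /// The code point is not a Private Use character.
    NotPrivateUse(char),
    /// The code point already has a different value assigned.
    Conflict(char),
}

/// Merge `incoming` into `existing` (hypothetical helper).
/// Overlapping code points are accepted only when the values are identical.
pub fn merge_assignments<V: PartialEq + Clone>(
    existing: &mut BTreeMap<char, V>,
    incoming: &[(char, V)],
) -> Result<(), AssignError> {
    // Work on a copy so a rejected batch leaves `existing` untouched.
    let mut staged = existing.clone();
    for (c, v) in incoming {
        if !is_private_use(*c) {
            return Err(AssignError::NotPrivateUse(*c));
        }
        if let Some(old) = staged.get(c) {
            if old != v {
                return Err(AssignError::Conflict(*c));
            }
        }
        staged.insert(*c, v.clone());
    }
    *existing = staged;
    Ok(())
}
```

Re-registering the same value for the same code point is a no-op here, so two libraries that agree on an assignment can both declare it.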

I think this is one of those areas that would require cutting-edge features of rustc. We need to investigate implementation options further.

In addition:

  • We may also want to provide a query method for PUAs, at the same level as the definition. In other words, if we use compiler plugins to assign PUAs, we should provide a compile-time query method for the current state of PUA assignments.

  • We need to make sure any sensitive area, like Security Mechanisms, blocks any PUA at its own boundary. I believe parts of the specs already cover this, but we need to double-check.
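As a rough illustration of "blocking PUAs at the boundary", a sensitive component could reject any input containing a Private Use code point before doing anything else, regardless of overrides. The names `reject_private_use` and `PrivateUseRejected` are made up for this sketch. The range check is written as a `const fn` so it can also be evaluated at compile time, though that only covers the fixed ranges, not the state of any assignments.

```rust
/// True for code points in the three Unicode Private Use ranges.
/// `const` so that it can also be used in constant expressions.
pub const fn is_private_use(c: char) -> bool {
    let cp = c as u32;
    (cp >= 0xE000 && cp <= 0xF8FF)
        || (cp >= 0xF0000 && cp <= 0xFFFFD)
        || (cp >= 0x100000 && cp <= 0x10FFFD)
}

/// Reports the first Private Use character found (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub struct PrivateUseRejected {
    /// Byte offset of the offending character in the input.
    pub index: usize,
    pub ch: char,
}

/// Hypothetical boundary check: pass the input through unchanged,
/// or fail on the first Private Use character.
pub fn reject_private_use(input: &str) -> Result<&str, PrivateUseRejected> {
    match input.char_indices().find(|&(_, c)| is_private_use(c)) {
        Some((index, ch)) => Err(PrivateUseRejected { index, ch }),
        None => Ok(input),
    }
}
```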

behnam avatar Jul 26 '17 17:07 behnam

Boring, touches-every-table, would-work-now solution:

  • New crate, unic-pua-override
  • All crates have an optional feature/dependency on unic-pua-override
    • This is extended to any UNIC dependencies of a UNIC crate
  • unic-pua-override has a static RwLock<Map<char, Property>> for every property
  • An API is exposed to insert mappings for Private Use characters, which enforces the invariants in the OP in some manner
  • The main thread is expected to initialize unic-pua-override with any overrides before querying properties
  • Any time a table is checked, it first checks unic-pua-override for overrides

Once/If we move to using a consistent "Table object", the override check could be localized to one location.
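A minimal Rust sketch of the list above, for a single property. This is an assumption-laden illustration rather than UNIC code: the `GeneralCategory` enum is a cut-down stand-in, `table_general_category` stands in for the generated table lookup, and `override_general_category` is a hypothetical API name. It uses `std::sync::OnceLock` to get the lazily initialized static.

```rust
use std::collections::HashMap;
use std::sync::{OnceLock, RwLock};

/// Cut-down stand-in for a property value type (not UNIC's real type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GeneralCategory {
    UppercaseLetter,
    OtherSymbol,
    PrivateUse,
    /// Everything this sketch does not model.
    Other,
}

/// One process-wide override map per property.
static GC_OVERRIDES: OnceLock<RwLock<HashMap<char, GeneralCategory>>> = OnceLock::new();

fn gc_overrides() -> &'static RwLock<HashMap<char, GeneralCategory>> {
    GC_OVERRIDES.get_or_init(|| RwLock::new(HashMap::new()))
}

fn is_private_use(c: char) -> bool {
    matches!(
        c,
        '\u{E000}'..='\u{F8FF}' | '\u{F0000}'..='\u{FFFFD}' | '\u{100000}'..='\u{10FFFD}'
    )
}

/// Register an override (hypothetical API). Returns `false` and changes
/// nothing for non-PUA code points or for a conflicting earlier override.
pub fn override_general_category(c: char, gc: GeneralCategory) -> bool {
    if !is_private_use(c) {
        return false;
    }
    let mut map = gc_overrides().write().unwrap();
    if let Some(old) = map.get(&c) {
        if *old != gc {
            return false;
        }
    }
    map.insert(c, gc);
    true
}

/// Stand-in for the generated table lookup.
fn table_general_category(c: char) -> GeneralCategory {
    if is_private_use(c) {
        GeneralCategory::PrivateUse
    } else {
        GeneralCategory::Other
    }
}

/// Public query: check the override map first, then fall back to the table.
pub fn general_category(c: char) -> GeneralCategory {
    if is_private_use(c) {
        if let Some(gc) = gc_overrides().read().unwrap().get(&c) {
            return *gc;
        }
    }
    table_general_category(c)
}
```

Only Private Use code points ever take the read lock, so lookups for ordinary characters go straight to the table.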

CAD97 avatar Aug 14 '17 01:08 CAD97

Thanks, @CAD97, for writing it up. Right, basically this is the simple but more verbose solution. I also want to look into a procedural-macro-based solution as a way to eliminate some of the complexity here, such as expecting every binary to initialize UNIC (or to call some other library that initializes UNIC).

This can become a problem in the wild, because of something like this:

  • Package A calls UNIC initialization because it has some PUA defined.
  • Package B uses lib A and calls its init, plus defining some PUA itself.
  • Later, package A drops the UNIC init because it doesn't need it anymore.
  • Package B breaks: its own PUA definitions stop working, for no apparent reason.

So, hopefully, we can actually enforce initialization at the same time as the PUA definitions, in a declarative form. (Not passing in all the PUA descriptions at the init call site itself.)

behnam avatar Aug 14 '17 02:08 behnam

That doesn't have to cause breakage.

Sketch:

unic-pua-override

override_names = LazyStatic RwLock Map<Char, Name> Map::new

put_name_override char name = do override_names[char] <- name
get_name_override char = override_names[char]

unic-ucd-name

lookup_name_table char = // table lookup

get_name char = 
  (unic-pua-override::get_name_override char) or (lookup_name_table char)

library

register_private_use_characters = do
  unic-pua-override::put_name_override '\u{PRIV}' "Super Secret"

// other useful stuff

binary

main = do
  library::register_private_use_characters
  unic-pua-override::put_name_override '\u{VIRP}' "terceS repuS"
  // application logic
  unic-ucd-name::get_name '\u{VIRP}' == "terceS repuS"
  unic-ucd-name::get_name '\u{PRIV}' == "Super Secret"
    // or None if library noops register_private_use_characters

(I think that syntax is Haskell-inspired? 🤷‍♂️)
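For reference, roughly the same sketch in Rust, with the four crates modelled as modules in one file. This is an approximation, not a proposed implementation: `'\u{E000}'` and `'\u{E001}'` are arbitrary stand-ins for the `'\u{PRIV}'` and `'\u{VIRP}'` placeholders, `table_lookup` stands in for the real name table, and no invariant checking is done on insert.

```rust
mod pua_override {
    use std::collections::HashMap;
    use std::sync::{OnceLock, RwLock};

    // Lazily initialized, process-wide map of name overrides.
    static NAMES: OnceLock<RwLock<HashMap<char, &'static str>>> = OnceLock::new();

    fn names() -> &'static RwLock<HashMap<char, &'static str>> {
        NAMES.get_or_init(|| RwLock::new(HashMap::new()))
    }

    pub fn put_name_override(c: char, name: &'static str) {
        names().write().unwrap().insert(c, name);
    }

    pub fn get_name_override(c: char) -> Option<&'static str> {
        names().read().unwrap().get(&c).copied()
    }
}

mod ucd_name {
    // Stand-in for the real table lookup; one entry for illustration.
    fn table_lookup(c: char) -> Option<&'static str> {
        match c {
            'A' => Some("LATIN CAPITAL LETTER A"),
            _ => None,
        }
    }

    // Overrides win; otherwise fall back to the table.
    pub fn get_name(c: char) -> Option<&'static str> {
        super::pua_override::get_name_override(c).or_else(|| table_lookup(c))
    }
}

mod library {
    pub fn register_private_use_characters() {
        super::pua_override::put_name_override('\u{E000}', "Super Secret");
    }
}

mod binary {
    pub fn main() {
        super::library::register_private_use_characters();
        super::pua_override::put_name_override('\u{E001}', "terceS repuS");
        // application logic
        assert_eq!(super::ucd_name::get_name('\u{E001}'), Some("terceS repuS"));
        // Would be None if `library` turned its registration into a no-op.
        assert_eq!(super::ucd_name::get_name('\u{E000}'), Some("Super Secret"));
    }
}
```

The binary's own override does not depend on whether the library registers anything, since both write to the same map independently.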

CAD97 avatar Aug 14 '17 02:08 CAD97