medley Discussion: Permanent extension of the "Medley Character Set Standard"

Our current internal character encoding standard (MCCS) is basically XCCS with a few fiddles (dollar sign and left-arrow, which we haven't really decided on but which would be handled by an explicit XCCS-to-MCCS mapping). Importantly, MCCS codes correspond to the layout of glyphs in our fonts. This is an initial proposal for how to extend MCCS to produce MCCS-Unicode mappings for a large number of additional Unicode characters, so that we can use Unicode fonts to eliminate all of our black boxes.

First some arithmetic:

MCCS/Unicode mappings are based on the mappings for 10730 XCCS codes in 105 character sets. Let's assume that the XCCS standard does not define any other XCCS codes that would be relevant to MCCS, i.e. we would never encounter XCCS-coded files that have standard codes outside of the mappings that we currently have.

That leaves 54805 smallp's that can be assigned to other Unicode characters.

But XCCS/MCCS does not allow code 255 in any of the 256 character sets: 54805 - 256 = 54549. (Actually, we don't need to preserve this constraint in internal MCCS, but that is a separate issue.)

Codes in the MCCS META and FUNCTION character sets are reserved: 54549 - 2*255 = 54039 (255 per set rather than 256, since we have already taken out the 255 codes). (We should map them to MCCS codes that are undefined in both XCCS and Unicode so that we don't have to deal with them again if we ever go to Unicode internally.)

We want to reserve some number of MCCS codes for local faking of unmapped codes, say 4 character sets = 4*255 = 1020: 54039 - 4*255 = 53019

Thus we have 53019 smallp MCCS codes that can be assigned to otherwise unassigned Unicode characters.

(Side note: the maximum size of an Interlisp hash table is 32749. So we have to divide codes into at least 2 separate hash buckets.)

The maximum number of Unicode plane 0 codes is 65535 - 6400 (reserved) = 59135 (maybe some others are not defined in the Unicode standard, I didn't check).

So we can't have smallp MCCS codes for 6116 smallp Unicodes (less whatever we want to allow from higher ups--emojis....)
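As a cross-check, the arithmetic above can be replayed in a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and it reproduces the figures as stated in this post without checking them against the actual mapping files.

```python
# Replays the arithmetic above; every number comes from the post itself.
# All names here are illustrative, not part of Medley.
SMALLP_CODES = 65535     # 16-bit code space, as counted in the post
XCCS_MAPPED = 10730      # XCCS codes that already have Unicode mappings
CHARACTER_SETS = 256     # code 255 is disallowed once in each character set
CODES_PER_SET = 255      # usable codes per set after removing code 255
RESERVED_SETS = 2        # META and FUNCTION
FAKING_SETS = 4          # held back for local faking of unmapped codes
PRIVATE_USE = 6400       # U+E000..U+F8FF


def mccs_budget():
    """Free smallp MCCS codes left for otherwise unassigned Unicode characters."""
    free = SMALLP_CODES - XCCS_MAPPED          # 54805
    free -= CHARACTER_SETS                     # drop code 255 in every set
    free -= RESERVED_SETS * CODES_PER_SET      # META and FUNCTION
    free -= FAKING_SETS * CODES_PER_SET        # local faking
    return free


def bmp_candidates():
    """Upper bound on plane 0 characters, as estimated in the post."""
    return SMALLP_CODES - PRIVATE_USE


budget = mccs_budget()
shortfall = bmp_candidates() - budget
print(budget, bmp_candidates(), shortfall)
```

Changing FAKING_SETS or the reserved count shows directly how much headroom each decision costs.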

A simple strategy for permanent extension to MCCS:

Make a list of character sets (or specific characters) in Unicode that we don't care about and another list of available MCCS codes (as calculated above). 

Then for all defined Unicode characters U from 0 to 65535 (plus others beyond that that we care about):

	Unless (UTOXCODE? U),  assign the next available MCCS code to U.

	For all the assignments thus constructed, write out MCCS-to-Unicode mapping files for the new/changed character sets.
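A minimal Python sketch of the simple strategy, just to pin down the loop. The names are hypothetical: `already_mapped` stands in for the (UTOXCODE? U) test, `ignored` is the "don't care" list, and `free_mccs_codes` is the list of available codes. Writing out the mapping files is left out.

```python
def assign_codes(unicode_points, already_mapped, free_mccs_codes, ignored=frozenset()):
    """Give each Unicode point that has no XCCS/MCCS code the next free MCCS code.

    Returns a dict {unicode_point: mccs_code}. Assignment stops quietly
    when the free codes run out.
    """
    free = iter(free_mccs_codes)
    assignments = {}
    for u in unicode_points:
        # Skip characters we already map (the UTOXCODE? test) or don't care about.
        if u in already_mapped or u in ignored:
            continue
        code = next(free, None)
        if code is None:
            break  # no free MCCS codes left
        assignments[u] = code
    return assignments


# Toy data, not real mappings.
toy = assign_codes(
    unicode_points=range(0x41, 0x47),
    already_mapped={0x41, 0x42},
    free_mccs_codes=[0x2101, 0x2102, 0x2103, 0x2104],
    ignored={0x45},
)
print({hex(u): hex(m) for u, m in toy.items()})
```

Because the assignment depends only on the iteration order and the two lists, rerunning it with the same inputs yields the same mapping, which is what a permanent extension needs.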

A more sophisticated strategy, as Matt has suggested, would be to try to assign all the UNICODE characters in a given character set to MCCS codes also in a single character set, as long as a completely free character set is available. Unicode character sets would get dispersed only after contiguous codes have been exhausted.
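One way the set-preserving variant might look, again as a hedged Python sketch. It assumes an MCCS code is the character-set number in the high byte and the code within the set in the low byte (with 255 excluded), and that the Unicode characters are already grouped into named blocks. The function and parameter names are hypothetical.

```python
def assign_by_block(blocks, free_sets, spill_codes, codes_per_set=255):
    """Keep each Unicode block inside completely free MCCS character sets.

    blocks:      {block_name: [unicode points still needing a code]}
    free_sets:   numbers of character sets that are entirely unassigned
    spill_codes: individual free codes, used once no whole set is left
    """
    free_sets = list(free_sets)
    spill = iter(spill_codes)
    out = {}
    for points in blocks.values():
        needed = -(-len(points) // codes_per_set)  # ceiling division
        if points and needed <= len(free_sets):
            sets = [free_sets.pop(0) for _ in range(needed)]
            for i, u in enumerate(points):
                charset = sets[i // codes_per_set]
                # Assumed layout: high byte = character set, low byte = 0..254.
                out[u] = (charset << 8) | (i % codes_per_set)
        else:
            # No whole set available: disperse into leftover codes.
            for u in points:
                code = next(spill, None)
                if code is not None:
                    out[u] = code
    return out


# Toy data: one free set, so block "A" stays together and "B" is dispersed.
demo = assign_by_block(
    {"A": [0x100, 0x101, 0x102], "B": [0x200, 0x201]},
    free_sets=[0x50],
    spill_codes=[0x6001, 0x6002],
)
print({hex(u): hex(m) for u, m in demo.items()})
```

Processing the larger or more important blocks first would decide which ones get to stay contiguous; that ordering is a policy choice the sketch does not make.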

We can then use this permanent extension to MCCS to map the glyphs from Unicode fonts into our internal MCCS-ordered font character sets, a la the work that Matt has been doing.

Feb 20 '25 21:02 rmkaplan

(I included tabs above, didn't know that git would fiddle the formatting)

Feb 20 '25 21:02 rmkaplan

The maximum number of Unicode plane 0 codes is 65535 - 6400 (reserved) = 59135 (maybe some others are not defined in the Unicode standard, I didn't check).

The Unicode code points U+D800 through U+DFFF are reserved as the high-half and low-half surrogates needed for UTF-16 (see Wikipedia, UTF-16) and will never be assigned a character (2048 code points). However, Unicode explicitly "says that no UTF forms, including UTF-16, can encode the surrogate code points." (i.e., no unpaired surrogates)

Also, U+E000 through U+F8FF is the Private Use Area (6400 code points). The Unicode standard will never assign characters there, so there should never be a standard Unicode character in that range. These code points never need to be permanently assigned to code points in our MCCS.

Again, from Wikipedia Unicode: "A small set of code points are guaranteed never to be assigned to characters, although third-parties may make independent use of them at their discretion. There are 66 of these noncharacters: U+FDD0–U+FDEF and the last two code points in each of the 17 planes (e.g. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined. Like surrogates, the rule that these cannot be used is often ignored, although the operation of the byte order mark assumes that U+FFFE will never be the first code point in a text." (34 code points in plane 0)

If Medley does not, and will never, support right-to-left languages, then there are more character blocks that could be removed. However, Hebrew is defined in XCCS, so to leave open the possibility of adding right-to-left eventually, I recommend that we do not explicitly remove those character blocks. (Users could enter those characters and manually arrange them right-to-left.)

So there are at least 6400+2048+34 = 8482 characters from the BMP that we do not need to assign into MCCS. Would we want to add to MCCS the concept of the character modifiers as used for emojis and some national flags, for example? (I guess not.)
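The three excluded ranges can be written down directly and counted. This is a small Python sketch using only the ranges quoted above; the names are illustrative.

```python
# Plane 0 code points that never need a permanent MCCS assignment,
# using only the ranges quoted above.
SURROGATES = range(0xD800, 0xE000)      # U+D800..U+DFFF
PRIVATE_USE = range(0xE000, 0xF900)     # U+E000..U+F8FF
NONCHARACTERS = list(range(0xFDD0, 0xFDF0)) + [0xFFFE, 0xFFFF]


def bmp_excluded():
    """Set of BMP code points that can be skipped when assigning MCCS codes."""
    return set(SURROGATES) | set(PRIVATE_USE) | set(NONCHARACTERS)


excluded = bmp_excluded()
print(len(SURROGATES), len(PRIVATE_USE), len(NONCHARACTERS), len(excluded))
```

A set like this could serve as the "don't care" list when generating the permanent assignments.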

Feb 21 '25 00:02 MattHeffron

A more sophisticated strategy, as Matt has suggested, would be to try to assign all the UNICODE characters in a given character set to MCCS codes also in a single character set, as long as a completely free character set is available.

My initial suggestion was a subset of this: add characters from Unicode to the same MCCS character set where related characters are already present (e.g., Greek in set 046; only 105 characters are currently assigned there in XCCS). Extending the idea as you presented makes sense.

Feb 21 '25 00:02 MattHeffron

Would we want to add to MCCS the concept of the character modifiers as used for emojis and some national flags, for example? (I guess not.)

I think we're more likely to see emoji with modifiers in the Unicode text we might process than we are to see many other characters we're mapping.

But I'm not sure that this is at the right level. The handling of modifier characters is more like the handling of (unnormalized) accented characters isn't it? When you have a sequence of characters, the transformation to a sequence of glyphs from fonts is not 1-1. This is an issue for the fonts and not the character codes.
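To illustrate the comparison, here is a small Python sketch using the standard `unicodedata` module. An unnormalized accented letter is two code points that NFC normalization composes into one, while an emoji followed by a skin-tone modifier stays two code points even after normalization, although a font would normally draw it as a single glyph. The variable names are only for illustration.

```python
import unicodedata


def code_points(text):
    """List the code points of a string in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]


# "e" followed by COMBINING ACUTE ACCENT: two code points, one glyph.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# THUMBS UP SIGN followed by a skin-tone modifier: also two code points,
# and normalization does not merge them.
emoji_with_modifier = "\U0001F44D\U0001F3FD"
normalized_emoji = unicodedata.normalize("NFC", emoji_with_modifier)

print(code_points(decomposed), code_points(composed))
print(code_points(emoji_with_modifier), code_points(normalized_emoji))
```

So a per-character code mapping can cover the accented case by normalizing first, but modifier sequences would have to be resolved wherever character sequences are turned into glyphs.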

Feb 21 '25 23:02 masinter

A reference in preparation in the IETF is https://www.ietf.org/archive/id/draft-bray-unichars-11.html#

for some background about unicode subsets.

Mar 04 '25 20:03 masinter

The link seems broken. I get a 404.

On Tue, Mar 4, 2025, 12:46 PM Larry Masinter @.***> wrote:

A reference in preparation in the IETF is https://www.ietf.org/archive/id/draft-bray-unichar for some background about unicode subsets.

masinter left a comment (Interlisp/medley#2040): https://github.com/Interlisp/medley/issues/2040#issuecomment-2698863568

Mar 04 '25 21:03 MattHeffron

Sometimes I got a 404 too, but now the website works most of the time. Try removing the # from the end of the URL.

Mar 04 '25 21:03 pamoroso

The link as it appears in the email notification is what is broken; it's incomplete. It worked when I used the link from here (minus the #).

Mar 05 '25 00:03 MattHeffron

It used to be that you could shorten the URL for Internet Drafts by removing the file type and version number. That didn't work, so I edited the message in the GitHub issue (but of course I can't update the email I'd already sent).

Mar 05 '25 17:03 masinter

The handling of modifier characters is more like the handling of (unnormalized) accented characters isn't it?

The point of my asking about modifier characters was to make sure that they are included in the planning for MCCS. For example, which versions of modifiable emojis we'd support (male/female variation of a person emoji, we probably can skip color selections), and how we'd do it.

Mar 06 '25 05:03 MattHeffron
