can we coalesce quotation mark CE lists into single CEs?

Open markusicu opened this issue 1 year ago • 4 comments

I remember that I added some logic for the CLDR version of the default sort order to coalesce some adjacent CEs, absorbing ignorable CEs into their main CEs. Look for how that works for things like sharp s, or generally look for existing differences between allkeys_CLDR.txt and allkeys_DUCET.txt, and see whether we can turn this

0027  ; [*0337.0020.0002] # APOSTROPHE
FF07  ; [*0337.0020.0003] # FULLWIDTH APOSTROPHE
2018  ; [*0337.0020.0004][.0000.011E.0004] # LEFT SINGLE QUOTATION MARK
2019  ; [*0337.0020.0004][.0000.011F.0004] # RIGHT SINGLE QUOTATION MARK

into something like this

0027  ; [*0337.0020.0002] # APOSTROPHE
FF07  ; [*0337.0020.0003] # FULLWIDTH APOSTROPHE
2018  ; [*0337.0021.0002] # LEFT SINGLE QUOTATION MARK
2019  ; [*0337.0022.0002] # RIGHT SINGLE QUOTATION MARK
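
As a rough illustration of the rewrite above, here is a minimal Java sketch. It is not the Unicode Tools logic: the class and method names, the renumbering rule (new secondaries assigned in ascending order right after the common secondary 0020), and the reset of the tertiary to 0002 are assumptions read off the two listings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; each CE is {primary, secondary, tertiary}.
class CoalesceSketch {
    static final int COMMON_SECONDARY = 0x0020;
    static final int COMMON_TERTIARY = 0x0002;

    /** True for [primary CE with common secondary][primary-ignorable CE with a secondary weight]. */
    static boolean isMergeable(int[][] ces) {
        return ces.length == 2
                && ces[0][0] != 0 && ces[0][1] == COMMON_SECONDARY
                && ces[1][0] == 0 && ces[1][1] != 0;
    }

    /**
     * Rewrites mergeable two-CE lists as single CEs.
     * Assumed rule: distinct trailing secondaries are renumbered in ascending order
     * as COMMON_SECONDARY + 1, + 2, ... and the tertiary is reset to common.
     * Assumes all mappings share one primary and that the new secondaries do not
     * collide with secondaries already in use under that primary.
     */
    static List<int[][]> coalesce(List<int[][]> mappings) {
        TreeSet<Integer> oldSecondaries = new TreeSet<>();
        for (int[][] ces : mappings) {
            if (isMergeable(ces)) {
                oldSecondaries.add(ces[1][1]);
            }
        }
        Map<Integer, Integer> renumbered = new HashMap<>();
        int next = COMMON_SECONDARY + 1;
        for (int old : oldSecondaries) {
            renumbered.put(old, next++);
        }
        List<int[][]> result = new ArrayList<>();
        for (int[][] ces : mappings) {
            if (isMergeable(ces)) {
                result.add(new int[][] {{ces[0][0], renumbered.get(ces[1][1]), COMMON_TERTIARY}});
            } else {
                result.add(ces); // left unchanged
            }
        }
        return result;
    }

    /** Formats a CE list like "[0337.0021.0002]" (the variable marker is not modeled). */
    static String format(int[][] ces) {
        StringBuilder sb = new StringBuilder();
        for (int[] ce : ces) {
            sb.append(String.format("[%04X.%04X.%04X]", ce[0], ce[1], ce[2]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[][]> mappings = new ArrayList<>();
        mappings.add(new int[][] {{0x0337, 0x0020, 0x0002}});                           // 0027
        mappings.add(new int[][] {{0x0337, 0x0020, 0x0003}});                           // FF07
        mappings.add(new int[][] {{0x0337, 0x0020, 0x0004}, {0x0000, 0x011E, 0x0004}}); // 2018
        mappings.add(new int[][] {{0x0337, 0x0020, 0x0004}, {0x0000, 0x011F, 0x0004}}); // 2019
        for (int[][] ces : coalesce(mappings)) {
            System.out.println(format(ces));
        }
    }
}
```

A real implementation would also have to check that the new secondary weights keep the table well-formed, which this sketch does not attempt.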

I found the coalescing code; I had misremembered where I put it. It is in MappingsForFractionalUCA.java, in modifyMappings(), under the comment "// Check and merge secondary CEs."

It does not modify the "UCA" mappings. It only modifies intermediate mappings that turn into FractionalUCA.txt mappings. I verified that allkeys_CLDR.txt and allkeys_DUCET.txt have the same number of non-initial ignorable CEs. And FractionalUCA.txt shows the merged byte-based CEs:

0027; [09 6E, 05, 05]	# Zyyy Po	[0337.0020.0002]	* APOSTROPHE
FF07; [09 6E, 05, 20]	# Zyyy Po	[0337.0020.0003]	* FULLWIDTH APOSTROPHE
2018; [09 6E, 70, 05]	# Zyyy Pi	[0337.0020.0004][0000.011E.0004]	* LEFT SINGLE QUOTATION MARK
2019; [09 6E, 73, 05]	# Zyyy Pf	[0337.0020.0004][0000.011F.0004]	* RIGHT SINGLE QUOTATION MARK
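
The count of non-initial ignorable CEs mentioned above could be reproduced with something like the following Java sketch. It assumes that "ignorable" here means a CE whose primary weight is 0000, and that the input lines use the allkeys syntax shown earlier in this issue. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Unicode Tools.
class IgnorableCeCount {
    // One CE in allkeys syntax, e.g. [*0337.0020.0004] or [.0000.011E.0004].
    // Group 1 captures the primary weight.
    private static final Pattern CE = Pattern.compile(
            "\\[[.*]([0-9A-Fa-f]{4,6})\\.[0-9A-Fa-f]{4}\\.[0-9A-Fa-f]{4}\\]");

    /** Counts CEs with a zero primary weight that are not the first CE of their mapping. */
    static int countNonInitialIgnorable(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            int hash = line.indexOf('#');
            String data = hash >= 0 ? line.substring(0, hash) : line;
            int semi = data.indexOf(';');
            if (semi < 0) {
                continue; // blank line, comment line, or @-directive
            }
            Matcher m = CE.matcher(data.substring(semi + 1));
            for (int index = 0; m.find(); index++) {
                boolean primaryIgnorable = Integer.parseInt(m.group(1), 16) == 0;
                if (index > 0 && primaryIgnorable) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "0027  ; [*0337.0020.0002] # APOSTROPHE",
                "2018  ; [*0337.0020.0004][.0000.011E.0004] # LEFT SINGLE QUOTATION MARK");
        System.out.println(countNonInitialIgnorable(sample)); // prints 1
    }
}
```

Running the same method over the lines of allkeys_CLDR.txt and allkeys_DUCET.txt (for example, read with java.nio.file.Files.readAllLines) and comparing the two totals would repeat the check.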

The code includes comments about the modified mappings not being well-formed. It should be possible to make them well-formed, since the resulting FractionalUCA mappings are well-formed.

If we wanted to, we could then try to move this logic up one or two levels:

  1. up into the "UCA" object and its mappings, and thus visible in allkeys_CLDR.txt and allkeys_DUCET.txt
  2. further up into the C sifter code

Either way, the FractionalUCA generator would need to be adjusted to work with non-ignorable CEs that have non-default secondary weights.

markusicu, Aug 26 '24 22:08

Looks reasonable. From what you wrote here, it looks like there aren't any characters in the second case between FF07 and 2018. Is that still true with your change?

macchiati, Aug 26 '24 23:08

This is the case according to the allkeys_CLDR.txt file which is in sorted order. I have to remind myself what the code looks like that I thought would do this, and see what's different from a case like sharp s. Anyway, this is just a drive-by thought that I wanted to jot down. The real work for today is https://github.com/unicode-org/unicodetools/pull/926 :-)

markusicu, Aug 26 '24 23:08

I thought it was sorted by shifted values ... not a real sort. Although in this instance maybe that doesn't matter.

macchiati, Aug 27 '24 04:08

> I thought it was sorted by shifted values ... not a real sort.

The real UCA allkeys.txt is sorted with something like alternate=shifted. I am not sure that is completely true; I think it might sort with strength=tertiary, which drops the shifted primaries and makes ignorable characters come out in a somewhat random order.

The allkeys_CLDR.txt and allkeys_DUCET.txt that the Unicode Tools generate are sorted with alternate=non-ignorable.
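
To make the difference concrete, here is a small Java sketch of three-level sort keys under the two settings. It follows the general UCA rule that, with alternate=shifted, variable CEs and any primary-ignorable CEs directly after them get zero weights on levels 1 to 3. Whether allkeys.txt is really sorted this way is only the guess stated above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; each CE is {variable (0 or 1), primary, secondary, tertiary}.
class AlternateSketch {
    /** Builds a strength=tertiary sort key; the fourth (shifted) level is deliberately not modeled. */
    static List<Integer> sortKey(int[][] ces, boolean shifted) {
        List<int[]> kept = new ArrayList<>();
        boolean afterVariable = false;
        for (int[] ce : ces) {
            boolean variable = ce[0] == 1;
            boolean primaryIgnorable = ce[1] == 0;
            if (shifted && (variable || (primaryIgnorable && afterVariable))) {
                afterVariable = true;
                continue; // levels 1 to 3 become zero, so nothing is appended
            }
            afterVariable = false;
            kept.add(ce);
        }
        List<Integer> key = new ArrayList<>();
        for (int level = 1; level <= 3; level++) {
            if (level > 1) {
                key.add(0); // level separator
            }
            for (int[] ce : kept) {
                if (ce[level] != 0) {
                    key.add(ce[level]);
                }
            }
        }
        return key;
    }

    /** Lexicographic comparison of two sort keys. */
    static int compare(List<Integer> a, List<Integer> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a.get(i), b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.size(), b.size());
    }

    public static void main(String[] args) {
        int[][] apostrophe = {{1, 0x0337, 0x0020, 0x0002}};
        int[][] leftQuote = {{1, 0x0337, 0x0020, 0x0004}, {0, 0x0000, 0x011E, 0x0004}};
        // non-ignorable: the keys differ, so the order is well defined
        System.out.println(compare(sortKey(apostrophe, false), sortKey(leftQuote, false)) < 0);
        // shifted at tertiary strength: both keys collapse to the same value
        System.out.println(sortKey(apostrophe, true).equals(sortKey(leftQuote, true)));
    }
}
```

Under these assumptions both lines print true: with non-ignorable keys the apostrophe sorts before the left quotation mark, while the shifted tertiary keys are identical, which would leave the relative order of such characters to the sort's tie-breaking.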


FYI: I found the coalescing code, and I amended the issue description above a few minutes ago.

markusicu, Aug 27 '24 04:08