unicodetools
Long lines in new ScriptExtensions.txt cause the space after # to disappear
Instead of "# Po MIDDLE DOT", the first data line in https://github.com/unicode-org/unicodetools/blob/main/unicodetools/data/ucd/dev/ScriptExtensions.txt reads "#Po MIDDLE DOT". I see no reason for this space to be dropped if the data is too long, while it's kept for lines with shorter data.
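For anyone who wants to list the affected lines in a generated file, here is a minimal Python sketch. It is not part of unicodetools; the function name `lines_missing_space` is hypothetical, and it assumes the usual UCD layout of data fields followed by a trailing `#` comment. The sample lines are taken from DerivedNormalizationProps.txt.

```python
def lines_missing_space(text):
    """Return (line_number, line) pairs for data lines whose trailing
    comment has no whitespace directly after the '#'."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        data, sep, comment = line.partition("#")
        # Skip lines without a comment and pure comment lines (no data before '#').
        if not sep or not data.strip():
            continue
        # Flag the line when the character right after '#' is not whitespace.
        if comment and not comment[0].isspace():
            hits.append((number, line))
    return hits


# Two sample lines: the first keeps the space, the second drops it.
sample = (
    "FDF1 ; NFKC_CF; 0642 0644 06D2 # Lo ARABIC LIGATURE QALA USED AS KORANIC STOP SIGN ISOLATED FORM\n"
    "FDF2 ; NFKC_CF; 0627 0644 0644 0647 #Lo ARABIC LIGATURE ALLAH ISOLATED FORM\n"
)

for number, line in lines_missing_space(sample):
    print(number, line)
```

Header and comment-only lines are skipped on purpose, so only data lines with a squeezed comment are reported.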
I had noticed this. It is weird, but it is consistent with what we do elsewhere; note, in DerivedNormalizationProps.txt:
FDF1 ; NFKC_CF; 0642 0644 06D2 # Lo ARABIC LIGATURE QALA USED AS KORANIC STOP SIGN ISOLATED FORM
FDF2 ; NFKC_CF; 0627 0644 0644 0647 #Lo ARABIC LIGATURE ALLAH ISOLATED FORM
FDF3 ; NFKC_CF; 0627 0643 0628 0631 #Lo ARABIC LIGATURE AKBAR ISOLATED FORM
FDF4 ; NFKC_CF; 0645 062D 0645 062F #Lo ARABIC LIGATURE MOHAMMAD ISOLATED FORM
FDF5 ; NFKC_CF; 0635 0644 0639 0645 #Lo ARABIC LIGATURE SALAM ISOLATED FORM
FDF6 ; NFKC_CF; 0631 0633 0648 0644 #Lo ARABIC LIGATURE RASOUL ISOLATED FORM
FDF7 ; NFKC_CF; 0639 0644 064A 0647 #Lo ARABIC LIGATURE ALAYHE ISOLATED FORM
FDF8 ; NFKC_CF; 0648 0633 0644 0645 #Lo ARABIC LIGATURE WASALLAM ISOLATED FORM
FDF9 ; NFKC_CF; 0635 0644 0649 # Lo ARABIC LIGATURE SALLA ISOLATED FORM
and it seems to be intentional, see this comment: https://github.com/unicode-org/unicodetools/blob/6f0c77d0d2b167a67ac54a9083db9a97b2882d82/unicodetools/src/main/java/org/unicode/props/BagFormatter.java#L519-L523
I have no idea what the intention is though. The commit that added that comment is https://github.com/unicode-org/icu/commit/cd418afef7899df376758301889b583ac9b8f849, its message is not particularly illuminating, and neither is ICU-6106. @macchiati, do you remember what you were thinking 16 years ago?
Let's stop doing this.