
VersionedProperty is garbled when reading some old files

Open eggrobin opened this issue 2 years ago • 2 comments

Spotted while working on #432: character.jsp reported the following history for U+0041:

Identifier_Type 9.0:Not_Character 10.0:Recommended

This seems wrong, and indeed it is: U+0041 should be Recommended in both versions. What changed between 9.0 and 10.0 is the default value, from Recommended to Not_Character.
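To illustrate the failure mode, here is a minimal Python sketch (not the unicodetools code, which is Java) of a lookup that picks the default from the version being read instead of always using the latest one. The function names and the `explicit_values` dict are hypothetical; the only fact taken from this issue is that the default is Recommended up to 9.0 and Not_Character from 10.0.

```python
def parse_version(text):
    """Turn a version string such as '9.0' into a comparable tuple (9, 0)."""
    return tuple(int(part) for part in text.split("."))


# Per the issue text: the default changed between 9.0 and 10.0.
DEFAULT_CHANGED_IN = parse_version("10.0")


def default_identifier_type(version):
    """Default Identifier_Type for code points not listed in that version's file."""
    if parse_version(version) < DEFAULT_CHANGED_IN:
        return "Recommended"
    return "Not_Character"


def identifier_type(explicit_values, code_point, version):
    """Look up a value, falling back to the default of the requested version.

    explicit_values is a hypothetical dict {code point: value} holding only
    the assignments explicitly listed in the data file for that version.
    """
    return explicit_values.get(code_point, default_identifier_type(version))
```

With this shape, a code point that is absent from a 9.0-era file resolves to Recommended, while the same absence in a 10.0-era file resolves to Not_Character. Applying the 10.0 default to 9.0 data would produce the kind of garbled output shown above.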

Likewise for U+4E00:

Block 2.0..3.0: 9FFF 3.1..15.1: CJK_Unified_Ideographs

(The format of Blocks.txt changed after 3.0, from first; last; <block> to first..last; <block>.)
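A format-tolerant reader could look roughly like the Python sketch below. `parse_block_line` is a hypothetical helper, not something in unicodetools, and the sample lines are written by hand from the two formats described above rather than copied from the actual data files.

```python
def parse_block_line(line):
    """Parse one Blocks.txt line in either the old or the new format.

    Old (up to 3.0):  first; last; <block>
    New (after 3.0):  first..last; <block>
    Returns (first, last, block_name), or None for blank and comment lines.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    fields = [field.strip() for field in line.split(";")]
    if len(fields) == 3:
        # Old format: the range is spread over the first two fields.
        first, last, name = fields
    elif len(fields) == 2 and ".." in fields[0]:
        # New format: the range is a single first..last field.
        first, last = fields[0].split("..")
        name = fields[1]
    else:
        raise ValueError("unrecognized Blocks.txt line: " + line)
    return int(first, 16), int(last, 16), name
```

A reader that assumes the new format and takes the second field as the block name would read `9FFF` as the name on an old-format line, which matches the output above.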

eggrobin, Jan 16 '24 00:01

Broadened this issue to cover the wider problem of files changing format between versions. Cleverness will be needed at some point.

eggrobin, Jan 19 '24 01:01

Documentation needs a bit of work!

Blocks: There is provision for variant formats in the data files; look at IndexUnicodeProperties, e.g.:

# Unicode 13 moves kTotalStrokes to Unihan_IRGSources.txt.
# The line with the new location (i.e., the line without version number)
# must occur in this file before the line with the old location.
Unihan_DictionaryLikeData ; kTotalStrokes ; v12.1

For (unique) radical changes like that, it may be better to special-case them in the code. Ugh.
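As a rough Python sketch of how such versioned lines can be resolved (the real logic lives in the Java code and may differ): the unversioned `Unihan_IRGSources ; kTotalStrokes` line in the sample is an assumed spelling of "the line with the new location", and the rule that a trailing version means "that version or below" is an assumption about the intended semantics. `load_locations` and `file_for` are hypothetical names.

```python
def parse_version(text):
    """Turn 'v12.1' or '12.1' into a comparable tuple (12, 1)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def load_locations(lines):
    """Map each property to a list of (max_version or None, file_name)."""
    locations = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        file_name, prop = fields[0], fields[1]
        # A third field is the last version for which this location applies.
        max_version = parse_version(fields[2]) if len(fields) > 2 else None
        locations.setdefault(prop, []).append((max_version, file_name))
    return locations


def file_for(locations, prop, version):
    """Pick the file that holds prop in the given Unicode version."""
    wanted = parse_version(version)
    versioned = [(limit, name) for limit, name in locations[prop]
                 if limit is not None and limit >= wanted]
    if versioned:
        # The tightest applicable upper bound wins.
        return min(versioned)[1]
    for limit, name in locations[prop]:
        if limit is None:
            return name
    raise KeyError(prop + " has no location for version " + version)


SAMPLE = [
    "# Unicode 13 moves kTotalStrokes to Unihan_IRGSources.txt.",
    "Unihan_IRGSources ; kTotalStrokes",            # assumed new-location line
    "Unihan_DictionaryLikeData ; kTotalStrokes ; v12.1",
]
```

For example, `file_for(load_locations(SAMPLE), "kTotalStrokes", "12.1")` selects `Unihan_DictionaryLikeData`, while version `"13.0"` falls through to `Unihan_IRGSources`.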

Identifier_Type: We could extend ExtraPropertyValueAliases.txt in the same way; a trailing Age would mean that the line applies to that version or below.

macchiati, Jan 19 '24 02:01