unicodetools icon indicating copy to clipboard operation
unicodetools copied to clipboard

CheckProperties exception for RGI_Emoji_Flag_Sequence

Open markusicu opened this issue 2 years ago • 8 comments

When I run the unicodetools "CheckProperties" I see an exception getting thrown and printed, but the program continues and exits without a failure code.

Is this something to worry about?

Comment from KenW: The only difference I am seeing between 15.0 and 15.1 is that the flag of Turkey ended up described as "flag: Türkiye"

Thoughts from Markus:

  • If there is a problem, then we should fix it.
  • If there is no problem, then it should not throw an exception.
  • CheckProperties seems to swallow the exception, and finishes with exit code 0. This could hide real problems.
RGI_Emoji_Flag_Sequence 15.1.0.0
com.ibm.icu.util.ICUException: RGI_Emoji_Flag_Sequence( from: /usr/local/google/home/mscherer/unitools/mine/src/unicodetools/data/emoji/15.1/emoji-sequences.txt)
at org.unicode.props.IndexUnicodeProperties.load(IndexUnicodeProperties.java:446)
at org.unicode.propstest.CheckProperties.compare(CheckProperties.java:662)
at org.unicode.propstest.CheckProperties.main(CheckProperties.java:219)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.unicode.props.UnicodePropertyException: Key already present in Map: Basic_Emoji, old: 0000-2319=No
231A-231B=Yes
231C-23E8=No
23E9-23EC=Yes
...
1F6F0,FE0F=Yes
1F6F3,FE0F=Yes
, new:
at org.unicode.props.PropertyUtilities.putNew(PropertyUtilities.java:33)
at org.unicode.props.PropertyParsingInfo.parseSourceFile(PropertyParsingInfo.java:465)
at org.unicode.props.IndexUnicodeProperties.load(IndexUnicodeProperties.java:443)
... 8 more

@nedley @kenlunde @macchiati

markusicu avatar Sep 19 '23 22:09 markusicu

Running the tool just now gives me the same exception but for a different property:

Modifier_Combining_Mark	15.1.0.0
com.ibm.icu.util.ICUException: Modifier_Combining_Mark( from: /Volumes/Common/projects/unicode/unicodetools/unicodetools/data/ucd/15.1.0/PropList.txt)
	at org.unicode.props.IndexUnicodeProperties.load(IndexUnicodeProperties.java:492)
	at org.unicode.props.IndexUnicodeProperties.load(IndexUnicodeProperties.java:454)
	at org.unicode.propstest.CheckProperties.compare(CheckProperties.java:656)
	at org.unicode.propstest.CheckProperties.main(CheckProperties.java:219)
Caused by: org.unicode.props.UnicodePropertyException: Key already present in Map: Logical_Order_Exception,	old: 0000-0E3F=No
0E40-0E44=Yes
0E45-0EBF=No
0EC0-0EC4=Yes
0EC5-19B4=No
19B5-19B7=Yes
19B8-19B9=No
19BA=Yes
19BB-AAB4=No
AAB5-AAB6=Yes
AAB7-AAB8=No
AAB9=Yes
AABA=No
AABB-AABC=Yes
AABD-10FFFF=No
,	new: 
	at org.unicode.props.PropertyUtilities.putNew(PropertyUtilities.java:34)
	at org.unicode.props.PropertyParsingInfo.parseSourceFile(PropertyParsingInfo.java:501)
	at org.unicode.props.IndexUnicodeProperties.load(IndexUnicodeProperties.java:488)
	... 3 more

nedley avatar Sep 17 '24 20:09 nedley

Thoughts from Markus: CheckProperties seems to swallow the exception, and finishes with exit code 0. This could hide real problems.

CheckProperties was turned into a CI step by https://github.com/unicode-org/unicodetools/pull/253 for https://github.com/unicode-org/unicodetools/issues/187. But it was never written as a CI test; it was a tool you ran and whose output you read.

The documentation says

Run /unicodetools/org/unicode/propstest/CheckProperties.java to see what changed and sanity check.

Fix any failures, such as: [an exception]

This means that we are not testing whatever was only tested by that, and I do not think we are manually running this tool and looking at its output anymore either.

And I think the validity of values in data files is one of those things; this would explain the Arabic_Tashkil problem noted by @jowilco.

eggrobin avatar Apr 03 '25 14:04 eggrobin

Missing enum values are being caught by org.unicode.props.PropertyParsingInfo#checkEnum. The errors are stored in DATA_LOADING_ERRORS, but the only place that DATA_LOADING_ERRORS is being read is in CheckProperties.

jowilco avatar Apr 03 '25 18:04 jowilco

Indeed.

Which would fail, if it were a thing that could fail. In fact it fails so much (because of unmaintained regexes) that the checkEnum failure isn’t actually printed (but it occurs, see the CI failure on the second commit in https://github.com/unicode-org/unicodetools/pull/1083).

What a mess.

I think we should change checkEnum to throw directly, and deal with the rest of the CheckProperties errors another day—the regexes are known to be out of date, for instance.

@jowilco, @markusicu, what do you think?

eggrobin avatar Apr 03 '25 18:04 eggrobin

That works for me, but I'm not sure that I know enough to make the call.

jowilco avatar Apr 03 '25 18:04 jowilco

I think we should change checkEnum to throw directly, and deal with the rest of the CheckProperties errors another day—the regexes are known to be out of date, for instance.

@jowilco, @markusicu, what do you think?

sgtm changing checkEnum() to throw, and fixing what then fails, also sounds painful...

markusicu avatar Apr 03 '25 19:04 markusicu

changing checkEnum() to throw, and fixing what then fails, also sounds painful...

It is not too bad for the current version. On older versions it is a different story (today I learned about CCC=37 for U+0901 until Unicode 3.0…)

eggrobin avatar Apr 07 '25 10:04 eggrobin

(today I learned about CCC=37 for U+0901 until Unicode 3.0…)

I got curious about that. That went away in Unicode 2.1.8:

  • https://www.unicode.org/Public/2.1-Update2/UnicodeData-2.1.5.txt
  • https://www.unicode.org/Public/2.1-Update3/UnicodeData-2.1.8.txt
  • 1998-dec: https://www.unicode.org/history/publicationdates.html

In those days, many things about Unicode were not stable that have long since been stabilized. I believe Unicode normalization was stabilized in 3.0, modulo the NormalizationCorrections until 4.1 or so.

markusicu avatar Apr 07 '25 15:04 markusicu