
connect the sifter code to the Unicode Tools

Open markusicu opened this issue 3 years ago • 8 comments

The UCA sifter code that is checked into the Unicode Tools calls into ICU for access to character properties.

This is inconvenient:

  • It relies on extra setup to make it build against a recent version of ICU.
  • During UCA development, which is when we run the sifter, we need to use the latest draft Unicode properties data. By relying on ICU, we need to first integrate that snapshot of properties data into ICU and rebuild the sifter against that.

This is cumbersome to do often, which means we don't run this version of the sifter as often as we should.

We should make the sifter independent of ICU, or at least independent of a version of ICU that has the latest draft properties data.

For example, a new Unicode Tool could write *_data.h files with data initializers for the data of relevant properties, and the sifter code could either use that data directly, or build ICU tries from it on the fly.

@Ken-Whistler @macchiati FYI

markusicu avatar Aug 12 '22 21:08 markusicu

Wish list:

  • Write a *_data.h file, e.g., ucd_data.h, with C initializers. Include that into unisifex.c. Check it into the same sifter folder.
  • For each boolean unisifex “property”, write an inversion list of range limit code points like in ICU UnicodeSet.
  • In unisifex.c, write a function like UnicodeSet::findCodePoint(c) to do the lookup. Plus the last few lines from UnicodeSet::contains(c) which is just above findCodePoint().
  • Sample C++ code for writing an array of values: ICU toolutil/writesrc.cpp usrc_writeArray()
  • There are some unisifex.c functions that map a code point to an integer value (e.g., Numeric_Value, which might be used only for gc=Nd). We should write an inversion map for each of those. Array of (range limit, value) pairs for consistency with the ICU inversion list, or maybe (range start, value). Maybe like struct CheckRange in some ICU test code.
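
As a rough illustration of the inversion list and inversion map items above, here is a minimal C sketch. Everything in it is an assumption made for illustration: the names example_set, example_map, RangeValue, set_contains and map_lookup are hypothetical, the data is made up rather than taken from the UCD or the sifter, and the map uses the (range start, value) variant. Unlike ICU's UnicodeSet, the list here carries no 0x110000 terminator; the length is passed explicitly instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical excerpt of a generated ucd_data.h. Illustrative data only. */

/* Inversion list: entries at even indexes start a range that is in the set,
 * entries at odd indexes start a range that is not. */
static const int32_t example_set[] = {
    0x0030, 0x003A,  /* U+0030..U+0039 */
    0x0041, 0x005B   /* U+0041..U+005A */
};
#define EXAMPLE_SET_LENGTH (sizeof(example_set) / sizeof(example_set[0]))

/* Inversion map: each entry gives the value for code points from `start`
 * up to (not including) the next entry's start. */
typedef struct {
    int32_t start;
    int32_t value;
} RangeValue;

static const RangeValue example_map[] = {
    {0x0000, -1},  /* -1 stands for "no value" in this sketch */
    {0x0030, 0},
    {0x0031, 1},
    {0x0032, 2},
    {0x0033, -1}   /* deliberately truncated after U+0032 to keep this short */
};
#define EXAMPLE_MAP_LENGTH (sizeof(example_map) / sizeof(example_map[0]))

/* Binary search: counts the range boundaries that are <= c.
 * An odd count means c lies inside an included range. */
static bool set_contains(const int32_t *list, size_t length, int32_t c) {
    size_t lo = 0, hi = length;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return (lo & 1) != 0;
}

/* Binary search for the last entry whose start is <= c.
 * Requires the first entry to start at 0 so that every code point maps. */
static int32_t map_lookup(const RangeValue *map, size_t length, int32_t c) {
    assert(length > 0 && map[0].start == 0 && c >= 0);
    size_t lo = 0, hi = length;  /* invariant: map[lo].start <= c */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].start <= c) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return map[lo].value;
}
```

With this made-up data, set_contains(example_set, EXAMPLE_SET_LENGTH, 0x41) is true and map_lookup(example_map, EXAMPLE_MAP_LENGTH, 0x31) returns 1. A generated header would only need to emit the two arrays; the lookup functions could live in unisifex.c.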

With this, we can manually run the new tool, update the _data.h file, and then run the sifter code to generate a new allkeys.txt.

New characters need to be added to sifter/unidata.txt.

See recent commits that affected the sifter files for how Ken has been working on UCA data, and how I have been ingesting his updates: https://github.com/unicode-org/unicodetools/commits/main/c/uca/sifter

markusicu avatar Oct 03 '23 20:10 markusicu

PS: After some iterations in ICU, we settled there on the _data.h suffix for files with generated initializers, generated by calling writesrc.cpp functions. Examples:

  • https://github.com/unicode-org/icu/blob/main/icu4c/source/common/uchar_props_data.h
  • https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ucase_props_data.h
  • https://github.com/unicode-org/icu/blob/main/icu4c/source/common/norm2_nfc_data.h

markusicu avatar Oct 03 '23 20:10 markusicu

Is unisift_IsIgnorable intentionally different from Default_Ignorable_Code_Point?

eggrobin avatar Oct 04 '23 13:10 eggrobin

Is unisift_IsIgnorable intentionally different from Default_Ignorable_Code_Point?

Yes.

eggrobin avatar Oct 04 '23 13:10 eggrobin

Wish list:

Your wishes have not quite been granted; I ended up just writing a modern C++ UCD parser (and CodePointRange/CodePointSet to hold the data, with boring std::maps for the non-default simple case mappings and integral numeric values) to back the extern unisift_ functions, since that seemed like the fastest way to get things running (well, short of using the Ada UCD parser I wrote a few months ago, but let’s try to avoid situations where our tools are maintainable by only one Unicadetto).

See https://github.com/eggrobin/unicodetools/blob/sifter/c/uca/sifter/unisifeggs.cpp. It needs minor cleanup and splitting of the C++ classes into header files, but it works: I have regenerated allkeys.txt, with the only diff being

index 88ff7166..e6470571 100644
--- a/unicodetools/data/uca/dev/allkeys.txt
+++ b/unicodetools/data/uca/dev/allkeys.txt
@@ -1,5 +1,5 @@
 # allkeys-15.1.0.txt
-# Date: 2023-05-09, 11:32:56 GMT [KW]
+# Date: 2023-10-04, 16:51:44 GMT [KW]
 # Copyright 2023 Unicode, Inc.
 # For terms of use, see https://www.unicode.org/terms_of_use.html
 #

(Clearly we should pull the [KW] from an extern function…)

Aside: there are some bells and whistles that I ended up spontaneously writing there that might be nice to have in ICU4C as part of those modernization ideas we had been discussing (e.g. iterability; as far as I can tell from a glance at the docs, its UnicodeSet, contrary to ICU4J’s, isn’t iterable in a way that is idiomatic for the language and compatible with range-based loops, algorithms, etc.).

eggrobin avatar Oct 04 '23 15:10 eggrobin

Is unisift_IsIgnorable intentionally different from Default_Ignorable_Code_Point?

Yes.

The "Only in A" list is mostly [:Cn:] except for 7 other characters. Try [:DI:]-[:Cn:] for A.

markusicu avatar Oct 04 '23 15:10 markusicu

So, now that I can run the sifter, I still need to update unidata.txt for it to actually do something useful.

I gather this is the relevant documentation? https://github.com/unicode-org/unicodetools/blob/main/docs/uca/ducet.md

eggrobin avatar Oct 04 '23 18:10 eggrobin

So, now that I can run the sifter, I still need to update unidata.txt for it to actually do something useful.

I gather this is the relevant documentation? https://github.com/unicode-org/unicodetools/blob/main/docs/uca/ducet.md

That's from Unicode 10. Newer versions of those "work logs" are in the pull requests that you can find from here: https://github.com/unicode-org/unicodetools/commits/main/c/uca/sifter

I can help... probably next week.

We should ask @Ken-Whistler for the relative order of new (16.0) scripts vs. old scripts, on a whole-script basis.

markusicu avatar Oct 04 '23 18:10 markusicu