ICU collation parsing and generation problems

Open mhosken opened this issue 3 years ago • 2 comments

There are a number of issues with the collation rules in ICU syntax that it would be good to resolve. I think a short example might help. Here is the first line of a simple sort order specification: a/A aa á/Á, and the resulting start of the generated ICU style collation tailoring: [before 1] [first regular] < a\/A << aa << á\/Á.

Looking at how ICU parses rule strings, it distinguishes strings from syntactic elements. Both < and / are syntactic elements, so a/A is parsed as three elements (a, / and A), which is an expansion that effectively says: sort a after the previous element with an A appended. If, on the other hand, the / is escaped, as in a\/A (as in the generated LDML), the / is treated as part of the string and the whole thing is parsed as the single string a/A, which is not what is wanted either. The correct way to interpret / in the simple ordering is as a third-level (tertiary) difference, so a/A should convert to a <<< A.
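To make the intended mapping concrete, here is a minimal Python sketch (not libpalaso code; the function name is made up for illustration). It assumes that each line of the simple order starts a new primary difference, that spaces separate secondary differences (as the << in the generated output suggests), and that / separates tertiary differences. It does no escaping at all.

```python
def simple_line_to_icu(line: str) -> str:
    """Convert one line of a simple sort order into an ICU rule fragment.

    Assumed mapping (hypothetical helper, for illustration only):
      new line -> "<"    primary difference
      space    -> "<<"   secondary difference
      "/"      -> "<<<"  tertiary difference
    """
    groups = []
    for group in line.split():
        # Items joined by "/" differ only at the third level.
        groups.append(" <<< ".join(group.split("/")))
    # Space-separated groups differ at the second level.
    return "< " + " << ".join(groups)


print(simple_line_to_icu("a/A aa á/Á"))
# prints: < a <<< A << aa << á <<< Á
```

Compare this with the currently generated < a\/A << aa << á\/Á, where the / survives as a literal character inside the string.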

In general, this means that:

  • syntactic parts of the collation rule should not be escaped
  • syntactic characters that occur inside collation element strings should be escaped

I think this means you can't just run the whole collation rule through a general escaper/unescaper. Instead the escaping needs to be inserted when the collation rule is generated from the simple rules. I.e. the ICU generator produces syntactically correct ICU tailoring from the get go and that just gets copied into the LDML inside a CDATA section. No extra escaping is needed outside of what ICU wants to see.
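A rough Python sketch of what escaping at generation time could look like (again not libpalaso code; the names are hypothetical). It assumes that ICU treats printable ASCII characters other than letters and digits as reserved in rule strings, so they are backslash-escaped inside collation elements, while the operators inserted by the generator are emitted untouched. Control characters and other special cases are not handled here.

```python
def escape_element(text: str) -> str:
    """Backslash-escape reserved characters inside one collation element.

    Assumption: printable ASCII that is not a letter or digit is reserved
    in ICU rule syntax. Non-ASCII characters are passed through as-is.
    """
    out = []
    for ch in text:
        if " " <= ch <= "~" and not ch.isalnum():
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


def join_elements(elements, operator="<<<"):
    """Join already-separated elements with an unescaped ICU operator."""
    return f" {operator} ".join(escape_element(e) for e in elements)


print(join_elements(["a", "A"]))        # prints: a <<< A
print(join_elements(["<", "/"], "<<"))  # prints: \< << \/
```

The point of the design is that escaping is applied per element, before the syntax is added, so the finished rule string never needs a second pass through a general escaper.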

And just to rub it in. The current LDML collation rules, therefore, are junky and cannot be used by any other tools. For example, when I read in LDML from DBL bundles, I dump the ICU collation and regenerate it (complete with minimisation) from the simple order. I notice that SIL.WritingSystems does the same in ignoring the ICU tailoring, which could explain why the generated ICU rules aren't getting any testing?

mhosken avatar Aug 20 '21 04:08 mhosken

@mhosken Can you add a pointer to the file that has the problem? That would help someone not so familiar with how everything works...

ermshiperete avatar Aug 20 '21 09:08 ermshiperete

SIL.WritingSystem/LdmlCollationParser.cs. On output the data is simply copied across directly, but the parser does some transformation of the tailoring string. I wonder if it should be the other way around, with the generation from Simple to ICU doing all the escaping. The mapping from LDML to ICU is 1:1 with no transformation needed.

mhosken avatar Aug 20 '21 15:08 mhosken