Handling of digraphs and trigraphs
It would be useful to establish a systematic approach to digraphs (and, more generally, any multigraphs), as they are sometimes considered part of the standardised alphabet. This is not the case in English, but it is in Czech (ch) and Hungarian (cs, dz, dzs, gy, ly, ny, sz, ty, zs [from Wikipedia]).
They are not too important from a type-design perspective, as the individual characters may combine with more characters than just those in the digraphs. Hypothetically, their list could be used to inform more meaningful decorative ligatures, but that is about it, I think.
They are important if we are encoding standardised orthographies, as without them these are incomplete. They are also important for sorting, but that is not something we deal with.
Technically, base+mark combinations can be seen as digraphs, too, and we include those.
What do we think?
Insofar as they are part of the official orthography, I am in favor of this. After all, the database is not collecting design requirements but orthographies.
Also relevant for https://github.com/rosettatype/hyperglot/pull/114
For #114 there was a duplicate n that probably came from digraph ny in the alphabet.
It could be removed, since digraphs are currently not kept, or it could stay if keeping them becomes the new scheme.
A few digraphs are interesting, but n-grams would be just as relevant, if not more so. A lot of languages do not define their digraphs in their alphabets, and in many cases they are not interesting. Knowing that a language uses the n-grams fì, qj or įj is more interesting than knowing it uses the digraphs cz, gh or rr, for example.
Since the digraphs-as-letter-of-alphabet info is generally available, it could be added.
I can't pinpoint the commit where we changed how hyperglot-save parses the characters, but as of this writing di/trigraphs are retained as they appear in the character lists. So this is now about adding that data to orthographies where we previously did not retain those combinations. Also, the letters comprising a di/trigraph are no longer extracted and appended to the orthography on saving; this happens only when parsing the list, to confirm that all individual characters are in fact added to the check.
I suppose something like n-grams or possible/common combinations is out of scope (at least for now), if that is what @moyogo was referring to. Compiling a list of possible combinations is one thing; retaining "interesting" combinations is another. E.g. those samples imply to me that they are useful for checking kerning collisions, but how to pick?
Related to this as well: hyperglot-save retains the order of characters in base (I vaguely remember this being the trigger why we changed the saving implementation), so there likely are orthographies where we should fix the order of characters to correctly represent the official orthography.
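To make the behaviour described above concrete, here is a minimal sketch in Python. It is not the actual hyperglot-save code; the function names `split_clusters` and `parse_base` and the space-separated input format are assumptions for illustration only. It keeps di/trigraphs as written and in their original order, and derives the individual characters separately, purely for checking. Combining marks stay attached to their base so that base+mark combinations are not split.

```python
import unicodedata


def split_clusters(token):
    """Split a token into base characters, keeping combining marks attached."""
    clusters = []
    for ch in token:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


def parse_base(base):
    """Hypothetical helper: return (tokens to store, characters to check).

    Stored tokens keep multigraphs whole and preserve input order.
    Checked characters are the deduplicated components of all tokens.
    """
    stored, checked = [], []
    for token in base.split():
        if token not in stored:
            stored.append(token)
        for cluster in split_clusters(token):
            if cluster not in checked:
                checked.append(cluster)
    return stored, checked


stored, checked = parse_base("a b c ch d n ny")
print(stored)   # multigraphs retained, order preserved
print(checked)  # individual characters only, no duplicate n
```

With this split, a digraph such as ny no longer contributes a second n to the stored list, which is the kind of duplicate seen in #114.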
See #172 for some discussion related to including upper case variants of digraphs; it's unclear whether the "uppercase" variant of a digraph should be double upper case (like in squ) or title case (like in Czech/Hungarian orthographic references).
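For reference, the two candidate conventions differ as in the following sketch. The helper name `case_variants` is hypothetical, and the sketch assumes the first element of the multigraph is a single code point; which variant is correct for a given orthography is exactly the open question.

```python
def case_variants(multigraph):
    """Return (double upper case, title case) for a lowercase multigraph."""
    upper = multigraph.upper()
    # Title case: uppercase only the first character, leave the rest as is.
    title = multigraph[:1].upper() + multigraph[1:]
    return upper, title


for mg in ["ch", "dzs", "ny"]:
    upper, title = case_variants(mg)
    print(mg, upper, title)
```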