Syntax for letting orthography know more about words than their spelling
Problem
English orthography rules can be written with regular expressions:
"tap" + "{^er}" = "tapper"
"tape" + "{^er}" = "taper"
Languages with ambiguous spelling don't necessarily work like that. Here's an example of how to hard-code the spellings of some Danish words with suffixes:
"STÆÅL": "stil", // pronounced /'stelˤ/
"STAOÆL": "stil", // same spelling, but pronounced /'stiˤl/, which is considered to have a long vowel
// Some suffixes follow
"-E": "{^e}",
"-T": "{^et}",
"-D": "{^ede}",
"STÆÅL/-E": "stille", // pronounced /'stel.e/
"STÆÅEL": "stille", // same suffix, folded in
"STAOÆL/-E": "stile", // pronounced /'stiːl.e/
"STAOÆEL": "stile", // tucked
"STÆÅL/-T": "stillet", // pronounced /'stel.eð/
"STÆÅLT": "stillet", // tucked
"STAOÆL/-T": "stilet", // pronounced /'stiːl.eð/
"STAOÆLT": "stilet", // tucked
"STÆÅL/-D": "stillede", // pronounced /'stel.eːð/
"STÆÅLD": "stillede", // tucked
"STAOÆL/-D": "stilede", // pronounced /'stiːl.eːð/
"STAOÆLD": "stilede" // tucked
All in all, for every word whose vowel length can't be guessed from the Danish spelling, we currently need a separate dictionary entry for each combination of that word with any suffix that changes the spelling depending on the vowel length, plus the tucked versions if the suffix is foldable.
One possible solution
Have some syntax that allows conditional changes to the spelling of words. For example:
"STÆÅL": "stil{~+|l}", // writes "stil" on its own, but changes it to "still" when adding suffixes
"STAOÆL": "stil",
And the suffixes could probably be written normally:
"-E": "{^e}",
"-T": "{^et}",
"-D": "{^ede}"
Now, writing "STAOÆL/-E" would produce "stile" and "STÆÅL/-E" would produce "stille".
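To make the intended behaviour concrete, here is a minimal Python sketch of how the proposed {~+A|B} marker could be resolved. It is not Plover code: the names CONDITIONAL, SUFFIX, render and translate are made up for illustration, and it assumes every translation after the first one in an outline is a plain {^...} suffix.

```python
import re

# Hypothetical marker from this proposal: {~+A|B} writes A when the word
# stands alone and B when a suffix is attached.
CONDITIONAL = re.compile(r'\{~\+([^|}]*)\|([^}]*)\}')
# A suffix translation such as "{^ede}".
SUFFIX = re.compile(r'^\{\^(.*)\}$')

DICTIONARY = {
    "STÆÅL": "stil{~+|l}",
    "STAOÆL": "stil",
    "-E": "{^e}",
    "-T": "{^et}",
    "-D": "{^ede}",
}

def render(translations):
    """Join a word's translation with the suffix translations that follow it."""
    word, *rest = translations
    suffixes = [SUFFIX.match(t).group(1) for t in rest]
    # Use parameter B (group 2) if at least one suffix follows, else A (group 1).
    group = 2 if suffixes else 1
    stem = CONDITIONAL.sub(lambda m: m.group(group), word)
    return stem + ''.join(suffixes)

def translate(outline):
    """Look up each stroke of an outline like "STÆÅL/-E" and render the result."""
    return render([DICTIONARY[stroke] for stroke in outline.split('/')])

print(translate("STÆÅL"))      # stil
print(translate("STÆÅL/-E"))   # stille
print(translate("STAOÆL/-E"))  # stile
```

With this, the two stems need one entry each and the suffixes need one entry each, instead of one entry per stem and suffix combination.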
For other languages, prefixes might interact with words in a similar way, and if prefixes can change the beginning of a word and suffixes can change the end, then we might want a syntax that allows for both things.
- {+~+A|B} could write as A when in contact with a prefix or suffix; otherwise it would write B.
- {+~A|B} could write as B when in contact with a prefix; otherwise it would write A.
- {~+A|B} could write as B when in contact with a suffix; otherwise it would write A.
- {+~+A|B|C|D} could write as A when not in contact with a prefix or suffix, B when in contact with a prefix and no suffix, C when in contact with a suffix and no prefix, and D when in contact with both a prefix and a suffix.
- Any of the parameters could be an empty string, but supplying empty strings as all of the parameters wouldn't be useful, so we might want to throw a warning if that happens.
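The selection rules for these forms can be sketched as a small Python function. This is only an illustration of the proposal, not an existing Plover API: the function name choose and its arguments are hypothetical, and the four-parameter form is read as A, B, C, D for the cases "neither", "prefix only", "suffix only" and "both", in that order.

```python
import warnings

def choose(kind, params, has_prefix, has_suffix):
    """Pick the text a conditional marker should write.

    kind is '+~+', '+~' or '~+' (hypothetical names taken from the proposal),
    params is the list of strings between the pipes.
    """
    if not any(params):
        # All parameters empty: the marker can never write anything.
        warnings.warn('conditional with only empty parameters has no effect')
    if kind == '+~+' and len(params) == 4:
        # Index 0: neither, 1: prefix only, 2: suffix only, 3: both.
        return params[(1 if has_prefix else 0) + (2 if has_suffix else 0)]
    a, b = params  # every other form takes exactly two parameters
    if kind == '+~+':
        return a if (has_prefix or has_suffix) else b
    if kind == '+~':
        return b if has_prefix else a
    if kind == '~+':
        return b if has_suffix else a
    raise ValueError(f'unknown conditional kind: {kind!r}')

# "stil{~+|l}" followed by a suffix picks the second parameter.
print(choose('~+', ['', 'l'], has_prefix=False, has_suffix=True))  # l
```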
I think the one that takes four parameters probably won't be used for that many languages, but I thought I'd include it just in case a speaker of such a language decides to try to use Plover for it. Alternatively, one could make these commands nestable, but that could easily get messy. As for the unlikely case that someone wants to use a pipe character inside one of the parameters, I guess it could be made escapable. I'm sure there are still some languages that this won't suffice for (e.g. words changing the spelling of adjacent words even though spaces are written between them, or multiple categories of prefixes or suffixes, with words taking different forms depending on the category). But for Danish, this should be enough.
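One way the escapable pipe could work is sketched below in Python. The choice of a backslash as the escape character is an assumption made for this example, and split_params is a hypothetical helper, not something Plover provides.

```python
def split_params(body):
    # Split the text between the marker and the closing brace on unescaped
    # pipes. Assumption: a backslash escapes the character that follows it,
    # so a backslash followed by a pipe yields a literal pipe.
    params, current, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == '\\' and i + 1 < len(body):
            current.append(body[i + 1])  # keep the escaped character as-is
            i += 2
        elif ch == '|':
            params.append(''.join(current))  # unescaped pipe ends a parameter
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    params.append(''.join(current))
    return params

print(split_params('|l'))       # ['', 'l']
print(split_params(r'a\|b|c'))  # ['a|b', 'c']
```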
Another possible solution
Encode the stems (or themes, or parts of speech, or genders, or classes, or whatever a language requires) in the dictionary entries in a way that gets discarded from output but is visible to orthography regexes.
For this example, I'm assuming that Plover treats {+:X}, whatever X is, as a do-nothing command, but doesn't remove it until after applying orthography rules. I'm trying to make an example with reasonably short and simple regexes, but in a real plugin I imagine I might use something a bit more complex to account for words that change the order of some letters.
Dictionary:
"STAOÆL": "stil",
"STÆÅL": "stil{+:l}",
"-R": "{^er}",
"-B": "{^e}",
"-T": "{^et}",
"-D": "{^ede}",
Orthography:
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ e$', r'\1\3\5e'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ en$', r'\1\3\5en'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ et$', r'\1\3\5et'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ er$', r'\1\3\5er'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ ede$', r'\1\3\5ede')
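The rules above can be tried outside Plover with a short Python simulation. This sketch only imitates the assumed behaviour (the rules see the text "word ^ suffix" with the {+:X} marker still present, and the marker is discarded afterwards); the names MARKER, attach and finish are invented for the example.

```python
import re

# The orthography rules from the example above, unchanged.
RULES = [
    (r'^(.*)(\{\+:)(.*)(\})(.*) \^ e$', r'\1\3\5e'),
    (r'^(.*)(\{\+:)(.*)(\})(.*) \^ en$', r'\1\3\5en'),
    (r'^(.*)(\{\+:)(.*)(\})(.*) \^ et$', r'\1\3\5et'),
    (r'^(.*)(\{\+:)(.*)(\})(.*) \^ er$', r'\1\3\5er'),
    (r'^(.*)(\{\+:)(.*)(\})(.*) \^ ede$', r'\1\3\5ede'),
]

# Assumed do-nothing command that must never reach the output.
MARKER = re.compile(r'\{\+:[^}]*\}')

def attach(word, suffix):
    """Join word and suffix, letting the first matching rule decide the spelling."""
    joined = f'{word} ^ {suffix}'
    for pattern, replacement in RULES:
        if re.match(pattern, joined):
            return re.sub(pattern, replacement, joined)
    # No rule matched: plain concatenation, with any marker discarded.
    return MARKER.sub('', word) + suffix

def finish(word):
    """Output a word with no suffix: the marker is simply dropped."""
    return MARKER.sub('', word)

print(finish('stil{+:l}'))         # stil
print(attach('stil{+:l}', 'ede'))  # stillede
print(attach('stil', 'ede'))       # stilede
```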
Or maybe, as nimble suggested on Discord, invisible characters for the regexes to handle (which wouldn't themselves get output) would be a more flexible solution.
Edit: I'd written the wrong username of the Discord user that made this suggestion.
I believe this is already implemented in Benoit Pierre's branch for the Melani system. The issue is https://github.com/openstenoproject/plover/issues/987
I'm not sure where the branch is, or where to find examples of how to use it.
On Sun, Aug 5, 2018, 3:10 PM SeaLiteral wrote:
Or maybe, as fletchers suggested on Discord, invisible characters for the regexes to handle (which wouldn't themselves get output) would be a more flexible solution.
The feature in Melani seems to only look across word boundaries. So a word can depend on the next word but one part of the word can't depend on another part of the same word, which is what Danish needs. But I understand the two issues would be touching the same functions, so I guess one needs to be fixed first while the other waits (simultaneously fixing the two would probably make the changes hard to merge). I feel like I could fix this one in a couple of hours if I try, but if there are other people working on formatting.py at the same time I'd rather wait till they're done.