plover Syntax for letting orthography know more about words than their spelling

Problem

English orthography rules can be written with regular expressions:

"tap"+"{^er}"="tapper" "tape"+"{^er}"="taper"

Languages with ambiguous spelling don't necessarily work like that. Here's an example of how to hard-code the spellings of some Danish words with suffixes:

"STÆÅL": "stil", // pronounced /'stelˤ/
"STAOÆL": "stil", // same spelling, but pronounced /'stiˤl/, which is considered to have a long vowel
// Some suffixes follow
"-E": "{^e}",
"-T": "{^et}",
"-D": "{^ede}",
"STÆÅL/-E": "stille", //  pronounced /'stel.e/
"STÆÅEL": "stille", // same suffix, folded in
"STAOÆL/-E": "stile", // pronounced /'stiːl.e/
"STAOÆEL": "stile", // tucked
"STÆÅL/-T": "stillet", //  pronounced /'stel.eð/
"STÆÅLT": "stillet", // tucked
"STAOÆL/-T": "stilet", // pronounced /'stiːl.eð/
"STAOÆLT": "stilet", // tucked
"STÆÅL/-D": "stillede", //  pronounced /'stel.eːð/
"STÆÅLD": "stillede", // tucked
"STAOÆL/-D": "stilede", // pronounced /'stiːl.eːð/
"STAOÆLD": "stilede" // tucked

All in all, for every word whose vowel length can't be guessed from the Danish spelling, we'd currently need a separate dictionary entry for each combination of that word with any suffix that changes the spelling depending on vowel length, including the tucked versions if the suffix is foldable.

One possible solution

Have some syntax that allows conditional changes to the spelling of words. For example:

"STÆÅL": "stil{~+|l}"; // writes "stil" on its own, but changes it to "still" when adding suffixes
"STAOÆL": "stil";

And the suffixes could probably be written normally:

"-E": "{^e}",
"-T": "{^et}",
"-D": "{^ede}"

Now, writing "STAOÆL/-E" would produce "stile" and "STÆÅL/-E" would produce "stille".

For other languages, prefixes might interact with words in a similar way, and if prefixes can change the beginning of a word and suffixes can change the end, then we might want a syntax that allows for both things.

{+~+A|B} could write as B when in contact with a prefix or suffix, otherwise it would write A.
{+~A|B} could write as B when in contact with a prefix, otherwise it would write A.
{~+A|B} could write as B when in contact with a suffix, otherwise it would write A.
{+~+A|B|C|D} could write as A when not in contact with a prefix or suffix, B when in contact with a prefix and no suffix, C when in contact with a suffix and no prefix, and D when in contact with both a prefix and a suffix.
Any of the parameters could be an empty string, but supplying empty strings as all of the parameters wouldn't be useful so we might want to throw a warning if that happens.
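To make the proposed behaviour concrete, here is a minimal sketch of how such markers could be resolved once it's known whether the word touches a prefix or suffix. None of this exists in Plover: the `resolve` function and its `prefixed`/`suffixed` flags are hypothetical names, and the four-parameter order (neither, prefix only, suffix only, both) is an assumption. The two-parameter `{+~+A|B}` form and pipe escaping are left out to keep it short.

```python
import re

# Hypothetical marker syntax from this proposal:
#   {~+A|B}        A normally, B when a suffix is attached
#   {+~A|B}        A normally, B when a prefix is attached
#   {+~+A|B|C|D}   assumed order: neither, prefix only, suffix only, both
MARKER = re.compile(r'\{(\+~\+|\+~|~\+)([^{}]*)\}')

def resolve(text, prefixed=False, suffixed=False):
    """Replace each conditional marker with the parameter its context selects."""
    def pick(match):
        kind = match.group(1)
        params = match.group(2).split('|')
        if kind == '~+' and len(params) == 2:
            return params[1] if suffixed else params[0]
        if kind == '+~' and len(params) == 2:
            return params[1] if prefixed else params[0]
        if kind == '+~+' and len(params) == 4:
            # Booleans count as 0 or 1, so this indexes 0..3.
            return params[prefixed + 2 * suffixed]
        # Unsupported form: leave the marker untouched.
        return match.group(0)
    return MARKER.sub(pick, text)

print(resolve('stil{~+|l}'))                       # stil
print(resolve('stil{~+|l}', suffixed=True) + 'e')  # stille
```

In a real implementation the flags would presumably come from whatever the formatter already knows about attach commands like {^e}, rather than being passed in by hand.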

I think the one that takes four parameters probably won't be used for that many languages, but I thought I'd include it just in case a speaker of such a language decides to try to use Plover for it. Alternatively, one could make these commands nestable, but that could easily get messy. As for the unlikely case that someone wants to use a pipe character inside one of the parameters, I guess it could be made escapable. I'm sure there are still some languages that this won't suffice for (e.g. words changing the spelling of adjacent words even though spaces are written between them, or multiple categories of prefixes or suffixes, where words take different forms depending on the category of the prefix or suffix). But for Danish, this should be enough.

Another possible solution

Encode the stems (or themes, or parts of speech, or genders, or classes, or whatever a language requires) in the dictionary entries in a way that gets discarded from output but is visible to orthography regexes.

For this example, I'm assuming that Plover treats {+:X}, whatever X is, as a do-nothing command, but doesn't remove it until after applying orthography rules. I'm trying to make an example with reasonably short and simple regexes, but in a real plugin I imagine I might use something a bit more complex to account for words that change the order of some letters.

Dictionary:

"STAOÆL": "stil",
"STÆÅL": "stil{+:l}",
"-R": "{^er}",
"-B": "{^e}",
"-T": "{^et}",
"-D": "{^ede}",

Orthography:

(r'^(.*)(\{\+:)(.*)(\})(.*) \^ e$', r'\1\3\5e'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ en$', r'\1\3\5en'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ et$', r'\1\3\5et'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ er$', r'\1\3\5er'),
(r'^(.*)(\{\+:)(.*)(\})(.*) \^ ede$', r'\1\3\5ede')
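Here is a small sketch of what those rules would do to the two "stil" entries. The `attach` and `strip_marker` helpers are hypothetical stand-ins for whatever Plover would do internally, and the final stripping step is only the assumed "remove {+:X} after orthography" behaviour described above.

```python
import re

# The five rules above, generated from the list of suffixes.
SUFFIXES = ['e', 'en', 'et', 'er', 'ede']
RULES = [
    (r'^(.*)(\{\+:)(.*)(\})(.*) \^ ' + s + '$', r'\1\3\5' + s)
    for s in SUFFIXES
]

def strip_marker(text):
    """Assumed final step: drop any {+:X} that no rule consumed."""
    return re.sub(r'\{\+:[^}]*\}', '', text)

def attach(word, suffix):
    """Hypothetical helper: join word and suffix using the first matching rule."""
    candidate = f'{word} ^ {suffix}'
    for pattern, replacement in RULES:
        result, count = re.subn(pattern, replacement, candidate)
        if count:
            return result
    # No marker in the word, so no rule applies: plain concatenation.
    return strip_marker(word) + suffix

print(attach('stil{+:l}', 'e'))    # stille
print(attach('stil', 'e'))         # stile
print(attach('stil{+:l}', 'ede'))  # stillede
print(strip_marker('stil{+:l}'))   # stil
```

So the marker content only shows up in the output when a suffix rule pulls it in, and the bare word still comes out as "stil".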

Aug 05 '18 15:08 SeaLiteral

Or maybe, as nimble suggested on Discord, invisible characters for the regexes to handle (which wouldn't themselves get output) would be a more flexible solution.

Edit: I'd written the wrong username of the Discord user that made this suggestion.

Aug 05 '18 19:08 SeaLiteral

I believe this is already implemented in Benoit Pierre's branch for the Melani system. The issue is https://github.com/openstenoproject/plover/issues/987

I'm not sure where to find the branch or examples of how to use it.

Aug 05 '18 19:08 panathea

The feature in Melani seems to only look across word boundaries. So a word can depend on the next word but one part of the word can't depend on another part of the same word, which is what Danish needs. But I understand the two issues would be touching the same functions, so I guess one needs to be fixed first while the other waits (simultaneously fixing the two would probably make the changes hard to merge). I feel like I could fix this one in a couple of hours if I try, but if there are other people working on formatting.py at the same time I'd rather wait till they're done.

Aug 16 '18 21:08 SeaLiteral