Affixed forms of "words" exceptions don't get replaced
I’m sure you’ve long been aware of this issue and thought deeply on it!
Affixed forms of words in the exception lists don't get transformed unless specifically listed.
So while “ache” gets transformed, “aches”, “aching” and “ached” do not.
I have no simple solution to this problem though. You can’t just allow any letters to be appended, as smaller words will appear within unrelated longer words (“thou” in “outhouse”). And you can’t even allow a filtered list of affixes, as they can change the pronunciation of the stem (“outhouse” and “outhouses”; this particular example isn’t an issue though as only the apostrophe is added).
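To make the false-match problem concrete, here’s a toy sketch in TypeScript; the exception list and the matching logic are invented for illustration and aren’t how Tecendil actually works:

```typescript
// Toy illustration only: a hypothetical exception list and a naive
// "relaxed" matcher that lets an exception stem match with extra letters
// around it. This is not how Tecendil actually matches its word list.
const exceptions = new Set(["ache", "thou"]);

function naiveMatches(word: string): string[] {
  // Accept any exception that appears somewhere inside the word,
  // i.e. "stem plus arbitrary surrounding letters".
  return [...exceptions].filter((stem) => word.includes(stem));
}

console.log(naiveMatches("aches"));    // ["ache"] (the match we want)
console.log(naiveMatches("outhouse")); // ["thou"] (a false positive)
```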
A very complicated solution would be to use the phonemic mode’s dictionary to source the pronunciation, then somehow automatically map the orthography back to that so Tecendil “knows” if an S is voiced, how every TH is pronounced, which final Es aren’t silent, etc. This would make much of the "words" list unnecessary… if it were possible!
Thinking it through, I suppose it isn't that complicated a system if you had a list of every phoneme a letter could individually represent in English, and the same for every multigraph. I wonder if such an API already exists? Heteronyms like “abuse” would still be a problem, as they are now… so Tecendil had better map out the supplied sentence too, haha!
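Something like the sketch below is what I have in mind: a very partial, hand-picked table of which phonemes each letter or multigraph can stand for, plus a greedy segmenter. Every entry is illustrative and the table isn’t taken from any real dataset.

```typescript
// Very partial, hand-picked sketch of a letter/multigraph-to-candidate-phoneme
// table. Real coverage would need far more entries plus an alignment step
// against the dictionary pronunciation to pick the right candidate.
const candidates: Record<string, string[]> = {
  th: ["θ", "ð", "t"],       // thin, this, thyme
  ea: ["iː", "ɛ", "eɪ"],     // each, bread, great
  ch: ["tʃ", "k", "ʃ"],      // chip, ache, machine
  s: ["s", "z", "ʃ"],        // sit, dogs, sugar
  e: ["ɛ", "iː", "ə", ""],   // bed, be, item, silent e
  a: ["æ", "eɪ", "ə", "ɑː"], // cat, late, about, father
  t: ["t", ""],              // top, castle
  r: ["ɹ", ""],              // red, (non-rhotic) car
};

// Greedily prefer two-letter multigraphs, then list the phonemes each chunk
// *could* represent; choosing between them still needs the pronunciation.
function possibleReadings(word: string): Array<[string, string[]]> {
  const out: Array<[string, string[]]> = [];
  let i = 0;
  while (i < word.length) {
    const two = word.slice(i, i + 2);
    if (candidates[two]) {
      out.push([two, candidates[two]]);
      i += 2;
    } else {
      out.push([word[i], candidates[word[i]] ?? ["?"]]);
      i += 1;
    }
  }
  return out;
}

console.log(possibleReadings("theatres"));
// [["th", [...]], ["ea", [...]], ["t", [...]], ["r", [...]], ["e", [...]], ["s", [...]]]
```

Even this toy version shows the catch: the greedy split treats the “ea” in “theatres” as one unit, while the actual pronunciation splits it, so the dictionary (or something cleverer) still has to arbitrate.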
This is a problem called "lemmatization". Ultimately, because of the peculiarities of English, it requires a word list/dictionary of some kind. I'm not sure how to improve on the current solution, which is to have a specific word list in the mode file.
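In rough terms, dictionary-backed lemmatization looks like the sketch below; the word list and suffix rules are made up for illustration and this isn't the actual mode-file format or matching code.

```typescript
// Toy sketch of lemmatization against a word list. The list and the suffix
// rules are invented for illustration; the real mode files work differently.
const wordList = new Set(["ache", "theatre"]);

// Candidate suffix rewrites, tried in order; a stem only counts if the
// word list actually contains it.
const suffixRules: Array<[RegExp, string]> = [
  [/ies$/, "y"], // ponies -> pony
  [/ing$/, "e"], // aching -> ache
  [/ing$/, ""],  // walking -> walk
  [/ed$/, "e"],  // ached -> ache
  [/ed$/, ""],   // walked -> walk
  [/es$/, ""],   // boxes -> box
  [/s$/, ""],    // theatres -> theatre
];

function lemma(word: string): string | undefined {
  if (wordList.has(word)) return word;
  for (const [pattern, replacement] of suffixRules) {
    if (!pattern.test(word)) continue;
    const stem = word.replace(pattern, replacement);
    if (wordList.has(stem)) return stem;
  }
  return undefined; // unknown: fall back to the normal transcription rules
}

console.log(lemma("aches"), lemma("aching"), lemma("theatres"), lemma("outhouses"));
// "ache" "ache" "theatre" undefined
```

Even with rules like these, the word list is doing the real work: a stem is only accepted if the list knows it, so the list itself doesn't go away.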
Apparently what I was trying to describe is called grapheme-phoneme mapping. This kids’ site demonstrates it well.
The best solution would be a dictionary API that provided that… but I’ve been unable to find one. The best I’ve found is this library from another project, but it only has 15,629 words, and that includes inflected forms. It might be enough, but I’d imagine not.
It seems a lot of study has gone into automating this mapping, generally for speech-to-text, so it is all way over my head!
What you are describing is what the English Phonemic mode of Tecendil does. It has a dictionary with 135,086 entries that is used to map the graphemes to phonemes. The phonemic transcription is displayed below the input. See for example https://www.tecendil.com/?q=ache%20aching&mode=english-phonemic
I understand that, I’m suggesting something for the reverse: using grapheme-phoneme mapping to replace an ever-increasing exception list for the (English) orthographic modes.
(Please be aware I’ve been suggesting pie in the sky options, I really don’t expect you to apply them, but I’m hoping the discussion might trigger some useful—and simpler—ideas!)
For example, consider the word theatres. The library I previously linked to doesn’t have the plural form, so I’ll amend the singular entry to th-θ,e-i,a-ə,t-t,re-ə,s-z SUFFIX_ADD s. From this entry Tecendil, without a list of exceptions and with fewer rules, could infer that (a parsing sketch follows this list):
- <th> is the voiceless dental fricative /θ/, and not /ð/, /t/, /tθ/ or /t.h/
- <ea> is not a diphthong and so it might be more appropriate to split the vowels
- while the <r> is followed by a non-final <e>, it isn’t pronounced, so óre is appropriate (knowing when <r> is part of the final phoneme and is followed by a word starting with a vowel phoneme would be handy for linking-r too)
- <s> is voiced and is a suffix, useful to know for modes that only use hooks for inflectional forms; in this case a looped za-rince could be used
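Here’s roughly how an entry in that format could be parsed and queried; the format itself is just my invention from above, and none of this reflects Tecendil’s or the library’s actual code.

```typescript
// Sketch of parsing the invented entry format "g1-p1,g2-p2,... SUFFIX_ADD x".
interface ParsedEntry {
  pairs: Array<{ grapheme: string; phoneme: string }>;
  suffix?: string;
}

function parseEntry(entry: string): ParsedEntry {
  const [mapping, ...directives] = entry.split(" ");
  const pairs = mapping.split(",").map((chunk) => {
    const [grapheme, phoneme] = chunk.split("-");
    return { grapheme, phoneme };
  });
  const suffixIndex = directives.indexOf("SUFFIX_ADD");
  const suffix = suffixIndex >= 0 ? directives[suffixIndex + 1] : undefined;
  return { pairs, suffix };
}

const theatres = parseEntry("th-θ,e-i,a-ə,t-t,re-ə,s-z SUFFIX_ADD s");

// The inferences from the list above, read straight off the entry:
const thIsVoiceless = theatres.pairs[0].phoneme === "θ";                // true
const rIsNotPronounced = theatres.pairs
  .filter((p) => p.grapheme.includes("r"))
  .every((p) => !p.phoneme.includes("ɹ"));                              // true
const lastPair = theatres.pairs[theatres.pairs.length - 1];
const sIsVoicedSuffix = lastPair.phoneme === "z" && theatres.suffix === "s"; // true
console.log(thIsVoiceless, rIsNotPronounced, sIsVoicedSuffix);
```

The last check is the piece a hook-only mode would need for the za-rince decision.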
I hope that better explains some of the advantages of such a system. On to the issues: few, but major!
- If a comprehensive library is found it will most likely only be for English, so that means a lot of work changing the Tecendil engine just for one language.
- Just like the existing phonemic library, it will only represent one form of pronunciation (I believe the existing library used by Tecendil is General American?). For orthographic use this isn’t that big of an issue as it is mostly the vowels that vary between accents, but there are cases of consonant differences (especially voiceless forms becoming voiced or vice-versa).
- Heteronyms are still an issue without a much more complicated system involving sentence mapping.
- Any words not in the library would have to fall back to the existing system so you really couldn’t shrink the list of rules and you’d still need an exception list for things like Sindarin words (if you couldn’t simply extend the library).