cainteoir-engine
create a text matcher class based on regular expressions
The tests/dictionary.py script has very simple regular expression expansion logic, placing limits on where you can place [ab] and (a|b) expressions.
This limits what can be expressed. For example, A(d|dd)[iy]son and A(d|dd)[iy]syn should be expressed as A(d|dd)[iy]s[oy]n, or, with support for the ? operator, as Add?[iy]s[oy]n.
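As a quick check of the example above, the following sketch uses Python's standard re module (not the dictionary script's own expansion logic) to confirm that the two merged forms accept the same eight spellings as the two separate patterns. The word list is written out by hand for illustration.

```python
import re

# The two separate dictionary patterns and the two proposed merged forms.
separate = [r"A(d|dd)[iy]son", r"A(d|dd)[iy]syn"]
merged = r"A(d|dd)[iy]s[oy]n"
with_optional = r"Add?[iy]s[oy]n"

# Every spelling the two separate patterns cover (2 * 2 * 2 combinations).
words = ["Adison", "Addison", "Adyson", "Addyson",
         "Adisyn", "Addisyn", "Adysyn", "Addysyn"]

def matches_any(patterns, word):
    """True if the whole word matches at least one of the patterns."""
    return any(re.fullmatch(p, word) for p in patterns)

for word in words:
    assert matches_any(separate, word)
    assert re.fullmatch(merged, word)
    assert re.fullmatch(with_optional, word)

# A spelling outside the set is rejected by the merged forms.
assert not re.fullmatch(merged, "Adddison")
assert not re.fullmatch(with_optional, "Adddison")
```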
To work properly, this requires a real regular expression parser that generates a Match -> Phonemes object model, where Match is a RegularExpression matcher with subnodes for the different regular expression constructs.
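A minimal sketch of such a parser, covering only the Match side of the model and only the constructs mentioned in this issue (literals, [ab], (a|b) and ?). The node class names (Literal, CharClass, ZeroOrOne, Sequence, Alternation) and the parse function are assumptions made for illustration, not names taken from cainteoir-engine.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Literal:            # a single literal character
    text: str

@dataclass(frozen=True)
class CharClass:          # [ab]
    chars: str

@dataclass(frozen=True)
class ZeroOrOne:          # x?
    node: object

@dataclass(frozen=True)
class Sequence:           # nodes matched one after another
    nodes: Tuple[object, ...]

@dataclass(frozen=True)
class Alternation:        # (a|b)
    options: Tuple[object, ...]

def parse(pattern):
    """Parse the whole pattern into a tree of the node classes above."""
    node, pos = _parse_alternation(pattern, 0)
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at offset {pos}")
    return node

def _parse_alternation(pattern, pos):
    options = []
    while True:
        node, pos = _parse_sequence(pattern, pos)
        options.append(node)
        if pos < len(pattern) and pattern[pos] == "|":
            pos += 1
        else:
            break
    # A single option needs no Alternation wrapper.
    return (options[0] if len(options) == 1 else Alternation(tuple(options))), pos

def _parse_sequence(pattern, pos):
    nodes = []
    while pos < len(pattern) and pattern[pos] not in "|)":
        ch = pattern[pos]
        if ch == "(":
            node, pos = _parse_alternation(pattern, pos + 1)
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
        elif ch == "[":
            end = pattern.index("]", pos)  # raises ValueError if unterminated
            node, pos = CharClass(pattern[pos + 1:end]), end + 1
        else:
            node, pos = Literal(ch), pos + 1
        # A trailing ? makes the node just parsed optional.
        if pos < len(pattern) and pattern[pos] == "?":
            node, pos = ZeroOrOne(node), pos + 1
        nodes.append(node)
    return Sequence(tuple(nodes)), pos

tree = parse("Add?[iy]s[oy]n")
```

Here tree is a Sequence whose third node is ZeroOrOne(Literal("d")). Escapes, character ranges and repeaters are left out of this sketch.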
Going beyond matching whole words, the rules found in eSpeak and the NRL Report 7948 use the form PreRule Rule PostRule, where each of the PreRule, Rule and PostRule can be a String, RegularExpression or EspeakExpression, where EspeakExpression uses the rule syntax defined by the eSpeak rule engine.
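The PreRule Rule PostRule form can be sketched as follows, assuming all three parts are given as regular expressions and using Python's re module in place of the proposed matcher. The ContextRule class, its field names and the example rule are hypothetical; the rule and its phoneme output are made up for illustration and are not taken from eSpeak or the NRL report.

```python
import re
from dataclasses import dataclass

@dataclass
class ContextRule:
    pre: str       # must match the text that ends where the rule starts
    rule: str      # matches the letters being converted
    post: str      # must match the text that follows the rule
    phonemes: str  # placeholder output, not real phoneme data

    def match(self, text, pos):
        """Return the end offset of the rule match at pos, or None."""
        m = re.compile(self.rule).match(text, pos)
        if m is None:
            return None
        # The pre-context is anchored to the end of the text before pos.
        if re.search(f"(?:{self.pre})\\Z", text[:pos]) is None:
            return None
        # The post-context is anchored to the end of the rule match.
        if re.compile(self.post).match(text, m.end()) is None:
            return None
        return m.end()

# Illustrative rule only: "c" followed by e, i or y.
soft_c = ContextRule(pre="", rule="c", post="[eiy]", phonemes="s")

assert soft_c.match("cell", 0) == 1      # post-context matches "e"
assert soft_c.match("call", 0) is None   # post-context fails on "a"
```

This sketch checks the contexts against only the first match of the rule part, so it does not backtrack across the three parts; a common state machine could handle that uniformly.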
These should all be expressed in a common pattern state machine based on simple matching rules.
The Matcher should support a match operation to check for a match in the given string for use in the text to phoneme converter.
The Matcher should also be able to iterate over all permutations of strings that will match the underlying Matcher object, which is what the dictionary expansion used by the tests/dictionary.py script does when generating the exception dictionary.
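The two operations can be sketched on one small object model. This is a sketch under assumptions, not the cainteoir-engine API: the class names, the ends and expand methods and the permutations method name are all hypothetical, and the tree for Add?[iy]s[oy]n is built by hand rather than parsed.

```python
from itertools import product

class Literal:
    def __init__(self, text):
        self.text = text
    def ends(self, s, pos):
        # Offsets at which a match starting at pos can end.
        return {pos + len(self.text)} if s.startswith(self.text, pos) else set()
    def expand(self):
        yield self.text

class Choice:
    """Alternatives: covers (a|b), [ab] and, with an empty Literal, x?."""
    def __init__(self, *options):
        self.options = options
    def ends(self, s, pos):
        return set().union(*(o.ends(s, pos) for o in self.options))
    def expand(self):
        for option in self.options:
            yield from option.expand()

class Sequence:
    def __init__(self, *nodes):
        self.nodes = nodes
    def ends(self, s, pos):
        # Track every offset reachable after each node in turn.
        positions = {pos}
        for node in self.nodes:
            positions = {e for p in positions for e in node.ends(s, p)}
        return positions
    def expand(self):
        for parts in product(*(list(n.expand()) for n in self.nodes)):
            yield "".join(parts)

class Matcher:
    def __init__(self, root):
        self.root = root
    def match(self, s):
        """True if the whole string matches."""
        return len(s) in self.root.ends(s, 0)
    def permutations(self):
        """All strings that match."""
        return list(self.root.expand())

# Hand-built tree for Add?[iy]s[oy]n.
addison = Matcher(Sequence(
    Literal("A"), Literal("d"), Choice(Literal("d"), Literal("")),
    Choice(Literal("i"), Literal("y")), Literal("s"),
    Choice(Literal("o"), Literal("y")), Literal("n")))

assert addison.match("Addyson")
assert not addison.match("Adddison")
assert len(addison.permutations()) == 8
```

Because ends takes a start offset and returns end offsets, the same nodes could be used to match at an arbitrary position inside a longer string, which is what the text to phoneme converter would need.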
The tests/dictionary.py file now parses the regular expression into a regular expression matcher object model that it uses to iterate over the matching permutations.
It does not currently support repeaters (e.g. no+ for n(o|oo|ooo|oooo|...)).
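A repeater such as + matches an unbounded set of strings, so iterating over its permutations needs a cap on the repeat count. A minimal sketch of that idea, where expand_repeat and its max_repeat parameter are hypothetical and not part of tests/dictionary.py:

```python
def expand_repeat(prefix, unit, max_repeat):
    """Yield prefix followed by 1..max_repeat copies of unit (the x+ case)."""
    for count in range(1, max_repeat + 1):
        yield prefix + unit * count

# no+ expanded up to four repetitions of "o".
print(list(expand_repeat("n", "o", 4)))
# ['no', 'noo', 'nooo', 'noooo']
```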
This should be built on top of the FSM in issue #21.