create a text matcher class based on regular expressions

Open rhdunn opened this issue 13 years ago • 2 comments

The tests/dictionary.py script has very simple regular expression expansion logic, placing limits on where you can place [ab] and (a|b) expressions.

This limits what can be expressed. For example, A(d|dd)[iy]son and A(d|dd)[iy]syn should be expressible as the single pattern A(d|dd)[iy]s[oy]n, or, with support for the ? operator, as Add?[iy]s[oy]n.
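As a quick sanity check that these forms really describe the same word set, the sketch below compares them using Python's standard `re` module. The candidate word list and the `accepted` helper are illustrative assumptions; none of this comes from tests/dictionary.py.

```python
import re

# Candidate spellings, plus two that none of the patterns should accept.
words = ["Adison", "Adisyn", "Adyson", "Adysyn",
         "Addison", "Addisyn", "Addyson", "Addysyn",
         "Adddison", "Adisan"]

# The two patterns the current expansion logic requires ...
separate = [re.compile(r"A(d|dd)[iy]son"), re.compile(r"A(d|dd)[iy]syn")]
# ... and the two merged forms proposed in this issue.
merged = [re.compile(r"A(d|dd)[iy]s[oy]n")]
optional = [re.compile(r"Add?[iy]s[oy]n")]

def accepted(patterns, candidates):
    """Return the candidates that fully match at least one pattern."""
    return {w for w in candidates if any(p.fullmatch(w) for p in patterns)}

# All three pattern sets accept the same eight spellings.
print(accepted(separate, words) == accepted(merged, words) == accepted(optional, words))
```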

Doing this properly requires a real regular expression parser that generates a Match -> Phonemes object model, where Match is a RegularExpression matcher with subnodes for the different regular expression constructs.

Going beyond matching whole words, the rules found in eSpeak and the NRL Report 7948 use the form PreRule Rule PostRule, where each of the PreRule, Rule and PostRule can be a String, RegularExpression or EspeakExpression, where EspeakExpression uses the rule syntax defined by the eSpeak rule engine.
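A rough sketch of the PreRule Rule PostRule shape, assuming all three parts are given as Python `re` patterns (String and EspeakExpression parts are not modelled here). The `ContextRule` class, its `match_at` method, and the two example rules are hypothetical illustrations; they are not taken from eSpeak or from the NRL report.

```python
import re
from dataclasses import dataclass

@dataclass
class ContextRule:
    pre: str       # pattern that must match the text ending at the rule position
    rule: str      # pattern for the letters that are consumed
    post: str      # pattern that must match the text following the consumed letters
    phonemes: str  # output for the consumed letters (placeholder values below)

    def match_at(self, text, pos):
        """Return the end offset of the consumed letters if the rule applies
        at pos, otherwise None."""
        body = re.compile(self.rule).match(text, pos)
        if body is None:
            return None
        # The left context must end exactly where the rule body starts.
        if re.search("(?:" + self.pre + r")\Z", text[:pos]) is None:
            return None
        # The right context must start exactly where the rule body ends.
        if re.compile(self.post).match(text, body.end()) is None:
            return None
        return body.end()

# Made-up example rules, for illustration only.
soft_c = ContextRule(pre="", rule="c", post="[eiy]", phonemes="s")
voiced_s = ContextRule(pre="[aeiou]", rule="s", post="[aeiou]", phonemes="z")

print(soft_c.match_at("city", 0))    # rule applies, consumes one letter
print(soft_c.match_at("cat", 0))     # right context fails
print(voiced_s.match_at("rose", 2))  # both contexts match
```

Only the Rule part consumes input; the contexts are checked but left in place, so the next rule can start at the returned offset.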

These should all be expressed in a common pattern state machine based on simple matching rules.

The Matcher should support a match operation to check for a match in the given string for use in the text to phoneme converter.

The Matcher should also be able to iterate over all permutations of strings that will match the underlying Matcher object, which is what the dictionary expansion used by the tests/dictionary.py script does when generating the exception dictionary.
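One possible shape for such a matcher is sketched below: a small recursive-descent parser builds a node tree, and each node supports both matching and expansion. It handles only literals, [ab], (a|b) and ?. The class and function names (`Literal`, `Choice`, `Sequence`, `Optional`, `parse`, `fullmatch`) are assumptions made for this sketch and are not the names used in tests/dictionary.py.

```python
from itertools import product

class Literal:
    def __init__(self, text): self.text = text
    def expand(self): yield self.text
    def match(self, text, pos):
        if text.startswith(self.text, pos):
            yield pos + len(self.text)

class Choice:
    def __init__(self, options): self.options = options
    def expand(self):
        for option in self.options:
            yield from option.expand()
    def match(self, text, pos):
        for option in self.options:
            yield from option.match(text, pos)

class Optional:
    def __init__(self, item): self.item = item
    def expand(self):
        yield from self.item.expand()
        yield ""
    def match(self, text, pos):
        yield from self.item.match(text, pos)
        yield pos

class Sequence:
    def __init__(self, items): self.items = items
    def expand(self):
        for parts in product(*(list(i.expand()) for i in self.items)):
            yield "".join(parts)
    def match(self, text, pos, index=0):
        if index == len(self.items):
            yield pos
            return
        for mid in self.items[index].match(text, pos):
            yield from self.match(text, mid, index + 1)

def _parse_alternation(pattern, pos):
    options = []
    while True:
        seq, pos = _parse_sequence(pattern, pos)
        options.append(seq)
        if pos < len(pattern) and pattern[pos] == "|":
            pos += 1
        else:
            break
    return (options[0] if len(options) == 1 else Choice(options)), pos

def _parse_sequence(pattern, pos):
    items = []
    while pos < len(pattern) and pattern[pos] not in "|)":
        ch = pattern[pos]
        if ch == "(":
            node, pos = _parse_alternation(pattern, pos + 1)
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
        elif ch == "[":
            end = pattern.index("]", pos)  # raises ValueError if unclosed
            node = Choice([Literal(c) for c in pattern[pos + 1:end]])
            pos = end + 1
        else:
            node = Literal(ch)
            pos += 1
        if pos < len(pattern) and pattern[pos] == "?":
            node = Optional(node)
            pos += 1
        items.append(node)
    return Sequence(items), pos

def parse(pattern):
    """Parse a pattern into a tree of matcher nodes."""
    node, pos = _parse_alternation(pattern, 0)
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at offset {pos}")
    return node

def fullmatch(node, text):
    """True if the whole of text is matched by node."""
    return any(end == len(text) for end in node.match(text, 0))

matcher = parse("Add?[iy]s[oy]n")
print(sorted(matcher.expand()))       # the eight spellings
print(fullmatch(matcher, "Adyson"))   # True
```

Because `match` yields every possible end offset rather than a single result, the same nodes could be reused for prefix matching inside a text to phoneme converter, while `expand` covers the dictionary expansion case.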

rhdunn, Sep 24 '11 11:09

The tests/dictionary.py file now parses the regular expression into a regular expression matcher object model that it uses to iterate over the matching permutations.

It does not currently support repeaters (e.g. no+ for n(o|oo|ooo|oooo|...)).
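Repeaters are awkward for permutation expansion because the set of matching strings is infinite. One way to handle that, sketched here as an assumption rather than a description of the script, is to generate repetitions lazily in order of increasing count and let the caller bound them. The `expand_repeat` function and its parameters are hypothetical.

```python
from itertools import count, islice, product

def expand_repeat(alternatives, minimum=1, maximum=None):
    """Yield every concatenation of minimum..maximum items drawn from
    alternatives, shortest repetition count first.

    With maximum=None the generator never terminates, so the caller must
    bound it, for example with itertools.islice.
    """
    counts = count(minimum) if maximum is None else range(minimum, maximum + 1)
    for n in counts:
        for parts in product(alternatives, repeat=n):
            yield "".join(parts)

# no+ limited to the first four expansions.
print(["n" + tail for tail in islice(expand_repeat(["o"]), 4)])
# ['no', 'noo', 'nooo', 'noooo']

# (a|b){1,2} written as an explicit bounded repeat.
print(list(expand_repeat(["a", "b"], 1, 2)))
# ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```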

rhdunn, Nov 03 '11 10:11

This should be built on top of the FSM in issue #21.

rhdunn, Sep 19 '12 15:09