link-grammar
Implement affix regexes
This patch implements affix regexes. The idea is to be able to strip off affixes and split words according to regexes.
I initially found it useful for `amy`/`ady`/`any` (one of the PRs I would like to send next).
We still need to see whether this feature is really useful for real languages and whether its implementation needs changes or extensions. Hence I labeled it as "experimental" in `en/4.0.affix`.
In this PR I defined MPUNC regexes for `en`. I didn't have any particularly useful ideas for LPUNC/RPUNC (other than for `amy`/`ady`/`any`).
To `en/corpus-fixes.batch` I added the example sentences from the comments in `en/4.0.affix` as sentences that should be left untouched by these MPUNC regexes, and added two sentences demonstrating what the MPUNC regexes can do.
These sentences need a review and maybe more sentences should be added.
I used simple regexes, but maybe more complex ones are needed to prevent bad splits.
In order to allow using regex libraries that don't support look-around, I added the ability to use a capture group to indicate the location of the affix. Since the subscript of regex affixes is not generally useful (besides using it to prevent converting a dot in the regex to SUBSCRIPT_MARK), I used it to denote this capture group, as follows:
/regex/.\N
where N is a digit 0..9.
E.g., `"/[^0-9]([,:])/.\1"` means that capture group 1 matches the affix. When defined as MPUNC, it separates the `[,:]` punctuation marks when they don't follow a digit. (An alternative syntax that is harder to parse could be implemented: `/regex/\1/`. In addition, the dict code that converts to SUBSCRIPT_MARK could recognize and skip it, so subscripting it would not be needed. I chose to implement the subscript syntax for simplicity.)
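To make the capture-group idea concrete, here is a minimal Python sketch. It is not the code in this PR (which is C inside the tokenizer); `split_on_group` is a hypothetical helper, and the sketch only shows how the span of a capture group can stand in for look-behind when locating the affix.

```python
import re

def split_on_group(word, pattern, group):
    """Split `word` around the span matched by capture group `group`.

    Hypothetical helper for illustration only; returns [word] unchanged
    when the pattern (or the requested group) does not match.
    """
    m = re.search(pattern, word)
    if m is None or m.group(group) is None:
        return [word]
    start, end = m.span(group)
    # Keep the text before the affix, the affix itself, and the rest,
    # dropping any empty piece.
    pieces = [word[:start], word[start:end], word[end:]]
    return [p for p in pieces if p]

# The regex from the example above; group 1 marks the affix.
MPUNC_PATTERN = r"[^0-9]([,:])"

print(split_on_group("foo,bar", MPUNC_PATTERN, 1))
print(split_on_group("1,000", MPUNC_PATTERN, 1))
```

In a regex library that does support look-behind, the same positions could be matched without a capture group by something like `(?<=[^0-9])[,:]`; the group-based form gives the same affix span while using only basic regex features.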
The first group of commits (6 commits) is not directly related. I can separate it if desired (all of it but the tests.py patch should be applied first because it touches the same code).
While digging in the stripping/splitting code I came up with an idea to (hopefully) speed it up significantly, which I would like to implement next.
- Just force-pushed to fix a memory leak found by Valgrind. Strangely, LSAN does not report it.
- See #1334 for my posts about affix regexes (but note that the regexes I mention there are all wrong...).
Thanks! I read through the code and have no particular comments.