
Implement affix regexes

Open ampli opened this issue 2 years ago • 1 comments

This patch implements affix regexes. The idea is to be able to strip off affixes and split words according to regexes.

I initially found it useful for amy/ady/any (one of the PRs I would like to send next). We still need to see whether this feature is really useful for real languages and whether its implementation needs changes or extensions. Hence I labeled it in en/4.0.affix as "experimental". In this PR I defined MPUNC regexes for en. I didn't have any particularly useful ideas for LPUNC/RPUNC (other than for amy/ady/any).

To en/corpus-fixes.batch I added the example sentences from the comments in en/4.0.affix as sentences that these MPUNC regexes should leave alone, and added two sentences demonstrating what the MPUNC regexes can do. These sentences need a review, and maybe more sentences should be added. I used simple regexes, but more complex ones may be needed to prevent bad splits.

In order to allow using regex libraries that don't support look-around, I added the ability to use a capture group to indicate the location of the affix. Since the subscript of a regex affix is not generally useful (other than to prevent a dot in the regex from being converted to SUBSCRIPT_MARK), I used it to denote this capture group, as follows: /regex/.\N, where N is a digit 0..9. E.g., "/[^0-9]([,:])/.\1" means that capture group 1 matches the affix. When defined as MPUNC, it separates the [,:] punctuation characters when they don't follow a digit. (An alternative syntax that is harder to parse could be implemented: /regex/\1/. In addition, the dict code that converts to SUBSCRIPT_MARK could recognize and skip the regex, so subscripting it would not be needed. I chose to implement the subscript syntax for simplicity.)
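To illustrate the capture-group idea, here is a minimal Python sketch. It is not the code in this patch (which is C); the function names `parse_affix_regex` and `split_mpunc` are hypothetical, and it only splits at the first match, which may differ from what the real tokenizer does. It just shows how the `.\N` subscript can select the part of the match that is treated as the affix:

```python
import re

def parse_affix_regex(spec):
    """Split a spec of the form /regex/.\\N into (regex, N).

    Hypothetical helper: N defaults to 0 (the whole match) when the
    .\\N subscript is absent.
    """
    m = re.fullmatch(r"/(.*)/(?:\.\\([0-9]))?", spec)
    if m is None:
        raise ValueError(f"not an affix regex: {spec!r}")
    return m.group(1), int(m.group(2) or 0)

def split_mpunc(word, spec):
    """Split word around the first affix located by capture group N.

    Assumption: a real tokenizer may handle repeated matches and
    alternative splits; this sketch does neither.
    """
    pattern, group = parse_affix_regex(spec)
    m = re.search(pattern, word)
    if m is None:
        return [word]          # no match: leave the word unsplit
    start, end = m.span(group)  # span of the affix only, not the context
    parts = [word[:start], word[start:end], word[end:]]
    return [p for p in parts if p]

SPEC = r"/[^0-9]([,:])/.\1"
print(split_mpunc("this,that", SPEC))  # comma follows a letter: split
print(split_mpunc("1,000", SPEC))      # comma follows a digit: unsplit
```

With a library that supports look-behind, the same affix position could instead be found with a pattern such as `(?<=[^0-9])[,:]` and no capture group; the `.\N` form avoids depending on that feature.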

The first group of commits (6 commits) is not directly related. I can separate it if desired (all of it but the tests.py patch should be applied first because it touches the same code).

While digging into the stripping/splitting code I found an idea for (hopefully) significantly speeding it up, which I would like to implement next.

ampli, Aug 08 '22 22:08

  1. Just force-pushed to fix a memory leak found by Valgrind. Strangely, LSAN does not report it.
  2. See #1334 for my posts about affix regexes (but note that the regexes I mention there are all wrong...).

ampli, Aug 09 '22 18:08

Thanks! I read through the code; I have no particular comments.

linas, Aug 12 '22 14:08