link-grammar Strippable affix class regexes

I finished implementing and testing it; here are the examples I used:

% TODO: this list should be expanded with other "typical"(?) junk
% that is commonly (?) in broken texts.
-- ‒ – — ― "(" ")" "[" "]" ... ";" ±: MPUNC+;
% Split on commas, but be careful with numbers:
% "The enzyme has a weight of 125,000 to 130,000"
% Also split on colons, but be careful not to mess up time
% expressions: "The train arrives at 13:42"
"/(?<!\d)[,:]|[,:](?!\d)/": MPUNC+;

In corpus-fixes.batch:

% Test tokenization by affix regexes.
% Sentence that should not be affected.
The enzyme has a weight of 125,000 to 130,000
The train arrives at 13:42
% Sentences that use punctuation without a trailing whitespace.
We used the same colors (red,blue,yellow).
The price of this item:$100
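
To illustrate what the MPUNC regex above is meant to do to these test words, here is a minimal sketch in Python. It is not link-grammar's tokenizer code; `split_mpunc` is a hypothetical helper, and Python's `re` module is used only because it supports lookbehind the way PCRE2 does.

```python
import re

# Same pattern as the MPUNC affix regex above: split on "," or ":"
# unless the character sits between two digits.
MPUNC_RE = re.compile(r"(?<!\d)[,:]|[,:](?!\d)")

def split_mpunc(word):
    """Hypothetical helper: split a word at every MPUNC regex match."""
    tokens = []
    pos = 0
    for m in MPUNC_RE.finditer(word):
        if m.start() > pos:
            tokens.append(word[pos:m.start()])  # text before the punctuation
        tokens.append(m.group())                # the punctuation itself
        pos = m.end()
    if pos < len(word):
        tokens.append(word[pos:])               # remaining tail, if any
    return tokens

for w in ["125,000", "13:42", "(red,blue,yellow).", "item:$100"]:
    print(w, "->", split_mpunc(w))
```

The two number/time words come back unsplit, while the comma and colon in the last two words are separated out as their own tokens.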

LPUNC and RPUNC also support regexes; I tested them in amy with /^[[:punct:]]/ and /[[:punct:]]$/, respectively.

However, there is a problem: this works only when configured with PCRE2. When configured with C++, compilation of the lookbehind regex fails (lookbehind is not supported by C++). POSIX regexes (C library and TRE) also fail. (This is not really a problem for amy etc., since we don't need to support other regex libraries there.)

Possible solutions:

1. Distribute it with commented-out affix regexes and that's all.
2. Use autoconf to enable PCRE2 regexes if configured with PCRE2.
3. Add configuration file support for '#if SOMETHING' when SOMETHING is HAVE_PCRE2_H.
4. Only support PCRE2 on POSIX systems. (BTW, it is now easy for me to add PCRE2 support on MS-Windows too.)
5. Add support for regex library specification (easy to implement): "/(?<!\d)[,:]|[,:](?!\d)/PCRE2" (or even flag "e" for "extended").

I am for (5) and otherwise for (2) or (1).
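
As a rough illustration of option (5), the sketch below separates a trailing library tag from an affix regex entry. It is written in Python for brevity and is not the link-grammar implementation; the function name `parse_affix_regex` and the `"default"` fallback value are assumptions made for this example.

```python
def parse_affix_regex(entry, default_engine="default"):
    """Hypothetical parser: split '/pattern/TAG' into (pattern, engine).

    The tag after the closing slash (e.g. "PCRE2") names the regex
    library; if it is absent, default_engine is returned instead.
    """
    if not entry.startswith("/"):
        raise ValueError("affix regex must start with '/'")
    close = entry.rfind("/")          # last slash closes the pattern
    if close == 0:
        raise ValueError("affix regex has no closing '/'")
    pattern = entry[1:close]
    engine = entry[close + 1:] or default_engine
    return pattern, engine

print(parse_affix_regex(r"/(?<!\d)[,:]|[,:](?!\d)/PCRE2"))
print(parse_affix_regex("/[[:punct:]]$/"))
```

With a tag like this, a dictionary reader could skip entries whose library is not compiled in, rather than failing when the regex does not compile.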

Jul 31 '22 02:07 ampli

I am for (5) and otherwise for (2) or (1).

EDIT: Fix the POSIX regex.

I found a better solution that all the regex libraries support: instead of lookahead/lookbehind, use a capture group for the matching part. E.g., instead of "/(?<!\d)[,:]|[,:](?!\d)/", use a POSIX regex: "/\d([,:]|[,:])\d/"

I will change the code to support this too.

EDIT yet again: "/\D([,:]|[,:])\D/"

EDIT: \D didn't work for me, but [^^d] did.
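
The capture-group idea can be sketched as follows. This is only an illustration in Python: the pattern is my own formulation using nothing beyond POSIX ERE syntax, not the exact regex from the edits above, and `split_by_capture` is a hypothetical helper. The neighbouring non-digit (or the string boundary) is matched as context, and only the captured punctuation is treated as the split point.

```python
import re

# Assumed pattern: group 2 is punctuation preceded by a non-digit or the
# start of the word; group 3 is punctuation followed by a non-digit or
# the end of the word. Groups 1 and 4 are context only.
CAPTURE_RE = re.compile(r"(^|[^0-9])([,:])|([,:])([^0-9]|$)")

def split_by_capture(word):
    """Hypothetical helper: split at the captured punctuation, then recurse."""
    if not word:
        return []
    m = CAPTURE_RE.search(word)
    if m is None:
        return [word]
    group = 2 if m.group(2) is not None else 3
    start, end = m.span(group)        # span of the punctuation only
    return (split_by_capture(word[:start])
            + [word[start:end]]
            + split_by_capture(word[end:]))

for w in ["125,000", "13:42", "(red,blue,yellow).", "item:$100"]:
    print(w, "->", split_by_capture(w))
```

Recursing on the pieces avoids the problem that a context character consumed by one match cannot take part in the next one.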

Jul 31 '22 14:07 ampli

@linas, to solve the split problem you pointed out in your comment on MPUNC, I implemented an MPUNC regex mechanism that uses lookahead/lookbehind (directly or indirectly) in an attempt not to split numbers with commas or times with colons. It works.

However, it seems there is a simpler solution that doesn't use a regex affix: use : and , in MPUNC, and just don't MPUNC-split words that match a regex (in contrast to morpheme-split, which is done before trying a regex). I will try to implement that and, for now, leave the MPUNC-regex in use for the sake of any/ady/amy (as a simple split on [[:punct:]]).

Another thing: the corpus test sentences I used are not good enough. If they are not split as intended, they still parse fine, because the word with an internal colon or comma is looked up as UNKNOWN-WORD, and it is hard to find sentences that fail to parse in that case. This is a general problem that causes sentences with junk to get parsed just fine. Does this need a solution? If so, should we just have a regex category JUNK with no possible linkage, for words with junk in them?

Jul 31 '22 14:07 ampli

I said above:

just don't MPUNC-split words that match a regex [...] I will try to implement that, [...]

If a word contains two kinds of punctuation, one that has to be separated and one that should not, this approach wouldn't work, since the word as a whole can only either match or not match a regex. So I will send the PR that splits by affix regexes.

EDIT:

just don't MPUNC-split words that match a regex [...]

This was a bad idea since in general such matches have nothing to do with word splits.

Aug 02 '22 00:08 ampli
