link-grammar
Strippable affix class regexes
I finished implementing and testing it, and here are the examples I used:
% TODO: this list should be expanded with other "typical"(?) junk
% that is commonly (?) in broken texts.
-- ‒ – — ― "(" ")" "[" "]" ... ";" ±: MPUNC+;
% Split on commas, but be careful with numbers:
% "The enzyme has a weight of 125,000 to 130,000"
% Also split on colons, but be careful not to mess up time
% expressions: "The train arrives at 13:42"
"/(?<!\d)[,:]|[,:](?!\d)/": MPUNC+;
In corpus-fixes.batch:
% Test tokenization by affix regexes.
% Sentence that should not be affected.
The enzyme has a weight of 125,000 to 130,000
The train arrives at 13:42
% Sentences that use punctuation without a trailing whitespace.
We used the same colors (red,blue,yellow).
The price of this item:$100
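As a rough illustration of what the MPUNC regex above is meant to do, here is a minimal Python sketch that applies the same pattern to single words from the test sentences. It is not the link-grammar tokenizer: the function name `split_mpunc` is made up for this example, Python's `re` module stands in for PCRE2, and the LPUNC/RPUNC stripping of parentheses and the final period is not modeled.

```python
import re

# Same pattern as the affix entry, without the surrounding slashes.
MPUNC_RE = re.compile(r"(?<!\d)[,:]|[,:](?!\d)")

def split_mpunc(word):
    """Split a word at every match, keeping the punctuation as its own token.

    Hypothetical helper for illustration only.
    """
    tokens = []
    pos = 0
    for m in MPUNC_RE.finditer(word):
        if m.start() > pos:
            tokens.append(word[pos:m.start()])
        tokens.append(m.group())
        pos = m.end()
    if pos < len(word):
        tokens.append(word[pos:])
    return tokens

for w in ["125,000", "13:42", "(red,blue,yellow).", "item:$100"]:
    print(w, "->", split_mpunc(w))
```

The comma in "125,000" and the colon in "13:42" have digits on both sides, so neither alternative matches and the words stay whole; the other two words are split at each comma or colon.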
LPUNC and RPUNC also support regexes, and I tested them with /^[[:punct:]]/ and /[[:punct:]]$/ (respectively) in amy.
However, there is a problem: this works only when configured with PCRE2. When configured with the C++ regex library, compilation of the lookbehind regex fails (lookbehind is not supported by C++). POSIX regexes (C library and TRE) also fail. (This is not really a problem for amy etc., since we don't need to support other regex libraries there.)
Possible solutions:
1. Distribute it with commented-out affix regexes and that's all.
2. Use autoconf to enable PCRE2 regexes if configured with PCRE2.
3. Add configuration file support for '#if SOMETHING' when SOMETHING is HAVE_PCRE2_H.
4. Only support PCRE2 on POSIX systems. (BTW, it is now easy for me to add PCRE2 support on MS-Windows too.)
5. Add support for regex library specification (easy to implement):
"/(?<!\d)[,:]|[,:](?!\d)/PCRE2"
(or even flag "e" for "extended").
I am for (5) and otherwise for (2) or (1).
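To make the library-specification idea concrete, here is a small Python sketch of how an entry with a trailing library name could be taken apart. The syntax is only the proposal from this comment, and the helper name `parse_affix_regex` is hypothetical; nothing here reflects existing link-grammar code.

```python
def parse_affix_regex(entry):
    """Split "/PATTERN/LIB" into (pattern, library).

    Hypothetical helper: library is None when nothing follows the
    closing slash, meaning "any configured regex library".
    """
    if not entry.startswith("/"):
        raise ValueError("not a regex entry: " + entry)
    end = entry.rfind("/")
    if end == 0:
        raise ValueError("missing closing slash: " + entry)
    pattern = entry[1:end]
    library = entry[end + 1:] or None
    return pattern, library

print(parse_affix_regex(r"/(?<!\d)[,:]|[,:](?!\d)/PCRE2"))
print(parse_affix_regex("/[[:punct:]]$/"))
```

A reader of the affix file could then skip (or report) entries whose library is not the one the build was configured with.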
EDIT: Fix the POSIX regex.
I found a better solution that all the regex libraries support:
Instead of lookahead/lookbehind, use a capture group for the matching part.
E.g., instead of:
"/(?<!\d)[,:]|[,:](?!\d)/"
use a POSIX regex:
"/\d([,:]|[,:])\d/"
I will change the code to support this too.
EDIT yet again:
"/\D([,:]|[,:])\D/"
EDIT: \D didn't work for me, but [^^d] did.
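The capture-group idea can be sketched in Python as follows. This is only an illustration under assumptions: the function name `split_by_capture` is invented, `[^0-9]` is used in place of `\D` so the pattern stays within what POSIX-style libraries accept, and the redundant alternation inside the group is reduced to a single `[,:]`. The context characters are matched by ordinary character classes, and only the span of group 1 is split out.

```python
import re

# Non-digit context on both sides; only group 1 is the split point.
CAPTURE_RE = re.compile(r"[^0-9]([,:])[^0-9]")

def split_by_capture(word):
    """Split a word at the span of capture group 1 of each match.

    Hypothetical helper for illustration only.
    """
    tokens = []
    pos = 0      # start of the token still being collected
    search = 0   # where the next regex search starts
    while True:
        m = CAPTURE_RE.search(word, search)
        if m is None:
            break
        start, end = m.span(1)
        if start > pos:
            tokens.append(word[pos:start])
        tokens.append(word[start:end])
        pos = end
        # The context characters were consumed by the match, so resume at
        # the captured punctuation rather than at m.end(); otherwise the
        # second comma in "a,b,c" would be missed.
        search = start
    if pos < len(word):
        tokens.append(word[pos:])
    return tokens

for w in ["125,000", "13:42", "(red,blue,yellow).", "item:$100", "x:5"]:
    print(w, "->", split_by_capture(w))
```

Note that this form is not an exact equivalent of the lookaround regex: it needs a character on each side of the punctuation, and both neighbours must be non-digits, so a word like "x:5" is left unsplit here while the lookaround version would split it.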
@linas, to solve the split problem you pointed out in your comment on MPUNC, I implemented an MPUNC regex mechanism that uses lookahead/lookbehind (directly or indirectly) in an attempt not to split numbers with commas or times with colons. It works.
However, it seems there is a simpler solution that doesn't use a regex affix: use ":" and "," in MPUNC, and just don't MPUNC-split words that match a regex (in contrast to morpheme-split, which is done before trying a regex).
I will try to implement that, and for now keep the MPUNC regex for the sake of any/ady/amy (as a simple split on [[:punct:]]).
Another thing: the corpus test sentences I used are not good enough. If they are not split as intended, they still parse fine, because the word with an internal colon or comma is looked up as UNKNOWN-WORD, and it is hard to find sentences that fail to parse in that case. This is a general problem that causes sentences with junk to get parsed just fine. Does this need a solution? If so, should we just have a regex category JUNK with no possible linkage, for words with junk in them?
I said above:
just don't MPUNC-split words that match a regex [...] I will try to implement that, [...]
If a word contains two kinds of punctuation, one that has to be separated and one that should not be, this approach wouldn't work, since the word as a whole can only either match or not match a regex. So I will send the PR that splits by affix regexes.
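A small Python sketch of the difference, under assumptions: `NUMBER_RE` is a made-up stand-in for a dictionary regex that recognizes numbers, and both function names are hypothetical. The first function decides once per word (skip the split if the whole word matches); the second decides per punctuation character, as the affix regex does.

```python
import re

# Hypothetical stand-in for a dictionary regex that recognizes numbers.
NUMBER_RE = re.compile(r"\d{1,3}(,\d{3})*")
# Plain MPUNC characters, captured so re.split keeps them as tokens.
PUNCT_RE = re.compile(r"([,:])")
# The lookaround affix regex, wrapped in one capture group for re.split.
AFFIX_RE = re.compile(r"((?<!\d)[,:]|[,:](?!\d))")

def split_unless_word_matches(word):
    """Whole-word decision: no split at all if the word matches the regex."""
    if NUMBER_RE.fullmatch(word):
        return [word]
    return [t for t in PUNCT_RE.split(word) if t]

def split_by_affix_regex(word):
    """Per-character decision: split only where the affix regex matches."""
    return [t for t in AFFIX_RE.split(word) if t]

word = "weight:125,000"
print(split_unless_word_matches(word))
print(split_by_affix_regex(word))
```

For "weight:125,000" the whole-word rule cannot protect the comma, because the word as a whole is not a number, so it is split at both the colon and the comma; the affix regex splits at the colon only.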
EDIT:
just don't MPUNC-split words that match a regex [...]
This was a bad idea since in general such matches have nothing to do with word splits.