Predefined character class replacement inside square brackets is incorrect
Generex currently replaces each predefined character class with its expansion wrapped in square brackets: \d becomes [0-9].
However, if the \d is already inside a character class expression, [\d] becomes [[0-9]]. java.util.regex.Pattern handles this correctly, but dk.brics.automaton.Automaton, which Generex uses, does not.
A simple regex replacement is apparently not enough. It looks like a contextual replacement is needed: scan char by char, tracking whether \d is inside [..] and whether the \ has already been escaped.
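A minimal sketch of such a contextual pass, in Java. The class and method names (PredefinedClassExpander, expandDigits) are hypothetical and not part of Generex; the sketch handles only \d and assumes character classes are not nested.

```java
public final class PredefinedClassExpander {

    private PredefinedClassExpander() {}

    /**
     * Expands unescaped \d to 0-9 inside [...] and to [0-9] outside it.
     * Hypothetical helper: only \d is handled, nested classes are not.
     */
    public static String expandDigits(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false; // true while between an unescaped '[' and ']'
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i); // consume the escaped character
                if (next == 'd') {
                    out.append(inClass ? "0-9" : "[0-9]");
                } else {
                    out.append(c).append(next); // keep other escapes such as \\ or \]
                }
            } else {
                if (c == '[') {
                    inClass = true;
                } else if (c == ']') {
                    inClass = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because every backslash consumes the character after it, an escaped backslash followed by d is never mistaken for \d.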
Input: [\d] (Java String literal "[\\d]")
Expected output:
- transformed regex:
[0-9]
- all matched strings:
0
1
2
3
4
5
6
7
8
9
Actual output:
- transformed regex:
[[0-9]]
- all matched strings:
0]
1]
2]
3]
4]
5]
6]
7]
8]
9]
[]
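For comparison, a small check of how java.util.regex.Pattern treats the same transformed regex: it reads [[0-9]] as a nested class (a union), so only single digits match. The class name NestedClassCheck is hypothetical and used for illustration only.

```java
import java.util.regex.Pattern;

public final class NestedClassCheck {

    private NestedClassCheck() {}

    /** Returns whether the whole input matches the transformed regex [[0-9]]. */
    public static boolean javaMatches(String input) {
        return Pattern.matches("[[0-9]]", input);
    }

    public static void main(String[] args) {
        System.out.println(javaMatches("7"));  // a single digit
        System.out.println(javaMatches("7]")); // one of the strings listed above
    }
}
```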
Problem
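It should also be taken into consideration that the backslash of the predefined character class may itself have been escaped. In that case the sequence is a literal backslash followed by d and must not be replaced.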
Input: \\d (Java string literal "\\\\d")
Expected output:
- transformed regex:
\\d (no change)
- matched string:
\d
Actual output:
- transformed regex:
\[0-9]
- matched string:
[0-9]
Proposed solution
Pattern.compile(
"(?<!\\\\)" + // (?<!\\) no preceding backslash allowed
"(?<slashes>(\\\\\\\\)*)" + // (\\\\)* literal backslash allowed {0,} times
"\\\\d" // \\d single backslash and the character class letter
);
Don't forget to include the captured slashes group in the replacement. The backslash pairs have to be captured rather than placed inside the negative look-behind, because a look-behind does not allow unbounded repetition (*).
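A sketch of how the pattern above might be applied, re-emitting the captured backslash pairs through the named-group reference ${slashes}. The class and method names (SlashAwareReplace, replaceDigits) are hypothetical and not part of Generex.

```java
import java.util.regex.Pattern;

public final class SlashAwareReplace {

    // Same pattern as proposed above: no preceding backslash,
    // any number of backslash pairs, then \d.
    private static final Pattern DIGIT = Pattern.compile(
            "(?<!\\\\)"
            + "(?<slashes>(\\\\\\\\)*)"
            + "\\\\d");

    private SlashAwareReplace() {}

    /** Replaces unescaped \d with [0-9], keeping preceding backslash pairs. */
    public static String replaceDigits(String regex) {
        return DIGIT.matcher(regex).replaceAll("${slashes}[0-9]");
    }
}
```

This covers only the escaping problem. A \d inside square brackets would still come out as [[0-9]], so the in-class case needs separate handling.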