Predefined character class replacement inside square brackets is incorrect

Open HawkSK opened this issue 5 years ago • 1 comment

Generex currently replaces predefined character classes by wrapping them in square brackets: \d becomes [0-9]. However, if the \d is already inside a character class expression, [\d] becomes [[0-9]], which java.util.regex.Pattern compiles correctly but dk.brics.automaton.Automaton (used by Generex) does not. A simple regex replacement is apparently not enough; it looks like contextual replacement is needed, i.e. scanning char by char to track whether the \d is inside [..] and whether the \ is itself already escaped.

Input: [\d] (Java String literal "[\\d]")

Expected output:

  • transformed regex [0-9]
  • all matched strings:
0
1
2
3
4
5
6
7
8
9

Actual output:

  • transformed regex [[0-9]]
  • all matched strings:
0]
1]
2]
3]
4]
5]
6]
7]
8]
9]
[]
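
The contextual replacement described above could be sketched roughly as follows. This is only an illustration, not Generex code: the class ContextualDigitReplacer and the method expandDigitClass are hypothetical names, only \d is handled, and edge cases such as a literal ] placed directly after [ are ignored.

```java
class ContextualDigitReplacer {

    // Hypothetical helper (not part of Generex): expands \d depending on
    // whether it appears inside or outside a [...] character class.
    static String expandDigitClass(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false; // true while scanning between '[' and ']'
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i); // consume the escaped character
                if (next == 'd') {
                    // Inside [...] emit the bare range, outside wrap it in brackets.
                    out.append(inClass ? "0-9" : "[0-9]");
                } else {
                    // Any other escape (including an escaped backslash) is copied as-is.
                    out.append(c).append(next);
                }
            } else {
                if (c == '[' && !inClass) {
                    inClass = true;
                } else if (c == ']' && inClass) {
                    inClass = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandDigitClass("[\\d]"));  // prints [0-9]
        System.out.println(expandDigitClass("\\d"));    // prints [0-9]
        System.out.println(expandDigitClass("\\\\d"));  // prints \\d (unchanged)
    }
}
```

Because each backslash consumes the character that follows it, an escaped backslash is never mistaken for the start of \d, so the same pass covers both the bracket context and the escaping.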

HawkSK, Aug 01 '20 13:08

Problem

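It should also be taken into consideration that the backslash in front of the character class letter could itself have been escaped. In that case \\d means a literal backslash followed by the letter d, not the predefined character class.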

Input: \\d (Java string literal "\\\\d")

Expected output:

  • transformed regex: \\d (no change)
  • matched string: \d

Actual output:

  • transformed regex: \[0-9]
  • matched string: [0-9]

Proposed solution

Pattern.compile(
	"(?<!\\\\)" +			// (?<!\\)	no preceding backslash allowed
	"(?<slashes>(\\\\\\\\)*)" +	// (\\\\)*	literal backslash allowed {0,} times
	"\\\\d"				// \\d		single backslash and the character class letter
);

Don't forget to include the captured slashes in the replacement. They have to be captured (in pairs) rather than checked in the look-behind, because the negative look-behind does not allow an unbounded * quantifier.
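
As a rough usage sketch of the pattern above, the replacement can refer to the named group so that the escaped backslashes are preserved. The class name EscapedDigitReplacement and the method replaceDigitClass are hypothetical, not Generex API, and this handles only the escaping problem; it does not address \d inside [..].

```java
import java.util.regex.Pattern;

class EscapedDigitReplacement {

    // Same pattern as proposed above: \d preceded by an even number of backslashes.
    static final Pattern UNESCAPED_DIGIT_CLASS = Pattern.compile(
        "(?<!\\\\)" +                 // no preceding backslash allowed
        "(?<slashes>(\\\\\\\\)*)" +   // escaped backslashes, captured in pairs
        "\\\\d");                     // single backslash and the class letter

    // Hypothetical helper: re-emits the captured pairs, then the expanded class.
    static String replaceDigitClass(String regex) {
        return UNESCAPED_DIGIT_CLASS.matcher(regex).replaceAll("${slashes}[0-9]");
    }

    public static void main(String[] args) {
        System.out.println(replaceDigitClass("\\d"));       // prints [0-9]
        System.out.println(replaceDigitClass("\\\\d"));     // prints \\d (no change)
        System.out.println(replaceDigitClass("\\\\\\d"));   // prints \\[0-9]
    }
}
```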

HawkSK, Aug 16 '20 19:08