Notepad2e icon indicating copy to clipboard operation
Notepad2e copied to clipboard

Recursive Regexp Replace on $

Open ProgerXP opened this issue 5 years ago • 13 comments
trafficstars

Document:

aa
bb

Run Replace from $ to \n in regexp mode - the program will loop forever. Desired result:

aa

bb

ProgerXP avatar May 29 '20 11:05 ProgerXP

Additional cases:

Search string: $ Replace with: \n\1\1\n

aa


bb


Search string: (.*)$ Replace with: \n\1\1\n


aaaa


bbbb

cshnik avatar Aug 17 '20 20:08 cshnik

Fixed.

cshnik avatar Aug 17 '20 20:08 cshnik

`const int isRegexEOL = (szFind2[iFind2Length - 1] == '$') && ((iFind2Length < 2) || (szFind2[iFind2Length - 2] != '\\'));`

We should never try to parse regexps manually. This is not a universal fix. Can't this be fixed properly?

ProgerXP avatar Aug 18 '20 20:08 ProgerXP

I'm not aware of the way how this could be done without manual target range adjustment due to the following reason:

  1. RegEx works in line-by-line mode (see BoostRegexSearch::FindTextImpl()).
  2. $ is not EOL char, but the last position on line.

Thus, regex replace with a string containing \n will not automatically adjust the target range (will not skip the EOL from the current line) which caused specified hang.

Current fix allow regex to deal with multiline replace strings by manual next range calculation:

  1. If replace string contains EOLs then we should skip specific number of lines after regex replace.
  2. If find string has trailing $ than we should skip additional EOL (the one mentioned in the previous paragraph).

cshnik avatar Aug 19 '20 19:08 cshnik

It's true that $ is "0 characters long". However, we can go from the other side: instead of parsing the regexp (which we can never do correctly), see if there are line breaks in the replacement string and add this number of bytes to the match position.

This might fail if line breaks are retrieved from a capture group ((\n) -> $1) but this is a much more rare case than various combinations of regexp containing $ ($|foo, (foo|$), etc. etc.).

ProgerXP avatar Sep 05 '20 13:09 ProgerXP

instead of parsing the regexp (which we can never do correctly)

`const int isRegexEOL = (szFind2[iFind2Length - 1] == '$') && ((iFind2Length < 2) || (szFind2[iFind2Length - 2] != '\\'));`

This is not actual parsing. We only should detect if trailing character is unquoted $:

  1. Last char is $
  2. There is no other chars OR previous char is not \

This might fail if line breaks are retrieved from a capture group ((\n) -> $1) but this is a much more rare case than various combinations of regexp containing $ ($|foo, (foo|$), etc. etc.).

  1. None of your regex samples will trigger isRegexEOL = true-logic.
  2. The problem remains (hang on recursive regex replace) when using my samples from above.

Please confirm if current solution does not fit.

cshnik avatar Sep 16 '20 17:09 cshnik

Update: there is a missing case in current isRegexEOL detection logic: \\$ - regex to find line consisting of single \ will not be correctly processed since isRegexEOL=false.

Additional condition should be added: number of serial \ before $ should be even.

cshnik avatar Sep 16 '20 18:09 cshnik

Another bug that likely has same root cause:

a@b@c
a@b@c

Do Replace from ^[^@]+@ to empty string in regexp mode. Result:

c
c

Instead of:

b@c
b@c

In other words, regexp must never be applied more than once to the same region in the source file. Here it is applied to each line in a loop until there are no more matches in that line. This is incorrect, check tools like sed that do it properly:

(echo a@b@c; echo a@b@c) | sed -r 's/^[^@]+@//g'

ProgerXP avatar Oct 25 '20 18:10 ProgerXP

Additional test cases:

a@b@c
1
abc@b@c
2

Replace from: ^[^@]+@.*$ Replace to: \n

Result:



1


2

Replace from: ^([^@]+@.*)$ Replace to: \1X\n\1

Result:

a@b@cX
a@b@c
1
abc@b@cX
abc@b@c
2

Fixed.

cshnik avatar Oct 28 '20 18:10 cshnik

Document:

aa bb
aa bb

Replace from .* to z in regexp mode. Result:

zz
z

Instead of (.* is greedy, matches entire line):

z
z

ProgerXP avatar Oct 30 '20 20:10 ProgerXP

There was a change in RESearchRange class (came from Scintilla 3.11.2) which caused specified incorrect behavior when using BoostRegExSearch-implementation. Fixed.

cshnik avatar Nov 02 '20 17:11 cshnik

Any news on when “upcoming release”, which includes this fix, will be available as a binary?

sergeevabc avatar Apr 11 '21 11:04 sergeevabc

@sergeevabc Any news on when “upcoming release”, which includes this fix, will be available as a binary?

These fixes are already in beta builds that you can download from here. However, this is really tricky, I am still running into multiple problems with current implementation and I'm considering rolling it back entirely and try addressing the original issue ("Replace on $") in some other way.

What is your use case?

ProgerXP avatar Apr 11 '21 14:04 ProgerXP