Regex \x{0000}+ sometimes doesn't work

Open krilbe opened this issue 4 years ago • 7 comments

Description of the Issue

In a file I receive from an external source, updated monthly, there are a few occurrences of NUL, i.e. character codepoint 0 (zero). I need to clean them out, so I do a regex search for \x{0000}+. When I do that search from the top of the file, NP++ doesn't find anything and also shows an error in the status bar at the bottom of the search dialog, saying it's an invalid regex.

Screen dump

When I search for \x{0000} without the + at the end, it does find the NUL characters. It also finds them if I search for \x{0000}+ but start further down in the file. The file is ANSI encoded, about 460000 lines long and about 55-60 MB in size. The NUL characters appear around line 425000. The search for \x{0000}+ starts working from about line 23450 (and below). Above that, it never works. I can't see anything special about the file content around line 23450.

I may be able to provide a copy of the file for testing, but I'll have to ask a partner company for permission.

I've tried with e.g. \x{0030}+ (find any sequence of zeroes, character "0"). That seems to work everywhere.

Steps to Reproduce the Issue

  1. Open the affected file, go to top of file and search for regex \x{0000}+
  2. The search will not find anything and say the regex is invalid.
  3. Go to line 50000 or below and try again.
  4. The search will find the NUL characters.

Expected Behavior

The search should find the NUL characters regardless of where the search is started.

Actual Behavior

The search doesn't find the NUL characters if search is started at the top of the file (and down to about line 23450).

Debug Information

Notepad++ v8.2.1 (64-bit)
Build time : Jan 19 2022 - 18:43:05
Path : C:\Program Files\Notepad++\notepad++.exe
Command Line :
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
OS Name : Windows 10 Pro (64-bit)
OS Version : 2009
OS Build : 19044.1566
Current ANSI codepage : 1252
Plugins : mimeTools.dll NppConverter.dll NppExport.dll

krilbe avatar Mar 05 '22 12:03 krilbe

What does the hover bubble show when you hover over this?:

image

A misc. note is that regex searching for null characters has always had problems, so this isn't anything new (and probably foretells how likely this is to get fixed).

Notepad++ is a text editor, not a general-purpose hex editor; perhaps a hex editor program is a better choice when dealing with nulls.
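If the goal is only to clean the NULs out of the monthly file, a small script outside the editor may be enough. The following is a minimal sketch, not anything Notepad++ provides: the function names `strip_nul_runs` and `clean_file` are made up for illustration, and it assumes the file fits comfortably in memory (55-60 MB should). It works on raw bytes, so the ANSI (codepage 1252) content is left untouched apart from the removed NULs.

```python
import re
from pathlib import Path
from typing import Tuple

# One or more consecutive NUL bytes: the byte-level analogue of \x{0000}+
NUL_RUN = re.compile(rb"\x00+")


def strip_nul_runs(data: bytes) -> Tuple[bytes, int]:
    """Remove every run of NUL bytes; return (cleaned data, number of runs removed)."""
    return NUL_RUN.subn(b"", data)


def clean_file(src: str, dst: str) -> int:
    """Write a NUL-free copy of src to dst (hypothetical helper); return runs removed."""
    cleaned, runs = strip_nul_runs(Path(src).read_bytes())
    Path(dst).write_bytes(cleaned)
    return runs
```

Calling `clean_file("input.txt", "output.txt")` (both paths are placeholders) writes the cleaned copy and returns how many NUL runs were removed, which gives a quick check that the expected occurrences were actually found.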

alankilborn avatar Mar 05 '22 12:03 alankilborn

image

krilbe avatar Mar 05 '22 12:03 krilbe

So you understand from that info that the regular expression engine abandoned trying to do your search, right?

alankilborn avatar Mar 05 '22 13:03 alankilborn

Apparently, yes. But it shouldn't really... Any similar search for something not to do with NUL works fine, so it's not a general limitation re. file size, time to perform the search or any such thing. It's a bug related to NUL. Understood that it may not be likely to get fixed, but a bug it is. At least now it's known.

krilbe avatar Mar 05 '22 13:03 krilbe

For sure there are bugs here. It's funny but sometimes when I add info just for discussion purposes people think I am trying to deny bug existence -- not true.

alankilborn avatar Mar 05 '22 13:03 alankilborn

I have a regex that generates the same message, but only the second time it is run in a file. Also, the problem only occurs in certain files. I've so far been unable to get an example of the bug under 182 lines. I don't mind posting the entire file, but I don't want to spam anything or hijack a thread. The regex I'm using is this: (ge)( 2:1)|.*\K(?1).*(?2)|.*\K(?2).*(?1). Do you want more details here, or should I start a new thread?

cjbarth avatar Dec 29 '22 18:12 cjbarth

@cjbarth If you're inclined to, you should open a new issue as yours isn't null-related like the title of this one. However, if the regex engine can tell you that your expression plus your data is problematic, there isn't a lot that can be done.

alankilborn avatar Dec 29 '22 18:12 alankilborn