sioyek icon indicating copy to clipboard operation
sioyek copied to clipboard

Regex search causes stack overflow with super_fast_search enabled

Open aw-cloud opened this issue 7 months ago • 10 comments

Seems to be an old and well-documented issue with std::regex, see for example bug report.

Just based on usage, it seems like super_fast_search restricts the regex to the current page? Presumably that's effectively limiting the recursive calls inside std::regex_search so it's unlikely to crash. Is that the expected behaviour for super_fast_search? Seems a bit limiting to only be able to search the current page. At least it doesn't crash though.

There are alternate regex libraries such as Boost's regex or PCRE which don't have this flaw, could be worth moving to one of those or similar?

aw-cloud avatar May 16 '25 16:05 aw-cloud

No, super_fast_search does not restrict the regex to the current page. Why did you get that impression?

I don't plan to use third-party libraries for regex search as it is a relatively niche feature.

ahrm avatar May 16 '25 17:05 ahrm

Sorry, I meant to write super_fast_search 0, i.e. with it turned off regexes don't seem to match across page boundaries.

aw-cloud avatar May 16 '25 17:05 aw-cloud

Yes that is true, with super fast search turned off the regexes don't match across page boundaries.

ahrm avatar May 16 '25 17:05 ahrm

Is there a particular reason it works that way? Maybe regexes could always behave that way, at least by default, to avoid crashes until the bug gets fixed in the stl? Or is it so niche of a feature that I'm the only one who cares?

I'm happy to contribute code too, I'd just ask that you point me in the right direction so I don't implement something you wouldn't consider merging.

aw-cloud avatar May 16 '25 18:05 aw-cloud

The problem is that searching page-by-page reduces the search speed (which is the whole point of super fast index). I suggest compiling sioyek with clang, it is unlikely to have this bug.

ahrm avatar May 16 '25 19:05 ahrm

That's true which is why I originally suggested using something like boost. It would essentially keep the functionality and API the same but switch to a safer implementation.

I'm not sure why you think using clang would make a difference? The bug is in the regex library, nothing to do with the compiler as far as I can see. I tried it anyway in case it had some fancy stack protection features hidden away but still get crashes. I could be missing something though, I usually use gcc/g++.

aw-cloud avatar May 16 '25 20:05 aw-cloud

Doesn't clang use use its own STL implementation? (I am not sure about that but if that is the case it is unlikely to contain this bug which is in gcc's stl implementation).

By the way can you share an example of a document and a query that causes the crash? Because I can't reproduce it on my system (using clang).

ahrm avatar May 17 '25 06:05 ahrm

I did some research and LLVM does have its own STL implementation, libc++ vs gcc's libstdc++. I'm not sure that's the difference though as clang just links against whatever implementation you have unless you tell it otherwise. Qt6 is built against gcc's implementation too: Qt6 Docs. Gcc's reasoning for not implementing a fix was to preserve the existing ABI, so it's not surprising that trying to explicitly use LLVM, compiling with clang and linking with libc++, causes linktime errors because the ABIs don't match.

I wonder if something in CmakeLists.txt is producing different flags which affects the behaviour? Even though there's a line string(REPLACE "-mno-direct-extern-access" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") in the file, I still get that flag sent to the compiler. g++ just accepts it but I noticed it when clang complained about the unknown argument. Or maybe a difference between that and the build scripts? The linux script doesn't work for me though so I haven't been able to compare them.

I think every document I've tried that isn't tiny has triggered the bug. A basic .* pattern works. This Guide from TLDP is a concrete example.

aw-cloud avatar May 17 '25 12:05 aw-cloud

Yeah I can't reproduce the issue with clang (on mac), it just matches the whole document. Hopefully the issue will be solved soon on libstdc++ (honestly it is embarrassing that is has not been fixed in 5 years).

ahrm avatar May 17 '25 12:05 ahrm

Yeah hopefully. Maybe it'll take another 5 years though, at this rate who knows?

aw-cloud avatar May 17 '25 12:05 aw-cloud