Add option to make regular expression --case-sensitive
Is there a way to run BE with case-sensitive regexes? By default it searches for all matches case-insensitively.
I have tried changing the value in be20_api/regex_vector.cpp and rebuilding, but it still returns case-insensitive results:
void regex_vector::push_back(const std::string& val) {
    RE2::Options options;
    options.set_case_sensitive(true);  // the value I changed
    // ... (rest of the function omitted)
}
RE2 also accepts an inline (?i) flag, placed at the start of the regex, to make an individual pattern case-insensitive.
A --case-insensitive flag could be useful, with BE defaulting to case-sensitive matching.
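To illustrate the proposed switch, here is a minimal sketch of a helper that compiles a pattern case-sensitively unless told otherwise. It is not bulk_extractor code: `compile_pattern` and its `case_insensitive` parameter are hypothetical names, and std::regex is swapped in for RE2 so the example builds with only the standard library. With RE2, the equivalent knob is the `RE2::Options::set_case_sensitive()` call shown above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of bulk_extractor): compile a pattern,
// matching case-sensitively unless case_insensitive is true, as a
// --case-insensitive command-line flag might request.
// std::regex stands in for RE2 here.
std::regex compile_pattern(const std::string& pattern, bool case_insensitive) {
    auto flags = std::regex::ECMAScript;
    if (case_insensitive) {
        flags |= std::regex::icase;  // ignore case when matching
    }
    return std::regex(pattern, flags);
}
```

With this sketch, `std::regex_search("Alice@Example.com", compile_pattern("alice@example\\.com", false))` finds nothing, while passing `true` as the second argument makes the same search succeed.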
It's very rare to want to do that, but we could add it as an option.
The main thing I've been trying to do is find an RE2 replacement that builds under MinGW, or even get RE2 to build under MinGW on docker.
The option would be a nice addition! Since a regex can be specified, it could be good to include parsing of regex flags for added customisation.
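As a rough idea of what parsing a regex flag could look like, the sketch below recognises only a leading (?i) and reports whether it was present. `ParsedPattern` and `parse_leading_flags` are hypothetical names, not existing bulk_extractor or RE2 APIs. RE2 already interprets (?i) on its own, so a step like this would only matter if BE itself needed to know the flag, or used an engine without inline flags.

```cpp
#include <string>

// Hypothetical result type: the pattern with any leading "(?i)" removed,
// plus whether that flag was present.
struct ParsedPattern {
    std::string pattern;
    bool case_insensitive;
};

// Recognise only a leading "(?i)"; any other inline flag is left in the
// pattern untouched for the regex engine to handle.
ParsedPattern parse_leading_flags(const std::string& raw) {
    const std::string prefix = "(?i)";
    if (raw.compare(0, prefix.size(), prefix) == 0) {
        return {raw.substr(prefix.size()), true};
    }
    return {raw, false};
}
```

For example, `parse_leading_flags("(?i)password")` yields the pattern `password` with `case_insensitive` set, while `parse_leading_flags("password")` returns the pattern unchanged with the flag cleared.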
I think YARA would be a great alternative to RE2, as it can be installed on Windows and allows for importing custom rules. I see you have raised this as a potential idea before: https://github.com/simsong/bulk_extractor/issues/320
In relation to this issue, the YARA documentation describes a built-in solution: adding the nocase modifier to a string in a rule makes it case-insensitive; otherwise the regex is case-sensitive. See https://yara.readthedocs.io/en/v3.4.0/writingrules.html#case-insensitive-strings
That's a fascinating idea. Unfortunately, YARA's REs are based on PCRE, so it will still die if someone gives us a RE like .*@company.com.
Would you like to add YARA to bulk_extractor?
I see! Maybe the regex could be sanitised before it is run, to handle edge cases like this. I'm under the impression that there is a limit of 64 characters for the start (the local part) of an email address, as per RFC 5321 (https://datatracker.ietf.org/doc/html/rfc5321#section-4.5.3.1.1), but I understand that the problem in general may carry over. A warning message if BE detects this regex pattern may even suffice?
I think it would definitely solve some issues as well as allow for greater customisation!
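A warning of the kind suggested here could start from a very naive check like the one below. This is only a sketch under stated assumptions: `starts_with_unbounded_wildcard` is a hypothetical helper, and it catches just one pattern shape (a leading `.*` or `.+`, optionally after `^`). It does not detect slow patterns in general.

```cpp
#include <cstddef>
#include <string>

// Naive heuristic (hypothetical, not bulk_extractor code): report patterns
// that begin with an unbounded wildcard such as ".*" or ".+", optionally
// preceded by a "^" anchor. A bounded form like ".{1,64}" is not flagged.
bool starts_with_unbounded_wildcard(const std::string& pattern) {
    std::size_t i = 0;
    if (i < pattern.size() && pattern[i] == '^') {
        ++i;  // skip a leading anchor
    }
    if (i + 1 >= pattern.size()) {
        return false;  // too short to hold "." plus a quantifier
    }
    return pattern[i] == '.' &&
           (pattern[i + 1] == '*' || pattern[i + 1] == '+');
}
```

Under this check, `.*@company.com` would trigger the warning, whereas a bounded rewrite such as `.{1,64}@company\.com` (using the 64-character limit mentioned above) would not.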
It’s technically difficult to detect regular expressions that require an unlimited amount of time. Bulk_extractor feeds 16MB pages to the RE engine. If you want poor performance, you could feed 1KB pages. I experimented with this and wasn’t pleased with the results.
RE2 is a non-backtracking RE engine which doesn’t have this problem. It’s more complex.
I’m still excited for you to add YARA to BE.
RE2 is definitely a great solution and makes bulk_extractor work very well. Would you be imagining a solution that makes use of both YARA and RE2? For example, should YARA mimic the existing processes with built-in rules, or should the rules be imported by a user?
It sounds like an exciting project! I have quite a heavy workload at the moment, so it might take a couple of months until I can give it my full attention. I'm happy to take on the project, but I feel it's necessary to highlight the potential delay. In the meantime, if anyone else would like to work with me on it, I'm more than happy to collaborate.
Would you be imagining a solution that makes use of both YARA and RE2 ? e.g. should YARA mimic the existing processes with rules built in or should the rules be imported by a user?
I don't know. You would need to implement YARA and then performance-test and tune it.
Thanks for the offer to help. Let me know if you want any guidance or further suggestions.