bulk_extractor

Add option to make regular expression --case-sensitive

Open h-ssh opened this issue 1 year ago • 8 comments

Is there a way to run BE with case-sensitive regular expressions? By default, all matching is case insensitive.

I have tried changing the value in be20_api/regex_vector.cpp and rebuilding, but it still returns case-insensitive results:

void regex_vector::push_back(const std::string& val) {
    RE2::Options options;
    options.set_case_sensitive(true);

RE2 allows prefixing a regex with the inline flag (?i) to make it case insensitive. BE could therefore default to case-sensitive matching, with a --case-insensitive flag for users who want the current behaviour.

Reference

RE2 ignore case
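
As a rough illustration of what such a switch would do, here is a minimal sketch that compiles a pattern with or without case folding based on a boolean. It uses `std::regex` as a stand-in so it builds without RE2; `compile_pattern` and `found` are hypothetical names, not functions from bulk_extractor or be20_api. With RE2, the corresponding knobs would be `RE2::Options::set_case_sensitive` or a `(?i)` prefix on the pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of bulk_extractor): compile a pattern while
// honouring a case-sensitivity switch, as a command-line flag might request.
std::regex compile_pattern(const std::string& pattern, bool case_sensitive) {
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (!case_sensitive) {
        flags |= std::regex::icase;  // fold case only when asked to
    }
    return std::regex(pattern, flags);
}

// Hypothetical helper: true if the pattern occurs anywhere in the text.
bool found(const std::string& text, const std::string& pattern,
           bool case_sensitive) {
    return std::regex_search(text, compile_pattern(pattern, case_sensitive));
}
```

Calling `found("Contact: ADMIN@EXAMPLE.COM", "admin@example", false)` matches, while the same call with `true` does not. Note that `std::regex` is a backtracking engine, so this only illustrates the flag handling, not RE2's performance characteristics.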

h-ssh avatar Jan 07 '25 17:01 h-ssh

It's very rare to want to do that, but we could add it as an option.

simsong avatar Jan 07 '25 17:01 simsong

The main thing I've been trying to do is find an RE2 replacement that builds under MinGW, or even get RE2 to build under MinGW on docker.

simsong avatar Jan 07 '25 17:01 simsong

The option would be a nice addition! Since a regex can be specified, it could also be good to parse regex flags for added customisation.

I think YARA would be a great alternative to RE2, as it can be installed on Windows and allows for importing custom rules. I see you have raised this as a potential idea before: https://github.com/simsong/bulk_extractor/issues/320

In relation to this issue, the YARA documentation describes a solution built in to each rule: adding the modifier nocase makes a string case insensitive; otherwise it is case sensitive. See https://yara.readthedocs.io/en/v3.4.0/writingrules.html#case-insensitive-strings

h-ssh avatar Jan 08 '25 13:01 h-ssh

That's a fascinating idea. Unfortunately, YARA's REs are based on PCRE, so it will still die if someone gives us a RE like .*@company.com.

Would you like to add YARA to bulk_extractor?

simsong avatar Jan 08 '25 14:01 simsong

I see! Maybe the regex could be sanitised before it is run, to catch edge cases like this. I'm under the impression that the local part of an email address (the portion before the @) is limited to 64 characters, as per RFC 5321 (https://datatracker.ietf.org/doc/html/rfc5321#section-4.5.3.1.1), but I understand that the problem in general may carry over. A warning message when BE detects this kind of regex pattern may even suffice?

I think it would definitely solve some issues as well as allow for greater customisation!
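
The warning idea could be sketched roughly as below. This is a narrow heuristic under stated assumptions, not a general detector of slow regular expressions: it only flags an unescaped `.*` or `.+` outside a character class, and the name `has_unbounded_dot` is hypothetical. A pattern it flags could be rewritten with a bound, for example `.{1,64}@company\.com`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of bulk_extractor): returns true if the
// pattern contains an unescaped ".*" or ".+" outside a character class.
// Heuristic only: it does not understand every regex dialect or corner case
// (for example, a literal ']' placed first inside a character class).
bool has_unbounded_dot(const std::string& pattern) {
    bool in_class = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') {          // skip the escaped character
            ++i;
            continue;
        }
        if (in_class) {           // inside [...], '.' is a literal
            if (c == ']') in_class = false;
            continue;
        }
        if (c == '[') {
            in_class = true;
            continue;
        }
        if (c == '.' && i + 1 < pattern.size() &&
            (pattern[i + 1] == '*' || pattern[i + 1] == '+')) {
            return true;          // unbounded "any character" repetition
        }
    }
    return false;
}
```

A caller could print a warning when `has_unbounded_dot(".*@company.com")` returns true and let the scan proceed anyway.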

h-ssh avatar Jan 08 '25 15:01 h-ssh

It’s technically difficult to detect regular expressions that require an unlimited amount of time. Bulk_extractor feeds 16MB pages to the RE engine. If you want poor performance, you could feed 1KB pages. I experimented with this and wasn’t pleased with the results.

RE2 is a non-backtracking RE engine which doesn’t have this problem. It’s more complex.
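
To make the cost difference concrete, here is a toy model (an assumption for illustration, not how PCRE, YARA, or RE2 are actually implemented) that counts character comparisons when searching for the pattern `.*@` in a buffer. The backtracking version lets the greedy `.*` run to the end of the buffer from every start offset and then backs off; the linear version makes a single pass. It also assumes `.` matches every byte, which real engines usually restrict at newlines.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Toy backtracking search for ".*@": from each start offset the greedy ".*"
// consumes the rest of the buffer, then backs off one character at a time
// looking for '@'. Returns the number of character comparisons performed.
std::uint64_t backtracking_steps(const std::string& buf) {
    std::uint64_t steps = 0;
    for (std::size_t start = 0; start < buf.size(); ++start) {
        for (std::size_t pos = buf.size(); pos > start; --pos) {
            ++steps;
            if (buf[pos - 1] == '@') return steps;  // match found, stop
        }
    }
    return steps;  // no '@' anywhere: n * (n + 1) / 2 comparisons
}

// Single left-to-right pass, the kind of linear bound an automaton-based
// engine can offer for the same question.
std::uint64_t linear_steps(const std::string& buf) {
    std::uint64_t steps = 0;
    for (const char c : buf) {
        ++steps;
        if (c == '@') break;
    }
    return steps;
}
```

For a buffer of n bytes with no `@`, the toy backtracking count is n(n + 1)/2. Under this model a 16 MiB page (16,777,216 bytes) would need roughly 1.4 × 10^14 comparisons, against about 1.7 × 10^7 for the single pass.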

I’m still excited for you to add YARA to BE.

simsong avatar Jan 08 '25 15:01 simsong

RE2 is definitely a great solution and makes bulk_extractor work very well. Are you imagining a solution that makes use of both YARA and RE2? For example, should YARA mimic the existing processes with built-in rules, or should the rules be imported by the user?

It sounds like an exciting project! I have quite a heavy workload at the moment, so it might take a couple of months until I can give it my full attention. I'm happy to take on the project, but I feel it's necessary to highlight the potential delay. In the meantime, if anyone else would like to work with me on it, I'm more than happy to collaborate.

h-ssh avatar Jan 14 '25 11:01 h-ssh

Are you imagining a solution that makes use of both YARA and RE2? For example, should YARA mimic the existing processes with built-in rules, or should the rules be imported by the user?

I don't know. You would need to implement YARA and then performance-test and tune it.

Thanks for the offer to help. Let me know if you want any guidance or further suggestions.

simsong avatar Jan 14 '25 14:01 simsong