Run yara-x rules against small blocks of sbufs
This is currently a draft. The idea is to link against the new yara-x library via its C API (https://virustotal.github.io/yara-x/docs/api/c/c-/) and then run rules against 4KB-sized chunks contained in each sbuf. Like lightgrep, yara invokes a callback when there's a match, and the name of the matching rule is written out to the feature recorder along with the pos_t of the 4KB block. (And, oops, I just realized I have a bug here; oh well, it's a draft.)
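The per-chunk scanning described above can be sketched as a small helper that yields (offset, size) pairs covering a buffer. This is a minimal sketch, assuming the scanner walks the sbuf in fixed 4KB strides; the function name and shape are illustrative, not the actual scan_yarax implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split a buffer of `len` bytes into fixed-size
// chunks (4KB by default), returning (offset, size) pairs. The final
// chunk may be shorter than the chunk size.
std::vector<std::pair<size_t, size_t>> chunk_offsets(size_t len, size_t chunk = 4096) {
    std::vector<std::pair<size_t, size_t>> out;
    for (size_t off = 0; off < len; off += chunk) {
        out.emplace_back(off, std::min(chunk, len - off));
    }
    return out;
}
```

Each offset, added to the sbuf's pos_t, would give the forensic path recorded with a match.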
What's left to do:
- take yara rules location from a CLI arg
- load yara rules from the CLI arg (finding all .yar files recursively if the path is a directory)
- fix pos_t bug above
- some unit testing
- more error-handling
- outputting the "namespace" of a matching rule along with its identifier
- outputting some more context??
- adding an option to change the block size from 4KB?? (an old Python script called page_brute used 4KB)
- documentation??????
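The second TODO item (finding all .yar files recursively if the CLI path is a directory) could be sketched with std::filesystem. This is a hedged sketch; the function name and sorted ordering are my assumptions, not the planned implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: given a path from a CLI arg, return it directly if
// it's a regular file; otherwise walk the directory tree and collect every
// .yar file. Sorted so rule loading order is deterministic.
std::vector<fs::path> collect_yar_files(const fs::path& p) {
    std::vector<fs::path> rules;
    if (fs::is_regular_file(p)) {
        rules.push_back(p);
    } else if (fs::is_directory(p)) {
        for (const auto& entry : fs::recursive_directory_iterator(p)) {
            if (entry.is_regular_file() && entry.path().extension() == ".yar") {
                rules.push_back(entry.path());
            }
        }
    }
    std::sort(rules.begin(), rules.end());
    return rules;
}
```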
Also, the yara_x.h header has a bug. It's valid C but uses a C++ reserved keyword as a variable name in a function argument. It's trivial to fix, but yara-x does not work out of the box at present. I reported the issue to the project.
Why 4KB chunks and not 16KB or even 64KB?
Commit summary (https://github.com/simsong/bulk_extractor/pull/487):
- a8e6941: configure should check for existence of yara_x_capi and link against it if enabled and present
- f079e26: create a stub scanner for scan_yarax, with the config symbol being HAVE_YARAX
- af88d0e: basic scanning of sbufs with yara_x implemented. This uses a hard-coded rule that is unlikely to generate matches, and it also does not use feature recorders to record any matches. But it does get called and pass tests with the address sanitizer. I will need to think of some ways to unit test this.
- bbd5cdf: create a yara-x feature recorder and write the names of matching rules with the block/forensic path of the data block that triggered the rule.
File changes (4 files):
- M configure.ac (33)
- M src/Makefile.am (1)
- M src/bulk_extractor_scanners.h (3)
- A src/scan_yarax.cpp (111)
Fair question. Yara rules are designed to be run against files, not raw data. The rule language reminds me of a certain infamous Richard Gabriel essay... A typical yara rule is propelled by multipattern string search, but then has conditions to evaluate any discovered matches. Those conditions may contain assertions about matches being at various offsets, byte values at various offsets (independent of string/pattern search), and file sizes.
It seems common practice for yara rules to make assertions of byte values at/around offset 0 as a poor man's file signature/magic matching routine (there are many rules asserting that the file starts with "PE" so as to apply only to Windows executables). It also seems like common practice to use file size to exclude rule matches on very large files (i.e., most assertions are size < X).
So, there's a fundamental mismatch here between yara rules (file-oriented) and bulk_extractor (data-oriented). I think choosing the relatively small 4KB size offers the best chance of discovering potential matches as it's the typical sector size and memory page size. However, it could be that some rules making assertions about file signatures are also expecting their pattern matches much further into the file. We may need to run the rules duplicatively in ~1MB chunks, too, to stand the best chance of triggering rule matches.
I hope to get a better idea of these tradeoffs by analyzing a few large open source yara rulebases.
I believe it is also possible to get yara to report the raw pattern matches in data, even if the rules don't match. A separate feature recorder can be made for these, so that even if a rule doesn't match (due to condition assertions) a user can inspect the pattern matches and potentially discover fragments of malware.
I think that you should make the page size a parameter so it can be tuned at runtime. Bulk_extractor's feature recorder system already does de-duplication of features that are reported twice. This is what the entire 'margin' system is about. Did you look at my implementation for the legacy regular expression matching system? It does most of what you are describing above --- breaking the image into chunks to give to the RE engine.
I don’t disagree about the need to specify a different block size at runtime. One question: is there a way with a feature recorder to provide a string (for example, the associated rule name with a pattern match) but have b_e carve surrounding context automatically?
Yes. Feature recorders can be set up to automatically carve when features are detected. See: https://github.com/simsong/be20_api/blob/872be6233f97db650c85ea13f379e4b9d12bad2b/feature_recorder.h#L70
/**
* Carving support.
*
* Carving writes the filename to the feature file and the contents to a file in the directory.
* The second field of the feature file is the file's hash using the provided function.
* Automatically de-duplicates.
* Carving is implemented in the abstract class so all feature recorders have access to it.
* Carve mode parameters were previously set in the INIT phase of each of the scanners. Now it is set for
* all feature recorders in the scanner_set after the scanners are initialized and the feature recorders are created.
* See scanner_set::apply_scanner_commands
*/
enum carve_mode_t {
    CARVE_NONE = 0,     // don't carve at all, even if the carve function is called
    CARVE_ENCODED = 1,  // only carve if the data being carved is encoded (e.g. BASE64 or GZIP is in path)
    CARVE_ALL = 2       // carve whenever the carve function is called.
} default_carve_mode {CARVE_ALL};
size_t min_carve_size {200};
size_t max_carve_size {16*1024*1024};
However, you may need more flexibility than these three options.
Sorry, "carve" is an overloaded word. I simply mean the context extraction in the third column. I'm not sure that would be useful, though.
It's the file format.
This will need better error handling and reporting. Probably the most likely use case is for users to point it at a large repository of .yar files, each containing one to many yara rules, and some of those will have syntax errors and not get included.
I just wanted to add some context here which might be useful. In Velociraptor we have a similar problem in that file access is abstracted via an accessor (e.g., we read the file via the NTFS parser, or from memory, etc.), but a lot of Yara rules rely on matches that are very far apart.
There does not seem to be a way to tell Yara "this is a reader object, please use that to read the file." The library either operates on a buffer or a file. In the case of a file, the library mmaps the entire file, which allows it to efficiently match strings that are very far apart.
We have resorted to a shortcut: when we scan files that are real, we delegate them to the library for mmap; otherwise, many rules that are expected to match don't.
I agree that a rule has to be written with the context of the scan in mind. People try to use rules intended for files on memory, and that fails in exactly the same way as described above.
But I think people don't really want to think about it; they just want to throw random rules at the sample and hope it works.
We have implemented a buffer size choice as well, for cases when we need to parse in buffers, but it's actually not that useful: people don't know their rules in advance and don't really understand what that choice means:
https://docs.velociraptor.app/vql_reference/parsers/yara/
Thank you, scudette. I agree that most users will want to point this at a directory of rules they've gotten from other sources and not explore buffer sizes, etc. The present implementation first scans in separate 4KB chunks and then does a second scan over the entire 16MB chunk (with an additional 1MB of shingle).
I am worried about offset(0) usage requiring alignment, hence 4KB. But you make a good point about requisite patterns being located relatively far apart. It's probably going to require some empiricism to get to the best approach.
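The two-pass idea above (small aligned chunks, plus large windows with a trailing overlap so matches spanning a boundary aren't missed) can be sketched as follows. This is a minimal sketch assuming shingle < window; the function name and parameters are illustrative, not bulk_extractor's actual configuration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: cover `len` bytes with windows of size `window`,
// each overlapping the previous by `shingle` bytes, so a pattern that
// straddles a window boundary still falls entirely inside some window.
std::vector<std::pair<size_t, size_t>> windows(size_t len, size_t window, size_t shingle) {
    std::vector<std::pair<size_t, size_t>> out;
    const size_t step = window - shingle;  // advance by window minus overlap
    for (size_t off = 0; off < len; off += step) {
        out.emplace_back(off, std::min(window, len - off));
        if (off + window >= len) break;    // last window reached the end
    }
    return out;
}
```

With the sizes mentioned above (16MB windows, 1MB shingle), consecutive windows would start 15MB apart and share their last/first 1MB.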
It may also be possible to record the raw pattern matches, even if rules don't match. This may be noisier (depending on the rules) but also could turn up something where a rule wouldn't have matched.
This is exactly the problem we encountered: some rules use the offset as a header check and then look for patterns deep in the file. This works when the file is mmapped but only works on the first buffer if read in buffered mode.
This is the behaviour that was confusing users the most.
Recording the individual hits is easy by replacing the condition clause with "any of them".
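The "any of them" trick can be illustrated with a pair of rules. The rule name, strings, and header check below are invented for illustration, not taken from any real ruleset.

```yara
// Original rule: strings plus a restrictive condition that may never
// fire on fragmented data (the MZ header check fails off offset 0).
rule suspicious_loader {
    strings:
        $a = "LoadLibraryA"
        $b = { 6A 40 68 00 30 00 00 }
    condition:
        uint16(0) == 0x5A4D and $a and $b
}

// Rewritten for raw-hit recording: same strings, condition relaxed to
// "any of them" so every individual pattern hit is reported.
rule suspicious_loader_hits {
    strings:
        $a = "LoadLibraryA"
        $b = { 6A 40 68 00 30 00 00 }
    condition:
        any of them
}
```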
We are starting to apply linting to Yara rules in order to remove low quality rules from large rule sets.
The problem is that many rules are broken, and if a user gives us a file with thousands of rules, we can pass the entire thing to the library to compile, but it basically just returns an error. We have no idea which rule is broken.
So now we actually lint individual rules and do some sanity checking as well:
https://docs.velociraptor.app/vql_reference/parsers/yara_lint/
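The per-rule linting idea could be sketched as a naive splitter that isolates each top-level rule block, so each one can be compiled (and rejected) on its own instead of failing the whole file. This is a hedged sketch, not Velociraptor's implementation: it ignores braces inside string literals and comments, which is exactly why a real linter uses an AST parser as described above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Naive sketch: split a .yar source into individual rule blocks by
// tracking brace depth. Nested braces (e.g. hex strings like { 4D 5A })
// are handled because a rule is emitted only when depth returns to zero.
std::vector<std::string> split_rules(const std::string& src) {
    std::vector<std::string> rules;
    std::string cur;
    int depth = 0;
    for (char c : src) {
        cur += c;
        if (c == '{') ++depth;
        if (c == '}' && --depth == 0) {
            // a top-level block just closed: emit it as one rule
            rules.push_back(cur);
            cur.clear();
        }
    }
    return rules;
}
```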
@jonstewart & @scudette — it seems like there are a lot of bad rules out there. Perhaps we should use AI to figure out the rule and then rewrite it?
This issue can shed more light on the present discussion: https://github.com/VirusTotal/yara-x/issues/470
The interesting thing here is that a rule may be broken depending on the mode in which the scanning occurs. If a rule refers to certain functions (like uintXX in the above discussion), it will be broken in buffer mode but work in file mode.
The above issue introduced a FileScan mode for the yara-x Go implementation to make such rules work. But this should also apply to the C++ binding, depending on the scanning mode.
Thanks, @scudette, that was an interesting and relevant read. If I can characterize your primary concern, it’s that you need to manage memory usage strictly on the user side—makes sense for an endpoint system! And then the secondary concern is, of course, performance.
The proclivity for authors to cargo-cult rules and for investigators to run large corpora of rules “just in case” makes yara an attractive nuisance. A whole bunch of regexps (fixed strings are also regexps; yes yes, I know Aho-Corasick…) get smashed together into one big ball of dismal logic that leaves a processor stalled on memory access and mispredicted branches. It might be a good idea to limit how many rules a user can run in Velociraptor in one hunt, as a forcing-function. It is kind of user-hostile, but avoids bad outcomes.
Clearly I need to finish my llama tool for forensics scanning along with its yara rule converter… The solution is a more-structured rule DSL and the ability to evaluate atomic features beyond brittle Boolean conditions.
Jon
The issue here is not specifically about regex but about yara functions which are evaluated on the data and how the data is itself managed.
Yara itself has the same issue, because the problem is not specific to Velociraptor or an endpoint scenario but rather about how to actually scan a file: ultimately we have to read the file in buffers and then apply the rule to each buffer. How large should the buffer be? What happens at the buffer's edge? And finally, how do we evaluate yara functions which require access to specific offsets in the file (e.g., uint16(offset) needs to read a fixed offset in the file)?
Yara X does not have a way to provide arbitrary data to its function evaluator, so it is impossible to evaluate uint16() on a buffer (because the buffer may be in the middle of a file). So Yara X just turns off these functions when scanning in buffer mode.
As a special case when scanning in "file" mode, YaraX can in fact read arbitrary offsets into the file so then, yes these functions do work.
My point above is that evaluating whether a rule is valid or not requires that we know how it is scanned as well. Scanning by buffers (which I would imagine a carver like BE will have to do) will break pretty much every rule that uses yara functions and modules (like pe, lnk, etc., which are disabled when scanning buffers).
This may be unexpected to users, but it does make sense: for example, how would a carver evaluate a rule like uint16(0)? There is no file header to speak of.
As an extension, for BE, you can throw away any and all rules that use yara functions; this is easy to do now because we have an AST parser. This can reduce the number of rules substantially and actually clarify matters a lot.
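The uint16(0) problem can be made concrete with a small sketch. Everything here is illustrative (the function name and the buf_start parameter are my invention, not yara-x's API): a condition like uint16(0) refers to a file-absolute offset, so a scanner holding only a mid-file buffer can evaluate it only if that offset happens to fall inside the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical evaluator for a YARA-style uint16(file_off) read against a
// buffer that begins at absolute file offset `buf_start`. Returns nullopt
// when the requested offset is not covered by the buffer, which is why
// such conditions are unevaluable in buffer mode.
std::optional<uint16_t> uint16_at(const std::vector<uint8_t>& buf,
                                  size_t buf_start, size_t file_off) {
    if (file_off < buf_start || file_off + 2 > buf_start + buf.size()) {
        return std::nullopt;  // offset not covered by this buffer
    }
    size_t rel = file_off - buf_start;
    // YARA's uintXX reads are little-endian
    return static_cast<uint16_t>(buf[rel] | (buf[rel + 1] << 8));
}
```

A carver handing the library a chunk that starts at, say, offset 4096 simply has no bytes for uint16(0), no matter how the API is shaped.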