
main: add option to ignore rule cache

Open mike-hunhoff opened this issue 1 year ago • 14 comments

capa's rule caching is great but not obvious. This caused a huge headache when debugging #1897, as the problem code was skipped entirely when capa used its local rule cache. I suggest we add a command-line option like --no-rule-cache to make it easier to disable the cache in situations like this. Otherwise, debugging code related to rule parsing requires finding (via the --debug option) and deleting the rule cache between executions.

mike-hunhoff avatar Dec 09 '23 00:12 mike-hunhoff

first off, i'm sorry that you were bitten by this! i can only imagine that was pretty annoying to waste time on.

i'm a little hesitant to add a new cli argument for this, since (ideally) no capa user would ever need to provide the flag. the cache detects changes to rule content but not to source code. the flag would only be relevant to capa developers who change capa logic (such as rule parsing).

could we instead disable the cache when running from source (e.g. when installed via pip install -e .) and/or when run with --debug? or, if in source mode, use a hash of the capa source to derive the cache key?

williballenthin avatar Dec 09 '23 06:12 williballenthin

This also got me before so the idea is good. I agree with Willi that another CLI argument should be avoided (plus I don't think I necessarily would remember it anyway). So, some automatic handling like also inspecting the hash of rule-related files sounds good.

mr-tz avatar Dec 09 '23 06:12 mr-tz

Maybe we could introduce a new environment variable (e.g. DISABLE_CAPA_CACHE=1) instead of the CLI argument?

@williballenthin's suggestion is also good. We could modify compute_cache_identifier to compute the cache ID not only based on the capa version and rules content, but also by including the hash of the source files.

This way, whenever the capa source code changes, the cache identifier will be different, and the existing cache will be invalidated. A new cache will be created the next time cache_ruleset is called. The only caveat (i.e., a performance hit) is that we have to read in the source files to compute their hash. What do you think? I can draft a PR to test this out.

fariss avatar May 28 '24 00:05 fariss

I'm not sure how to compute the set of file names that are used as source code, and I'm hesitant about getting bogged down figuring that out. If it's easy, then I'm ok exploring this a bit more.

I wonder if there's some way to interact with the Python interpreter's cache (pyc files) and derive the info that way.

Or could we use git status of the source repository?? Maybe this is simplest.

Anyways, I'm not sure this is the behavior that I want, since I may edit capa source dozens of times per day, and I don't think I want a new cache for each one. Maybe we could print a big red warning when the situation is detected?

williballenthin avatar May 28 '24 04:05 williballenthin

For source code, I was thinking of focusing on the *.py files.

Here is an example:

```python
import hashlib
from pathlib import Path
from typing import List

import capa.version

CacheIdentifier = str


def compute_cache_identifier(rule_content: List[bytes]) -> CacheIdentifier:
    hash = hashlib.sha256()

    # note that this changes with each release,
    # so cache identifiers will never collide across releases.
    version = capa.version.__version__

    hash.update(version.encode("utf-8"))
    hash.update(b"\x00")

    # add the hash of the source files;
    # sort the paths so the hash is deterministic across runs.
    source_dir = Path(__file__).parent.parent
    source_files = sorted(source_dir.rglob("*.py"))
    for source_file in source_files:
        source_content = source_file.read_bytes()
        hash.update(hashlib.sha256(source_content).digest())

    rule_hashes = sorted([hashlib.sha256(buf).hexdigest() for buf in rule_content])
    for rule_hash in rule_hashes:
        hash.update(rule_hash.encode("ascii"))
        hash.update(b"\x00")

    return hash.hexdigest()
```

I believe this will introduce unnecessary overhead each time a user edits a file and re-runs capa; the extra file reads will be noticeable.

Or could we use git status of the source repository?? Maybe this is simplest.

git sounds like a good way to track changes, just unsure about how practical it is.

Anyways, I'm not sure this is the behavior that I want, since I may edit capa source dozens of times per day, and I don't think I want a new cache for each one. Maybe we could print a big red warning when the situation is detected?

We can. We just need to compute the hash using one of the aforementioned methods and alert. Users can then choose to ignore the warning, and generate the cache on-demand when needed.

fariss avatar May 29 '24 02:05 fariss

git sounds like a good way to track changes, just unsure about how practical it is.

I understand the case we're trying to handle is that devs change source code in a way that should invalidate the rules cache, and it confuses them when it doesn't. So we can assume this scenario involves a dev, and therefore that git is present. Furthermore, we can rely on git to report which tracked files have been modified, and only hash those.

This avoids the problem of inadvertently including irrelevant files in the hash.
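A minimal sketch of that idea (hypothetical helper names, not capa's actual API; assumes git is on PATH and the repository root is known):

```python
import hashlib
import subprocess
from pathlib import Path


def modified_tracked_files(repo_root: Path) -> list[str]:
    # `git ls-files --modified` lists only tracked files whose working-tree
    # content differs from the index, so untracked/irrelevant files never
    # enter the hash.
    out = subprocess.check_output(
        ["git", "ls-files", "--modified"], cwd=repo_root, text=True
    )
    return sorted(line for line in out.splitlines() if line)


def hash_files(repo_root: Path, relative_paths: list[str]) -> str:
    # hash both the path and the content, so renames and edits each
    # change the resulting key; skip paths that no longer exist.
    h = hashlib.sha256()
    for rel in relative_paths:
        path = repo_root / rel
        h.update(rel.encode("utf-8"))
        h.update(b"\x00")
        if path.exists():
            h.update(path.read_bytes())
        h.update(b"\x00")
    return h.hexdigest()
```

The result of `hash_files(root, modified_tracked_files(root))` could then be mixed into the cache identifier alongside the rule hashes.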

williballenthin avatar May 29 '24 07:05 williballenthin

See https://github.com/mandiant/capa-rules/blob/master/.github/scripts/create_releases.py for an example usage of git in one of our scripts.

mr-tz avatar May 29 '24 09:05 mr-tz

I find this command to be well suited to our needs:

```
$ git ls-files --deleted --modified --exclude-standard --full-name --deduplicate -v
R removed.txt                                   <- file was removed (rm removed.txt)
R renamed.txt                                   <- file was renamed (mv tracked.txt renamed.txt)
C capa/rules/cache.py                           <- file was modified (vim cache.py)
```
Then we can filter out the deletions (marked as R). This will leave us with tracked and modified files only.
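The filtering step could look like this (a sketch with a hypothetical function name; each `git ls-files -v` line starts with a one-letter status tag followed by a space):

```python
def changed_files(ls_files_output: str) -> list[str]:
    # keep only 'C' (changed/modified) entries; drop 'R' (removed),
    # since deleted files can no longer be read and hashed.
    results = []
    for line in ls_files_output.splitlines():
        if not line:
            continue
        tag, _, path = line.partition(" ")
        if tag == "C":
            results.append(path)
    return results
```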

fariss avatar Jun 04 '24 00:06 fariss

Looks great!

We'll also want to incorporate the git commit hash.
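Combining that with the rest could be sketched as a pure function (hypothetical name, not capa's actual compute_cache_identifier; the commit hash would come from `git rev-parse HEAD`):

```python
import hashlib


def cache_identifier(version: str, commit: str, rule_hashes: list[str]) -> str:
    # mix the capa version, the git commit hash, and the sorted rule
    # hashes into one key; changing any component invalidates the cache.
    h = hashlib.sha256()
    for part in [version, commit, *sorted(rule_hashes)]:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()
```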

This is shaping up well.

williballenthin avatar Jun 04 '24 04:06 williballenthin

We'll also want to incorporate the git commit hash.

Yeah, this should help track committed changes which may affect rules / the cache.

mr-tz avatar Jun 04 '24 09:06 mr-tz

As another alternative, we could compare the timestamp of capa/rules/cache.py against the most recent cache file and print a warning that a stale cache may cause unexpected behavior. We should keep this simple and minimally intrusive.
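That check could be as small as this (a sketch with a hypothetical function name; assumes we can locate the newest cache file):

```python
from pathlib import Path


def cache_may_be_stale(source_file: Path, cache_file: Path) -> bool:
    # if the rule-cache source was modified after the newest cache file
    # was written, that cache may have been produced by older logic.
    if not cache_file.exists():
        return False
    return source_file.stat().st_mtime > cache_file.stat().st_mtime
```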

mr-tz avatar Jun 04 '24 16:06 mr-tz

Regardless of which solution we choose, it appears that we'll still need to introduce a CLI argument, environment variable, etc. to control when it runs. Otherwise, we'd be introducing overhead to all future invocations of capa just to handle the small use case where a developer doesn't want the cache to confuse their development.

mike-hunhoff avatar Jun 04 '24 17:06 mike-hunhoff

Otherwise, we'll be introducing overhead to all future invocations of capa just to handle a small use case where a developer may not want the cache to confuse their development?

I think we can guess that we might be running in a dev environment very quickly:

  • not PyInstaller, which is by far the most common way to invoke capa, and
  • checking that capa.main.__file__ doesn't contain site-packages, which should be very quick. (this might take some real-world testing, but i think the idea will work)

If these pass, we can then look for the .git directory (slightly slower) and then do the strategies already discussed (which will be fairly slow, but still only like 0.25s or so).

Therefore, I think it may still be possible to enable this for all runs, assuming we order the checks correctly.
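Those first two checks could be sketched as a pure helper (hypothetical name; at runtime the arguments would come from capa.main.__file__ and getattr(sys, "frozen", False)):

```python
def looks_like_dev_environment(module_file: str, frozen: bool) -> bool:
    # PyInstaller bundles set sys.frozen, and a bundled capa is a release
    # build, so its rule cache can be trusted.
    if frozen:
        return False
    # an installed (non-editable) capa lives under site-packages;
    # a source checkout does not.
    return "site-packages" not in module_file
```

Only when this returns True would the slower checks (looking for .git, asking git for modified files) need to run.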

williballenthin avatar Jun 04 '24 18:06 williballenthin

Please note that the auto-cache-generation approach will leave users with a lot of stale cache files in the cache directory.

fariss avatar Jun 05 '24 21:06 fariss