capa fixtures: add function to resolve sample shortened name by MD5

Ok, so it looks like we need the opposite of `fixtures.get_sample_md5_by_name`, e.g. `fixtures.get_sample_short_name_by_md5` or the like. Let's leave this code as-is for now and I'll open a separate issue to update the fixtures.

Originally posted by @mike-hunhoff in https://github.com/mandiant/capa/pull/1727#discussion_r1300208729

Aug 21 '23 18:08 mike-hunhoff

we have so much scaffold code that does these lookups in fixtures. it's kinda a lot to maintain and also repetitive. i've been thinking maybe we should tear a lot of that down and provide some generic routines that use os.walk to find the requested file.

thoughts?
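
A minimal sketch of such a generic routine, assuming test files can be matched by a file name prefix. `find_testfile` and its arguments are hypothetical names, not existing capa code:

```python
import os
from pathlib import Path


def find_testfile(root: Path, name: str) -> Path:
    """Return the first file under root whose file name starts with name.

    Hypothetical helper: both the name and the prefix-matching rule are
    assumptions made for illustration.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort in place so the traversal order is deterministic.
        dirnames.sort()
        for filename in sorted(filenames):
            if filename.startswith(name):
                return Path(dirpath) / filename
    raise ValueError(f"test file not found: {name}")
```

This walks the tree on every call, so it trades speed for not having to maintain any hardcoded table.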

Aug 21 '23 20:08 williballenthin

yeah, there's a bit of potential for improvements in the test fixtures

Aug 22 '23 07:08 mr-tz

Hey, I would like to work on this issue.

Sep 25 '23 16:09 0xAtharv

@0xAtharv sounds great!

our tests should be able to refer to their test files (found in capa-testfiles repo) by md5 hash, file name, or full path. please review the code we currently have in fixtures.py and propose how to simplify it, possibly with a set of functions like get_testfile_by_name/hash/path and get_testfile/workspace/hashes/etc_by_*. i think it's ok for fixtures.py to walk the test files directory once upon startup to collect the file names/paths/hashes that are present. today we hardcode this but i think it's annoying.
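
a rough sketch of the walk-once idea, not a definitive implementation. `SampleIndex` and its attributes are hypothetical; it assumes the test files directory is small enough to hash in full at startup and that file names are unique:

```python
import hashlib
from pathlib import Path


class SampleIndex:
    """Hypothetical index of test files, built by walking a directory once."""

    def __init__(self, root: Path):
        self.path_by_md5 = {}
        self.path_by_name = {}
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            # Hash each file once so later lookups are plain dict accesses.
            md5 = hashlib.md5(path.read_bytes()).hexdigest()
            self.path_by_md5[md5] = path
            self.path_by_name[path.name] = path

    def get_testfile_by_hash(self, md5: str) -> Path:
        try:
            return self.path_by_md5[md5.lower()]
        except KeyError:
            raise ValueError(f"unexpected md5 hash: {md5}") from None

    def get_testfile_by_name(self, name: str) -> Path:
        try:
            return self.path_by_name[name]
        except KeyError:
            raise ValueError(f"unexpected sample name: {name}") from None
```

if two files share a name, the later one in sorted order silently wins here; a real version would probably want to detect that and fail.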

Sep 27 '23 14:09 williballenthin

we can look at all available files in the capa-testfiles repo, add them to a list, and create functions that use this list to find the file path:

get_testfile_by_md5()
get_testfile_by_name()
get_md5_by_name()

am i missing anything? @williballenthin

Sep 27 '23 16:09 0xAtharv

yup that sounds reasonable. make a first attempt and then let's review together and try to migrate a few of the tests.

Sep 27 '23 17:09 williballenthin

Hi @mr-tz and @williballenthin, I am interested in contributing to this issue but am having some trouble understanding what is required. I have read the documentation for pytest.fixture but don't fully understand its use here - where/how are the wrapper functions decorated with pytest.fixture invoked? (I searched for a couple of the wrapper functions elsewhere in the capa repository but couldn't find them being used.) What is the pytest.fixture decorator doing to/for the wrapped functions, and how does pytest.fixture affect the functions' invocations? Also, are the goals of this issue: 1) to reduce the number of pytest.fixture functions, and 2) to reduce the hardcoding in get_sample_md5_by_name and get_data_path_by_name? Are there other goals?

Nov 16 '23 09:11 aaronatp

In fixtures.py we have functions like get_data_path_by_name() or get_sample_md5_by_name(). These use hard-coded values which we would like to improve. These are used by the *_extractor() functions which are used by the tests.

The goals look good, plus see the initial comment to add get_sample_short_name_by_md5 or similar.

One idea mentioned above is to enumerate all test files and generate the data once (similar to collect_samples in lint.py, for example).

Nov 20 '23 08:11 mr-tz

If no one is assigned to this issue, I'd like to draft a PR for it.

Here is a potential candidate for get_sample_short_name_by_md5.

def get_sample_short_name_by_md5(md5) -> str:
   if md5 == "5f66b82558ca92e54e77f216ef4c066c":
       return "mimikatz"
   elif md5 == "e80758cf485db142fca1ee03a34ead05":
       return "kernel32"
   elif md5 == "a8565440629ac87f6fef7d588fe3ff0f":
       return "kernel32-64"
   elif md5 == "56bed8249e7c2982a90e54e1e55391a2":
       return "pma12-04"
   elif md5 == "7faafc7e4a5c736ebfee6abbbc812d80":
       return "pma16-01"
   elif md5 == "290934c61de9176ad682ffdd65f0a669":
       return "pma01-01"
   elif md5 == "c8403fb05244e23a7931c766409b5e22":
       return "pma21-01"
   elif md5 == "db648cd247281954344f1d810c6fd590":
       return "al-khaser x86"
   elif md5 == "3cb21ae76ff3da4b7e02d77ff76e82be":
       return "al-khaser x64"
   elif md5 == "b7841b9d5dc1f511a93cc7576672ec0c":
       return "39c05"
   elif md5 == "499c2a85f6e8142c3f48d4251c9c7cd6":
       return "499c2"
   elif md5 == "9324d1a8ae37a36ae560c37448c9705a":
       return "9324d"
   elif md5 == "a198216798ca38f280dc413f8c57f2c2":
       return "a1982"
   elif md5 == "a933a1a402775cfa94b6bee0963f4b46":
       return "a933a"
   elif md5 == "bfb9b5391a13d0afd787e87ab90f14f5":
       return "bfb9b"
   elif md5 == "c91887d861d9bd4a5872249b641bc9f9":
       return "c9188"
   elif md5 == "64d9f7d96b99467f36e22fada623c3bb":
       return "64d9f"
   elif md5 == "82bf6347acf15e5d883715dc289d8a2b":
       return "82bf6"
   elif md5 == "773290480d5445f11d3dc1b800728966":
       return "77329"
   elif md5 == "56a6ffe6a02941028cc8235204eef31d":
       # file name is SHA256 hash
       return "3b13b"
   elif md5 == "7351f8a40c5450557b24622417fc478d":
       return "7351f"
   elif md5 == "79abd17391adc6251ecdc58d13d76baf":
       return "79abd"
   elif md5 == "946a99f36a46d335dec080d9a4371940":
       return "946a9"
   elif md5 == "b9f5bd514485fb06da39beff051b9fdc":
       return "b9f5b"
   elif md5 == "3db3e55b16a7b1b1afb970d5e77c5d98":
       # file name is SHA256 hash
       return "294b8d"
   elif md5 == "2bf18d0403677378adad9001b1243211":
       return "2bf18d"
   elif md5 == "76fa734236daa023444dec26863401dc":
       # file name is SHA256 hash
       return "ea2876"
   else:
       raise ValueError(f"unexpected md5 hash: {md5}")
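
The same lookup could also be derived from a single name-to-MD5 table, so the two directions cannot drift apart. This is only a sketch: `MD5_BY_SHORT_NAME` and `SHORT_NAME_BY_MD5` are hypothetical names, and the table holds just three of the pairs listed above.

```python
# Hypothetical table; only a subset of the pairs from the function above.
MD5_BY_SHORT_NAME = {
    "mimikatz": "5f66b82558ca92e54e77f216ef4c066c",
    "kernel32": "e80758cf485db142fca1ee03a34ead05",
    "pma16-01": "7faafc7e4a5c736ebfee6abbbc812d80",
}

# Invert the table once at import time instead of writing a second chain.
SHORT_NAME_BY_MD5 = {md5: name for name, md5 in MD5_BY_SHORT_NAME.items()}


def get_sample_short_name_by_md5(md5: str) -> str:
    try:
        return SHORT_NAME_BY_MD5[md5]
    except KeyError:
        raise ValueError(f"unexpected md5 hash: {md5}") from None
```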

Let me know what you think.

Mar 24 '24 00:03 fariss

@s-ff

i think it's ok for fixtures.py to walk the test files directory once upon startup to collect the file names/paths/hashes that are present. today we hardcode this but i think it's annoying.

Mar 24 '24 06:03 williballenthin

note that this is a lower priority issue, if you have a quick fix though, go ahead :)

Jun 04 '24 16:06 mr-tz
