capa
fixtures: add function to resolve sample shortened name by MD5
Ok so it looks like we need the opposite of `fixtures.get_sample_md5_by_name` e.g. `fixtures.get_sample_short_name_by_md5` or the like. Let's leave this code as-is for now and I'll open a separate issue to update the fixtures.
Originally posted by @mike-hunhoff in https://github.com/mandiant/capa/pull/1727#discussion_r1300208729
we have so much scaffold code that does these lookups in fixtures. it's kinda a lot to maintain and also repetitive. i've been thinking maybe we should tear a lot of that down and provide some generic routines that use os.walk to find the requested file.
thoughts?
yeah, there's a bit of potential for improvements in the test fixtures
Hey, I would like to work on this issue.
@0xAtharv sounds great!
our tests should be able to refer to their test files (found in capa-testfiles repo) by md5 hash, file name, or full path. please review the code we currently have in fixtures.py and propose how to simplify it, possibly with a set of functions like `get_testfile_by_name/hash/path` and `get_testfile/workspace/hashes/etc_by_*`. i think it's ok for fixtures.py to walk the test files directory once upon startup to collect the file names/paths/hashes that are present. today we hardcode this but i think it's annoying.
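A minimal sketch of that one-time walk, assuming the test files live under a single root directory. The names `SampleEntry` and `index_testfiles` are hypothetical and do not exist in fixtures.py today; skipping `.git` is also an assumption about how the checkout is laid out.

```python
import hashlib
import os
from pathlib import Path
from typing import List, NamedTuple


class SampleEntry(NamedTuple):
    """One test file: where it is, what it is called, and its MD5 (hypothetical type)."""

    path: Path
    name: str
    md5: str


def index_testfiles(root: Path) -> List[SampleEntry]:
    """Walk `root` once and record path, file name, and MD5 for every file found."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in sorted(filenames):
            path = Path(dirpath) / filename
            md5 = hashlib.md5(path.read_bytes()).hexdigest()
            entries.append(SampleEntry(path=path, name=filename, md5=md5))
    return entries
```

Hashing every file costs one full read of the test corpus per test session, so the result would likely be computed once at import time or cached.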
we can have a look at all available files in the capa-testfiles repo and add them to a list,
and we can create functions which will just use this list to find the file path:
- get_testfile_by_md5()
- get_testfile_by_name()
- get_md5_by_name()
am i missing anything? @williballenthin
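A rough sketch of how the three lookups could sit on top of such a list. The `SampleRecord` type and the two entries in `SAMPLES` are placeholders for illustration only (not real test files); in practice the list would be collected from the capa-testfiles checkout.

```python
from pathlib import Path
from typing import List, NamedTuple


class SampleRecord(NamedTuple):
    """Hypothetical record type: file name, MD5 hex digest, and path."""

    name: str
    md5: str
    path: Path


# Placeholder data, not real samples.
SAMPLES: List[SampleRecord] = [
    SampleRecord("example-a.exe_", "0" * 32, Path("data/example-a.exe_")),
    SampleRecord("example-b.dll_", "1" * 32, Path("data/example-b.dll_")),
]

# Build both indexes once so each lookup is a single dict access.
_BY_NAME = {s.name: s for s in SAMPLES}
_BY_MD5 = {s.md5: s for s in SAMPLES}


def get_testfile_by_md5(md5: str) -> Path:
    try:
        return _BY_MD5[md5.lower()].path
    except KeyError:
        raise ValueError(f"unexpected md5 hash: {md5}") from None


def get_testfile_by_name(name: str) -> Path:
    try:
        return _BY_NAME[name].path
    except KeyError:
        raise ValueError(f"unexpected sample name: {name}") from None


def get_md5_by_name(name: str) -> str:
    try:
        return _BY_NAME[name].md5
    except KeyError:
        raise ValueError(f"unexpected sample name: {name}") from None
```

Raising `ValueError` for unknown keys mirrors the error style of the existing hard-coded helpers, so callers would not need to change their error handling.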
yup that sounds reasonable. make a first attempt and then let's review together and try to migrate a few of the tests.
Hi @mr-tz and @williballenthin, I am interested in contributing to this issue but am having some trouble understanding what is required. I have read the documentation for pytest.fixture but don't fully understand its use here - where/how are the wrapper functions decorated with pytest.fixture invoked? (I searched for a couple of the wrapper functions elsewhere in the capa repository but couldn't find them being used.) What is the pytest.fixture decorator doing to/for the wrapped functions, and how does pytest.fixture affect the functions' invocations? Also, are the goals of this issue: 1) to reduce the number of pytest.fixture functions, and 2) to reduce the hardcoding in `get_sample_md5_by_name` and `get_data_path_by_name`? Are there other goals?
In `fixtures.py` we have functions like `get_data_path_by_name()` or `get_sample_md5_by_name()`. These use hard-coded values which we would like to improve. These are used by the `*_extractor()` functions which are used by the tests.
The goals look good, plus see the initial comment to add `get_sample_short_name_by_md5` or similar.
One idea mentioned above is to enumerate all test files and generate the data once (similar to `collect_samples` in `lint.py`, for example).
If no one is assigned to this issue, I'd like to draft a PR for it.
Here is a potential candidate for get_sample_short_name_by_md5.
def get_sample_short_name_by_md5(md5) -> str:
    if md5 == "5f66b82558ca92e54e77f216ef4c066c":
        return "mimikatz"
    elif md5 == "e80758cf485db142fca1ee03a34ead05":
        return "kernel32"
    elif md5 == "a8565440629ac87f6fef7d588fe3ff0f":
        return "kernel32-64"
    elif md5 == "56bed8249e7c2982a90e54e1e55391a2":
        return "pma12-04"
    elif md5 == "7faafc7e4a5c736ebfee6abbbc812d80":
        return "pma16-01"
    elif md5 == "290934c61de9176ad682ffdd65f0a669":
        return "pma01-01"
    elif md5 == "c8403fb05244e23a7931c766409b5e22":
        return "pma21-01"
    elif md5 == "db648cd247281954344f1d810c6fd590":
        return "al-khaser x86"
    elif md5 == "3cb21ae76ff3da4b7e02d77ff76e82be":
        return "al-khaser x64"
    elif md5 == "b7841b9d5dc1f511a93cc7576672ec0c":
        return "39c05"
    elif md5 == "499c2a85f6e8142c3f48d4251c9c7cd6":
        return "499c2"
    elif md5 == "9324d1a8ae37a36ae560c37448c9705a":
        return "9324d"
    elif md5 == "a198216798ca38f280dc413f8c57f2c2":
        return "a1982"
    elif md5 == "a933a1a402775cfa94b6bee0963f4b46":
        return "a933a"
    elif md5 == "bfb9b5391a13d0afd787e87ab90f14f5":
        return "bfb9b"
    elif md5 == "c91887d861d9bd4a5872249b641bc9f9":
        return "c9188"
    elif md5 == "64d9f7d96b99467f36e22fada623c3bb":
        return "64d9f"
    elif md5 == "82bf6347acf15e5d883715dc289d8a2b":
        return "82bf6"
    elif md5 == "773290480d5445f11d3dc1b800728966":
        return "77329"
    elif md5 == "56a6ffe6a02941028cc8235204eef31d":
        # file name is SHA256 hash
        return "3b13b"
    elif md5 == "7351f8a40c5450557b24622417fc478d":
        return "7351f"
    elif md5 == "79abd17391adc6251ecdc58d13d76baf":
        return "79abd"
    elif md5 == "946a99f36a46d335dec080d9a4371940":
        return "946a9"
    elif md5 == "b9f5bd514485fb06da39beff051b9fdc":
        return "b9f5b"
    elif md5 == "3db3e55b16a7b1b1afb970d5e77c5d98":
        # file name is SHA256 hash
        return "294b8d"
    elif md5 == "2bf18d0403677378adad9001b1243211":
        return "2bf18d"
    elif md5 == "76fa734236daa023444dec26863401dc":
        # file name is SHA256 hash
        return "ea2876"
    else:
        raise ValueError(f"unexpected md5 hash: {md5}")
Let me know what you think.
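For comparison, the same mapping could be written as a lookup table, which would also make the reverse direction a one-liner. This is only a sketch: it copies just the first three entries from the candidate above, and `get_md5_by_short_name` is a hypothetical name, not an existing fixtures function.

```python
# Only the first three entries of the candidate function are reproduced here.
MD5_TO_SHORT_NAME = {
    "5f66b82558ca92e54e77f216ef4c066c": "mimikatz",
    "e80758cf485db142fca1ee03a34ead05": "kernel32",
    "a8565440629ac87f6fef7d588fe3ff0f": "kernel32-64",
}

# Reverse table derived from the same data, so the two directions cannot drift apart.
SHORT_NAME_TO_MD5 = {name: md5 for md5, name in MD5_TO_SHORT_NAME.items()}


def get_sample_short_name_by_md5(md5: str) -> str:
    try:
        return MD5_TO_SHORT_NAME[md5]
    except KeyError:
        raise ValueError(f"unexpected md5 hash: {md5}") from None


def get_md5_by_short_name(name: str) -> str:
    """Hypothetical reverse lookup over the same table."""
    try:
        return SHORT_NAME_TO_MD5[name]
    except KeyError:
        raise ValueError(f"unexpected sample name: {name}") from None
```

A table still hard-codes the values, so it addresses the repetition but not the maintenance concern raised earlier in the thread.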
@s-ff
> i think it's ok for fixtures.py to walk the test files directory once upon startup to collect the file names/paths/hashes that are present. today we hardcode this but i think it's annoying.
note that this is a lower priority issue, if you have a quick fix though, go ahead :)