flare-floss
build a database of known junk code strings
if our code recovery solution (lancelot or vivisect) fails to identify some code, then we may still display some junk strings that are actually instructions, like:

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .text ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃
It<Iu4P 000099c3 ┃┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .rdata ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃
```
it's likely that this particular instruction sequence is not difficult to recover in general; rather, in this one sample the code analyzer lost track of a function. therefore, if we recovered code ranges across a large number of programs and matched those up with the strings extracted from the same files, we could build a database of strings that are likely instruction sequences. we could use this database as a fallback to further filter junk strings after the code recovery pass.
potential strategy: use lancelot (or similar) to recover code ranges. mask out all the non-code bytes in the input file. then run strings. any string that is emitted is a junk code string. aggregate, count, and index like normal.
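a minimal sketch of that masking step, assuming code ranges are already available as (start, end) file offsets from lancelot or another recovery tool; the corpus iterator at the bottom is a placeholder, not a real API:

```python
import re
from collections import Counter
from typing import Iterable, Tuple

MIN_LEN = 4
# printable ASCII runs of at least MIN_LEN characters, like the classic strings tool
ASCII_STRING = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)


def junk_code_strings(buf: bytes, code_ranges: Iterable[Tuple[int, int]]) -> Iterable[bytes]:
    # keep only the bytes inside recovered code ranges and zero everything else,
    # so any string found below is known to decode to instructions
    masked = bytearray(len(buf))
    for start, end in code_ranges:
        masked[start:end] = buf[start:end]

    for match in ASCII_STRING.finditer(bytes(masked)):
        yield match.group()


# aggregate, count, and index across a corpus;
# `corpus` is a placeholder for (path, code_ranges) pairs produced by the recovery tool
corpus = []
db = Counter()
for path, code_ranges in corpus:
    with open(path, "rb") as f:
        db.update(junk_code_strings(f.read(), code_ranges))
```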
then, evaluate this database against what the code recovery solution actually produces for input files: does it even do a better job than running a disassembly analysis of the input file directly?
What about something like https://github.com/ergrelet/windiff, but only with generic strings?
interesting. do you mean browsing strings from windows binaries across versions? or something else?
Yes, you could then also look up when the string occurred for the first time.
Attached are two JSONL files containing strings
- from the `.text` sections of thousands of C:\Windows native binaries
- occurring 100 times or more

While not perfect, it's an easy approximation (with some obvious FPs).
The files are split into strings of length
- 4-5 characters and
- 6 or more characters
This is only due to an extraction approach I took earlier.
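To use these attachments as a lookup database, one could merge them into a single set. This is only a sketch; the file names and the `string`/`count` field names are assumptions about the attachment format, not its actual schema:

```python
import json


def load_junk_strings(paths, min_count=100):
    """merge the attached JSONL files into one lookup set.

    assumes each line is an object like {"string": "...", "count": 123};
    the real attachments may use different field names.
    """
    junk = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                if record["count"] >= min_count:
                    junk.add(record["string"])
    return junk


# hypothetical file names for the two attachments (4-5 chars and 6+ chars)
junk_db = load_junk_strings(["text-strings-len-4-5.jsonl", "text-strings-len-6-plus.jsonl"])
```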
163ca35 adds a junk code strings database and applies the tag #code-junk for now to compare. Not perfect, but can still help:
Note that the minimum string length here is 4 instead of 6.
Above and below use sample 480ca51ba24be6f3ad72ce5282b28783; below is with min_len = 6. Using a wider set of samples (VT etc.) hopefully provides better results.
Before
After
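For reference, applying the database as a fallback filter could look roughly like this; the function name and tag plumbing are illustrative only, not FLOSS's actual internals:

```python
def tag_junk_code_string(s, junk_db):
    """attach the #code-junk tag when a decoded string is a known junk code string.

    callers can then hide or de-prioritize tagged strings in the output.
    the length check mirrors the database's minimum string length of 4
    mentioned above.
    """
    tags = []
    if len(s) >= 4 and s in junk_db:
        tags.append("#code-junk")
    return tags
```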