flare-floss
build a database of known junk code strings
if our code recovery solution (lancelot or vivisect) fails to identify some code, then we may still display some junk strings that are actually instructions, like:

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .text ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃
It<Iu4P 000099c3 ┃┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .rdata ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃
```
it's likely that this particular instruction sequence is not difficult to recover in general; rather, in this one sample the code analyzer lost track of a function. therefore, if we recovered code ranges across a large number of programs and matched those up with the strings extracted from the same files, we could build a database of strings that are likely instruction sequences. we could use this database as a fallback to further filter junk strings after the code recovery pass.
potential strategy: use lancelot (or similar) to recover code ranges. mask out all the non-code bytes in the input file. then run strings. any string that is emitted is a junk code string. aggregate, count, and index like normal.
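a minimal sketch of that masking step, assuming code ranges are already available as (start, end) file offsets from lancelot or another recovery tool; the corpus iterator at the bottom is a placeholder, not a real API:

```python
import re
from collections import Counter
from typing import Iterable, Tuple

MIN_LEN = 4
# printable ASCII runs of at least MIN_LEN characters, like the classic strings tool
ASCII_STRING = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)


def junk_code_strings(buf: bytes, code_ranges: Iterable[Tuple[int, int]]) -> Iterable[bytes]:
    # keep only the bytes inside recovered code ranges and zero everything else,
    # so any string found below is known to decode to instructions
    masked = bytearray(len(buf))
    for start, end in code_ranges:
        masked[start:end] = buf[start:end]

    for match in ASCII_STRING.finditer(bytes(masked)):
        yield match.group()


# aggregate, count, and index across a corpus;
# `corpus` is a placeholder for (path, code_ranges) pairs produced by the recovery tool
corpus = []
db = Counter()
for path, code_ranges in corpus:
    with open(path, "rb") as f:
        db.update(junk_code_strings(f.read(), code_ranges))
```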
then, evaluate this database against what the code recovery solution actually produces for input files: does it even do a better job than running a disassembly analysis of the input file directly?
What about something like https://github.com/ergrelet/windiff, but only with generic strings?
interesting. do you mean browsing strings from windows binaries across versions? or something else?
Yes, you could then also look up when the string occurred for the first time.
Attached are two JSONL files containing strings
- from the `.text` sections of thousands of C:\Windows native binaries
- occurring 100 times or more

While not perfect, it's an easy approximation (with some obvious FPs).
The files are split into strings of length
- 4-5 characters and
- 6 or more characters
This is only due to an extraction approach I took earlier.
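To use these attachments as a lookup database, one could merge them into a single set. This is only a sketch; the file names and the `string`/`count` field names are assumptions about the attachment format, not its actual schema:

```python
import json


def load_junk_strings(paths, min_count=100):
    """merge the attached JSONL files into one lookup set.

    assumes each line is an object like {"string": "...", "count": 123};
    the real attachments may use different field names.
    """
    junk = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                if record["count"] >= min_count:
                    junk.add(record["string"])
    return junk


# hypothetical file names for the two attachments (4-5 chars and 6+ chars)
junk_db = load_junk_strings(["text-strings-len-4-5.jsonl", "text-strings-len-6-plus.jsonl"])
```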
163ca35 adds a junk code strings database and applies the tag #code-junk for now to compare. Not perfect, but can still help:
Note that the minimum string length here is 4 instead of 6.
Above and below use sample 480ca51ba24be6f3ad72ce5282b28783; below is with min_len = 6. Using a wider set of samples (VT etc.) hopefully provides better results.
Before
After
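For reference, applying the database as a fallback filter could look roughly like this; the function name and tag plumbing are illustrative only, not FLOSS's actual internals:

```python
def tag_junk_code_string(s, junk_db):
    """attach the #code-junk tag when a decoded string is a known junk code string.

    callers can then hide or de-prioritize tagged strings in the output.
    the length check mirrors the database's minimum string length of 4
    mentioned above.
    """
    tags = []
    if len(s) >= 4 and s in junk_db:
        tags.append("#code-junk")
    return tags
```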