The Magnificent AutoBlob(tm): Use heuristics, magic tricks, binwalk, etc. to automate blob loading

Do you work a lot with blobs? I sure do, and I can't stand what a pain it is to use the Blob backend.
There's been a lot of work out there on automatically figuring out what your blob is. We can and should leverage it, to make this a mostly-automatic process.

But what if your blob is not a blob? What if your blob is in fact a collection of binaries? (e.g., a filesystem image, or archive) Tools exist for this too, and we can use them to more accurately construct the environment of the target.

This is a tracking issue for the effort to implement this functionality as a CLE backend, so that we can handle as much as possible automagically. Short-term goals include support for "clean" blobs, that is, intact blobs ready to flash to a microcontroller. Longer-term, we'd like to support nasty stuff, like code extracted from real systems in a not-very-precise way, code ripped right out of a Windows flash tool binary, and so on. Other high-level goals include support for filesystem images, where the environment is pre-constrained based on the libraries, etc. that exist inside.

Want to help? Working on this doesn't require much/any knowledge about the rest of the angr family of libraries, so it's great for those newer to the project. Read on...

Currently, CLE will try all possible backends until it finds one that works (or fails, and tells you to go try the Blob loader manually). AutoBlob should be able to more or less guess the parameters that would be needed to use Blob, or some other backend, and load the binary for you if it can.
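To make the fallback order concrete, here is a minimal sketch of "try each backend in turn, with a catch-all at the end". The classes and the pick_backend() helper are hypothetical stand-ins for illustration; they are not CLE's real backends or its registration code.

```python
# Hypothetical sketch: none of these classes are CLE's real backends.
class ElfLike:
    name = "elf"

    @staticmethod
    def is_compatible(data: bytes) -> bool:
        # A format-specific backend checks for its own magic bytes.
        return data[:4] == b"\x7fELF"


class AutoBlobLike:
    name = "autoblob"

    @staticmethod
    def is_compatible(data: bytes) -> bool:
        # A catch-all: accept anything non-empty and guess parameters later.
        return len(data) > 0


# An ordered list makes "always try this one last" explicit.
BACKENDS = [ElfLike, AutoBlobLike]


def pick_backend(data: bytes):
    """Return the first backend whose is_compatible() accepts the data."""
    for backend in BACKENDS:
        if backend.is_compatible(data):
            return backend
    raise ValueError("no backend accepted this file; try Blob manually")
```

Because the catch-all accepts nearly everything, its position in the list matters: placed first, it would shadow every format-specific backend.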

While many backends load all kinds of fancy metadata, symbols, and what have you, you always need at least these three things around, in order to load a binary:

Architecture: Obviously you need to know what arch and endness the binary is. The more specific we can get, the better. (note: some day we will have better embedded architecture support, and being "more specific" to a sub-architecture will be actually important)
Base address: Where in memory does this blob belong? (but if we are sure it won't matter, we can make one up)
Entry point: What should angr consider to be the beginning of the program?
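As a small illustration of those three parameters, the sketch below bundles them and turns them into a loader options dict. BlobParams and as_main_opts() are hypothetical helpers, the addresses are made up, and the option names ("backend", "arch", "base_addr", "entry_point") are what I understand recent CLE releases to accept for the Blob backend; they may differ between versions, so check yours.

```python
from dataclasses import dataclass


@dataclass
class BlobParams:
    """The minimum needed to load a raw blob (hypothetical helper)."""
    arch: str         # architecture name, including endness, e.g. "ARMEL"
    base_addr: int    # where the blob lives in memory
    entry_point: int  # where execution should start

    def as_main_opts(self) -> dict:
        # Option names are an assumption about the Blob backend's options.
        return {
            "backend": "blob",
            "arch": self.arch,
            "base_addr": self.base_addr,
            "entry_point": self.entry_point,
        }


# Made-up example values for a little-endian ARM blob.
params = BlobParams(arch="ARMEL", base_addr=0x08000000, entry_point=0x08000101)
main_opts = params.as_main_opts()

# With angr installed, the dict would be passed roughly like this:
#   angr.Project("firmware.bin", main_opts=main_opts)
```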

But how do we get those? Blobs are not all as opaque as we may think. There are typically bits and pieces of metadata around that we may be able to find and use. Let's give an example: ARM microcontroller blobs, that is, totally unpacked, fully-formed, clean ARM blobs, will contain, somewhere, the initial Interrupt Vector Table (or Exception Vector Table, if you will). If you're lucky, and your blob is extra clean, it'll be the very first thing in the file. This gives us a surprising amount of information. The very first word in the table is actually the initial stack pointer, which will fall into the normal physical address range for RAM (the low 0x20000000's). The endness of this value tells us, of course, the endness of the blob. Let's also assume that if we see a pointer in about the right place that falls in that range using either endness, it's an ARM blob (but we will refine this guess later). The second word is the handler for exception 1, "reset", which is always taken at power-on. That's our entry point! We're done here.

MSP430 binaries have a similar thing, although this time the IVT is at the very top end of memory and goes backward. Same deal there.
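The ARM vector table trick above can be sketched in a few lines. This is a rough illustration and not AutoBlob's actual code: guess_arm_ivt() is a hypothetical name, the RAM window bounds are my own guess at "the low 0x20000000's", and the check that the reset vector is odd assumes a Cortex-M style table, where handler addresses carry the Thumb bit.

```python
import struct

# Assumed window for "plausible initial stack pointer" (my own choice).
RAM_LO, RAM_HI = 0x20000000, 0x20100000


def guess_arm_ivt(blob: bytes):
    """Return (endness, initial_sp, reset_vector) or None if nothing fits."""
    if len(blob) < 8:
        return None
    for endness, fmt in (("little", "<II"), ("big", ">II")):
        # Word 0: initial stack pointer. Word 1: reset handler address.
        sp, reset = struct.unpack_from(fmt, blob, 0)
        sp_ok = RAM_LO <= sp <= RAM_HI and sp % 4 == 0
        # Assumption: Cortex-M vectors have the low (Thumb) bit set.
        reset_ok = reset & 1 == 1
        if sp_ok and reset_ok:
            return endness, sp, reset
    return None


# Made-up example: stack at 0x20005000, reset handler at 0x08000101.
sample = struct.pack("<II", 0x20005000, 0x08000101) + bytes(56)
result = guess_arm_ivt(sample)
```

This only covers the "extra clean" case where the table is the first thing in the file; a blob with a header in front of the table would need a scan over candidate offsets, at the cost of more false positives.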

IVTs aren't the only way to figure it out; angr has two existing analyses (girlscout and boyscout) which help with this kind of identification. It's unclear how well they work, and they could probably stand to be retrofitted, cleaned up, and reapplied inside of CLE.

There's also the very popular tool binwalk, a Python library/utility/thingy for carving stuff out of other stuff in a pretty reasonable way. It's a bit more exciting than just libmagic on steroids: there's disassembly-based analysis, specific support for weird firmware formats, compression, etc. There's even a pile of other tools based on binwalk (like firmware_mod_kit) for playing with firmware. Can we leverage any of this to make educated guesses about our blobs, or even unpack entire filesystems?

Here's what we need to do:

Implement the initial AutoBlob backend. [DONE]
Fix it so that we don't call autodetect_initial() twice when loading a binary
Create a framework for adding identification functions. We will divide our "magic tricks" into two flavors: initial and secondary. "Initial" techniques are performed merely to get the blob into CLE at all; we use them as part of the is_compatible() method for this backend to tell whether AutoBlob will even work on the blob, and call them again in the constructor to actually load the thing. "Secondary" techniques let us refine our initial guesses, such as detecting sub-architectures, reorganizing the memory map (what's code, what's data, what goes where), or maybe detecting the OS (is this blob actually VxWorks? Great, we should know that). [DONE]
Add boyscout as an initial technique [DONE]
Add cpu_rec as an initial technique [DONE]
Fix cpu_rec so that it's a real Python library
Add girlscout as a secondary technique
Re-implement girlscout so that it's better and modern (check with Paul, this may be done)
Modify CLE's backend registration scheme to be list-based (so AutoBlob can be at the end, always). See cle/backends/__init__.py
Add binwalk as a last-ditch catchall approach, and for filesystem / archive support. Binwalk is really a Python library underneath, and my initial exploration showed that this is totally doable. We may want to filter our results to account for the fact that binwalk will generate lots of false positives, particularly for instruction matching (I bet /dev/urandom produces valid ARM instructions most of the time)
Add support for extracting / mounting binwalk'd filesystem archives and using them as the basis for a concrete FS / libraries in an angr project
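The initial/secondary split in the list above could look something like the sketch below. Everything here is hypothetical and for illustration only (the Guess type, the two decorators, the toy techniques and the made-up "TOY!" tag); it is not the real AutoBlob code. Caching autodetect_initial() is one possible way to address the "don't call it twice" item, since is_compatible() and the constructor would then share a single result.

```python
from dataclasses import dataclass, replace
from functools import lru_cache
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Guess:
    arch: str
    base_addr: int = 0
    entry_point: int = 0


INITIAL: List[Callable[[bytes], Optional[Guess]]] = []
SECONDARY: List[Callable[[bytes, Guess], Optional[Guess]]] = []


def initial(fn):
    """Register a technique that can produce a first guess from raw bytes."""
    INITIAL.append(fn)
    return fn


def secondary(fn):
    """Register a technique that refines an existing guess."""
    SECONDARY.append(fn)
    return fn


@lru_cache(maxsize=8)
def autodetect_initial(blob: bytes) -> Optional[Guess]:
    # First technique to return a guess wins; results are cached per blob.
    for technique in INITIAL:
        guess = technique(blob)
        if guess is not None:
            return guess
    return None


def autodetect_secondary(blob: bytes, guess: Guess) -> Guess:
    # Each technique may return a refined guess, or None to leave it alone.
    for technique in SECONDARY:
        guess = technique(blob, guess) or guess
    return guess


@initial
def toy_magic(blob: bytes) -> Optional[Guess]:
    # Toy rule for illustration only: a made-up 4-byte tag.
    return Guess(arch="ARMEL") if blob.startswith(b"TOY!") else None


@secondary
def toy_entry(blob: bytes, guess: Guess) -> Optional[Guess]:
    # Toy refinement: pretend the entry point sits right after the tag.
    return replace(guess, entry_point=guess.base_addr + 4)


blob = b"TOY!" + bytes(12)
first = autodetect_initial(blob)   # runs the registered techniques
again = autodetect_initial(blob)   # served from the cache
refined = autodetect_secondary(blob, first)
```

One caveat with this caching approach: techniques registered after a blob has already been examined will not change its cached result.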

The current implementation lives at https://github.com/subwire/autoblob as an out-of-tree CLE backend, but we may merge it in later if the end result is well-behaved enough.

Oct 15 '17 22:10 subwire

Update: Project moved out of CLE (because, hey! we can do that now and it's OK). It now lives at http://github.com/subwire/autoblob and this lets us have nasty dependencies and not care. Also, I got some inspiration and banged out a bit more of this. Boyscout has been ported out of angr as "cubscout" but needs testing. cpu_rec (https://github.com/airbus-seclab/cpu_rec) is now included, as the far-superior-but-super-heavy alternative to cubscout.

I looked into girlscout, and it's an absolute mess at the moment after the wip/the_end_times refactor. This is, unfortunately, also the more important part; we have plenty of ways to detect architecture, but base and entry are a lot harder, and this is the most sophisticated general method I'm aware of.

Oct 17 '17 17:10 subwire

I'm curious what the current status of this is; I would be interested in taking a look. Is work still being done on autoblob? I noticed that the autoblob repository has had a few updates since this issue was last updated.

Do you have anything in particular that you would like help on, or has this been shoved to the side?

Jun 30 '20 20:06 desertsagebrush

@Wmyers559 thanks for reaching out -- the answer is complicated, so I'll try to be brief:

Basically, we intended this to hold a lot of heuristics, parsing routines, library integrations, and so on, adding them as needed. That "as needed" part ended up being key, because the first few heuristics I added were the only ones I needed for my research up to and including now :) Instead of solving the generic problem of de-blob-ing firmware images like we intended, we just opted to work with simpler firmware for the time being (e.g., those with generic patterns we could parse).

There are some research problems here, as well as engineering tasks, which would benefit the community greatly if solved.

Particularly, automatic base address and entry point detection is an elusive problem in the generic case, with numerous attempts (search angr's issues for girlscout) all failing so far. This is "the big one" for firmware stuff.
Fixing cpu_rec so that it behaves in this context is low-hanging fruit that would help in the majority of cases, where getting the right architecture is all that's needed. It's not perfect, but it's pretty good.
Support for a rapid binary parsing library (like Kaitai Struct and similar) would be awesome, for those nasty custom cases that can't be handled fully automatically. Heck, we could port the few heuristics we have to something like that.

I will also point out that we tried to tame binwalk for this kind of thing, and failed miserably. Handling blob firmware with binwalk in an automated fashion is not particularly effective in the presence of dense instruction formats like ARM :)

In any case, lots to do, so if you're interested, find me on angr slack for details :)

Jul 08 '20 23:07 subwire

Sounds good -- I'll chat with you there.

Jul 13 '20 14:07 desertsagebrush

This issue has been marked as stale because it has no recent activity. Please comment or add the pinned tag to prevent this issue from being closed.

May 22 '22 02:05 github-actions[bot]

Out of scope.

Oct 26 '22 22:10 zwimer
