The debug information of binary
Debug information is very useful during disassembly. How can angr automatically parse the debug information of a binary? Can angr obtain the source code line number for each instruction from the debug information?
angr presently does not have this capability. It will, eventually, but it might not be soon. If you want to contribute this functionality for ELF files, you should put it here, exporting it from the DWARF data to a format-agnostic interface that could theoretically also be exported for other formats.
It seems that angr loads DWARF debugging info now; see angr/cle#284. However, this is still only provided at a low level by cle.
gcc -g simple.c -o simple
And then in Python:
import cle

# Ask cle to parse the DWARF data while loading the binary
ld = cle.Loader('simple', load_debug_info=True)
elfe = ld.all_elf_objects[0]
cu = elfe.compilation_units[0]
global_vars = cu.global_variables
# cu.functions is indexed by address; 4425 (0x1149) seems to be the address of main
main_func = cu.functions[4425]
local_vars = main_func.local_variables
line_numbers = elfe.addr_to_line
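As a rough illustration of how a per-instruction line lookup can work on top of such data: a DWARF line table lists the address at which each source line starts, so an address between two entries belongs to the earlier one. The sketch below uses a hand-written, hypothetical table (the addresses after 0x1149 and all line numbers are made up) and the helper name `source_location` is an assumption; the actual structure of cle's `addr_to_line` may differ.

```python
import bisect

# Hypothetical line table: (start address, source file, line), sorted by address.
# The values are invented for illustration, not read from a real binary.
LINE_TABLE = [
    (0x1149, "simple.c", 5),
    (0x1151, "simple.c", 6),
    (0x1158, "simple.c", 7),
    (0x1168, "simple.c", 11),
]
END_ADDR = 0x1175  # first address past the covered range (assumed)

_STARTS = [row[0] for row in LINE_TABLE]


def source_location(addr):
    """Return (file, line) for addr, or None if addr is outside the table."""
    if addr < _STARTS[0] or addr >= END_ADDR:
        return None
    # Find the last table entry whose start address is <= addr.
    idx = bisect.bisect_right(_STARTS, addr) - 1
    _, filename, line = LINE_TABLE[idx]
    return filename, line
```

Calling `source_location` for every instruction address in a basic block would give the kind of per-instruction mapping asked about at the top of the thread.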
It would be nice if the DWARF information could be used by angr states. Then state.vars would contain all variables visible in the state. For the simple.c example:
state.vars["i"] # outputs 15
state.vars["b"] # outputs 42
state.vars["c"] # outputs '!'
state.src_location # outputs ('simple.c',11)
Other thoughts/suggestions?
Yes, that would be quite nice :) However, it's incredibly difficult, and much more work than I want to put time into myself. The help wanted label stays!
Just as a note for anyone who runs into the same problem: currently, the functions and global_variables attributes of CompilationUnit are static, so they are shared across all instances.
I'm not sure if this is intended, but it does cause problems when you try to analyze multiple binaries (multiple projects) and rely on the compilation unit's functions, like:
a = cle.Loader('/bin/ls', load_debug_info=True)
b = cle.Loader('/bin/bash', load_debug_info=True)
# now a's functions and b's functions are merged!
In this case, if you want a's and b's functions from DWARF separately, you will be in trouble. Sadly, I have already run into this issue.
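The merging described here is what a mutable class-level attribute produces in Python. The sketch below assumes that this was the cause (the thread does not show the cle code itself), and the class names `SharedUnit` and `IsolatedUnit` are hypothetical stand-ins, not cle classes.

```python
class SharedUnit:
    # Class-level dict: a single object shared by every instance (the pitfall).
    functions = {}

    def add(self, addr, name):
        self.functions[addr] = name


class IsolatedUnit:
    def __init__(self):
        # Instance-level dict: each object gets its own container.
        self.functions = {}

    def add(self, addr, name):
        self.functions[addr] = name


a, b = SharedUnit(), SharedUnit()
a.add(0x1000, "ls_main")
b.add(0x2000, "bash_main")
shared_merged = a.functions is b.functions  # both names refer to one dict

c, d = IsolatedUnit(), IsolatedUnit()
c.add(0x1000, "ls_main")
d.add(0x2000, "bash_main")
```

With `SharedUnit`, `a.functions` ends up containing both entries; with `IsolatedUnit`, `c` and `d` keep their functions apart.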
uh oh. that's not supposed to happen...
Fixed in https://github.com/angr/cle/commit/d2f2405e8b521ffc56566ee45a835c0ba0561042
We have worked on this issue on a fork. It was not that difficult but took quite some time. We created new plugins for the variables, state.dvars and p.kb.dvars, which seem to be redundant with the existing angr plugins. But it was easier and quicker without much angr background.
To try it out, make sure you have lks9/cle and lks9/angr. The binary is the one from above.
import angr,cle
p = angr.Project('simple', load_debug_info=True)
init_state = p.factory.entry_state()
simgr = p.factory.simgr(init_state)
simgr.explore(find=0x40117f)
s = simgr.found[0]
p.kb.dvars.load_from_dwarf()
s.dvars["i"].type.name # returns 'int'
s.dvars["c"].type.name # returns 'char'
s.mem[s.dvars["i"].addr].int # returns a <BV32 0xf>
s.mem[s.dvars["b"].addr].int # returns a <BV32 0x2a>
s.mem[s.dvars["c"].addr].char # returns a <BV8 33>
# Edit: the following now does the same
s.dvars["i"].mem
s.dvars["b"].mem
s.dvars["c"].mem
What works:
- Play around with tests_src/various_variables.c and the binary tests/x86_64/various_variables in lks9/binaries. Be sure to call p.kb.dvars.load_from_dwarf() before using s.dvars.
- We load the variables and types from DWARF.
- We can deal with variables with the same name but different visibility scopes. So we might have string as a global variable, as a static variable in a different file, as a local variable in a function, and as a local variable in a (while/for/if) block.
What does not work:
- ~~We don't have any tests yet, so it is untested!~~
- ~~There should be some way to plug in the s.dvars["i"].type (which has a name "int" and also a byte size from DWARF) to s.mem[s.dvars["i"].addr].~~ Edit: Solved, now we have .with_type, see #3618.
- ~~Some types are not parsed yet (typedef, union, special handling for string).~~
- Special variable locations, register variables, implicit variables etc. are not supported.
What is not optimal:
- Documentation
- Mentioned redundancy
- Performance: If we have many variables with the same name (let's say i, which is often used in for loops), the lookup might take linear time. Also, we unnecessarily create new instances of SimDebugVariable in the new angr/state_plugins/debug_variables.py, which could be cached.
- CFA computation (needed for the address of local variables) is not correct everywhere: it computes a wrong address at the very beginning of a function (but then all local variables are uninitialized anyway).
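To make the scoping and the linear lookup concrete, here is a minimal sketch that is not the fork's implementation. It assumes each variable carries the address range in which it is visible and that scopes are properly nested, so the narrowest matching range is the innermost scope. The names `ScopedVar` and `lookup` and all addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedVar:
    name: str
    low_pc: int   # first address where the variable is visible
    high_pc: int  # first address past the visibility range
    where: str    # human-readable description of the scope


def lookup(variables, name, pc):
    """Return the innermost ScopedVar called `name` visible at `pc`, else None."""
    best = None
    for var in variables:  # linear scan over all candidates
        if var.name != name or not (var.low_pc <= pc < var.high_pc):
            continue
        # With nested scopes, the narrowest range is the innermost one.
        if best is None or (var.high_pc - var.low_pc) < (best.high_pc - best.low_pc):
            best = var
    return best


VARS = [
    ScopedVar("string", 0x0000, 0x10000, "global"),
    ScopedVar("string", 0x1149, 0x11A0, "local in main"),
    ScopedVar("string", 0x1160, 0x1180, "local in while block"),
]
```

Here `lookup(VARS, "string", 0x1170)` picks the block-local variable, while an address outside main falls back to the global. Grouping the candidates by name in a dict first would shorten the scan, although many same-named variables such as i would still be walked one by one.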
We have not prepared a pull request yet but it would be nice to get some feedback. Thanks in advance!
Edit: Renamed s.nvariables to s.dvars as suggested by @ltfish below, thanks!
Some feedback:
- what does the n in nvariables stand for?
- you could probably make nvariables work the same as mem - accessing it returns a "live" object a la simmemoryview, and you can either assign a value to it or say .resolved to get the bitvector or access whatever metadata you want from it.
- feel free to make changes to the simmemoryview interface - it's not that complicated
The code, at least on the CLE side, looks nice.
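The "live object" idea from the feedback above can be sketched roughly as follows. This is not angr's SimMemoryView: a plain dict stands in for state memory, and the names `LiveVar` and `LiveVars` as well as the addresses are hypothetical. The point is only that the returned view reads memory on access rather than copying the value.

```python
class LiveVar:
    """A view onto one variable; reads and writes go to the backing store."""

    def __init__(self, store, addr, type_name):
        self._store = store
        self.addr = addr
        self.type_name = type_name  # metadata travels with the view

    @property
    def resolved(self):
        # Read the current value each time, so later writes are visible.
        return self._store[self.addr]

    def store(self, value):
        self._store[self.addr] = value


class LiveVars:
    def __init__(self, store, layout):
        self._store = store
        self._layout = layout  # name -> (addr, type_name)

    def __getitem__(self, name):
        addr, type_name = self._layout[name]
        return LiveVar(self._store, addr, type_name)

    def __setitem__(self, name, value):
        self[name].store(value)


memory = {0x7FF0: 15, 0x7FF4: 42}  # stand-in for state memory
dvars = LiveVars(memory, {"i": (0x7FF0, "int"), "b": (0x7FF4, "int")})
view = dvars["i"]
memory[0x7FF0] = 16  # the view reflects this because it reads on access
dvars["b"] = 7       # assignment writes through to the backing store
```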
Thanks for the quick reply!
- We just named it nvariables, for "new", because p.kb.variables already existed. However, that did not seem to support overlapping variable visibility scopes. To get it done quickly, we created the new knowledge plugin nvariables, and we gave the state plugin the same name.
- Yes, thanks for the suggestion, we will look into it.
- OK, I will open a pull request for .with_type. I am sorry, but I have to admit that simmemview is a bit hard to read. There seem to be a lot of nasty Python specifics in it, and the code is almost undocumented and uncommented. Nevertheless, the cleanest way would probably be to merge the state plugin nvariables into simmemoryview. But this seems a bit too ambitious for now, especially because simmemoryview wants angr types but we have DWARF types from cle.
@lks9 I suppose you will send a separate PR based on https://github.com/lks9/angr. Do you intend to implement variable name matching for angr decompiler?
Also, I would prefer a name that is more descriptive than nvariables. Even dvariables (debug variables) is better than new variables.
How about renaming the plugin to DebugVariableManager and allowing access to it via kb.dvars (.dvariables is too long, and I avoided kb.vars for VariableManager due to vars being a builtin)?
Thank you, I was a bit afraid and thought it would be much harder to get it merged. So I am really happy now.
Do you intend to implement variable name matching for angr decompiler?
Not at all, but I highly respect the work at the decompiler.
The reason we wanted the debugging info was to check assertions on recorded program traces, so we are using angr for its symbolic execution and constraint solving. The approach is somewhat the opposite of run-time assertion checking, where the assertions are fixed before program execution: here we can check assertions after program execution. It then turned out that we needed the variable locations from DWARF, and that is where we are.
How about renaming the plugin to DebugVariableManager and allowing access to it via kb.dvars (.dvariables is too long, and I avoided kb.vars for VariableManager due to vars being a builtin)?
Thank you very much for the suggestion, we will do that.
@ducorduck Could you do the refactoring? IDEs often have a "refactor/rename" option; alternatively, on Linux you can use sed -i "s|nvariables|dvars|g" FILENAME. Also, renaming the files with git mv nvariables/__init__.py debug_variables.py seems to be a good idea.
Not at all, but I highly respect the work at the decompiler.
Awesome. I'll add support for using debug variable names in angr decompiler then.