The debug information of binary

Open fpb1386 opened this issue 4 years ago • 6 comments

Debug information is very useful when working with disassembly. How can angr automatically parse a binary's debug information? Could angr obtain the source code line number for each instruction from the debug information?

fpb1386 avatar Nov 25 '20 13:11 fpb1386

angr presently does not have this capability. It will, eventually, but it might not be soon. If you want to contribute this functionality for ELF files, you should put it here, exporting it from the DWARF data to a format-agnostic interface that could theoretically also be exported for other formats.

rhelmot avatar Nov 26 '20 01:11 rhelmot

It seems that angr loads DWARF debugging info now. See angr/cle#284. However, it is still exposed only at a low level, through cle.

simple.zip

gcc -g simple.c -o simple

And then in python:

import cle

ld = cle.Loader('simple', load_debug_info=True)
elfe = ld.all_elf_objects[0]
cu = elfe.compilation_units[0]

global_vars = cu.global_variables

# 4425 seems to be the address of main
main_func = cu.functions[4425]
local_vars = main_func.local_variables

line_numbers = elfe.addr_to_line
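
To come back to the original question (a source line for each instruction): a DWARF line table only records the addresses where a source line starts, so an instruction in between belongs to the closest preceding entry. Below is a minimal sketch of that lookup. It assumes the table is a plain mapping from address to a (filename, line) pair; the real shape of elfe.addr_to_line may differ, and the helper name lookup_line and the sample table are made up for illustration.

```python
import bisect


def lookup_line(addr_to_line, addr):
    """Return the entry for the greatest table address <= addr, or None."""
    keys = sorted(addr_to_line)
    idx = bisect.bisect_right(keys, addr) - 1
    if idx < 0:
        # addr lies before the first address in the table
        return None
    return addr_to_line[keys[idx]]


# Hypothetical line table; 0x1149 == 4425, the address of main used above.
table = {
    0x1149: ("simple.c", 5),
    0x1151: ("simple.c", 7),
    0x1160: ("simple.c", 11),
}

print(lookup_line(table, 0x1155))
```

Sorting on every call keeps the sketch short; if the table is already a sorted container, the sort can be dropped.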

It would be nice if the DWARF information could be used by angr states. Then state.vars would contain all variables visible in the state. For the simple.c example:

state.vars["i"]    # outputs 15
state.vars["b"]    # outputs 42
state.vars["c"]    # outputs '!'
state.src_location # outputs ('simple.c',11)

Other thoughts/suggestions?

grlks avatar May 30 '22 17:05 grlks

Yes, that would be quite nice :) However, it's incredibly difficult - much more than I want to put the time into myself. The help wanted label stays!

rhelmot avatar May 30 '22 22:05 rhelmot

Just as a note for anyone who runs into the same problem: currently, CompilationUnit's functions and global_variables are defined as static attributes. I'm not sure if this is intended, but it causes problems when you try to analyze multiple binaries (multiple projects) and rely on the compilation units' functions, like:

a = cle.Loader('/bin/ls', load_debug_info=True)
b = cle.Loader('/bin/bash', load_debug_info=True)

# now a's functions and b's functions are merged!

In this case, if you want a's and b's functions from DWARF separately, you will be in trouble. Sadly, I have already run into this issue.
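
For readers unfamiliar with this pitfall: assuming "static" here means a mutable attribute defined on the class rather than in __init__, the merging can be reproduced without cle at all. The class names below are hypothetical stand-ins, not cle's actual code.

```python
class SharedUnit:
    # Class-level dict: a single object shared by every instance.
    functions = {}


class SeparateUnit:
    def __init__(self):
        # Instance-level dict: a fresh object for each instance.
        self.functions = {}


a, b = SharedUnit(), SharedUnit()
a.functions[0x1000] = "ls_main"
b.functions[0x2000] = "bash_main"
shared_keys = sorted(a.functions)  # a also sees b's entry

c, d = SeparateUnit(), SeparateUnit()
c.functions[0x1000] = "ls_main"
d.functions[0x2000] = "bash_main"
separate_keys = sorted(c.functions)  # c only sees its own entry

print(shared_keys, separate_keys)
```

Moving the containers into __init__ is the usual way to avoid this kind of sharing.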

Escapingbug avatar Jun 23 '22 07:06 Escapingbug

uh oh. that's not supposed to happen...

rhelmot avatar Jun 23 '22 14:06 rhelmot

Fixed in https://github.com/angr/cle/commit/d2f2405e8b521ffc56566ee45a835c0ba0561042

rhelmot avatar Jun 23 '22 14:06 rhelmot

We have worked on this issue in a fork. It was not that difficult, but it took quite some time. We created new plugins for the variables, state.dvars and p.kb.dvars, which seem to be redundant with the existing angr plugins. But this was easier and quicker without much angr background.

To try it out, make sure you have lks9/cle and lks9/angr. Binary from above.

import angr, cle
p = angr.Project('simple', load_debug_info=True)
init_state = p.factory.entry_state()
simgr = p.factory.simgr(init_state)
simgr.explore(find=0x40117f)
s = simgr.found[0]

p.kb.dvars.load_from_dwarf()

s.dvars["i"].type.name         # returns 'int'
s.dvars["c"].type.name         # returns 'char'

s.mem[s.dvars["i"].addr].int   # returns a <BV32 0xf>
s.mem[s.dvars["b"].addr].int   # returns a <BV32 0x2a>
s.mem[s.dvars["c"].addr].char  # returns a <BV8 33>
# *Edit* Does the same now:
s.dvars["i"].mem
s.dvars["b"].mem
s.dvars["c"].mem

What works:

  • Play around with tests_src/various_variables.c and binary tests/x86_64/various_variables in lks9/binaries. Be sure to call p.kb.dvars.load_from_dwarf() before using s.dvars.
  • We load the variables and types from DWARF.
  • We can deal with variables that have the same name but different visibility scopes. For example, a variable named string may exist as a global variable, as a static variable in a different file, as a local variable in a function, and as a local variable in a (while/for/if) block.
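
To illustrate the last point, here is a minimal sketch of name resolution across nested scopes: among all variables with the requested name whose address range contains the current pc, pick the one with the narrowest range. This is not the fork's implementation; ScopedVar, resolve, and the address ranges are hypothetical, and the sketch assumes scopes are properly nested and ignores per-file statics.

```python
from dataclasses import dataclass


@dataclass
class ScopedVar:
    name: str
    low_pc: int    # first address where the variable is visible
    high_pc: int   # one past the last visible address
    kind: str


def resolve(variables, name, pc):
    """Pick the visible variable called `name` with the narrowest scope at pc."""
    visible = [v for v in variables
               if v.name == name and v.low_pc <= pc < v.high_pc]
    if not visible:
        return None
    return min(visible, key=lambda v: v.high_pc - v.low_pc)


variables = [
    ScopedVar("string", 0x0, 2**64, "global"),
    ScopedVar("string", 0x1149, 0x11a0, "local in function"),
    ScopedVar("string", 0x1160, 0x1180, "local in while block"),
]

print(resolve(variables, "string", 0x1170).kind)
```

The scan is linear in the number of variables, which matches the performance caveat listed further down.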

What does not work:

  • ~~We don't have any tests yet, so it is untested!~~
  • ~~There should be some way to plug in the s.dvars["i"].type (which has a name "int" and also a byte size from dwarf) to s.mem[s.dvars["i"].addr].~~ Edit: Solved, now we have .with_type, see #3618.
  • ~~Some types are not parsed yet (typedef, union, special handling for string).~~
  • Special variable locations, register variables, implicit variables etc. are not supported.

What is not optimal:

  • Documentation
  • Mentioned redundancy
  • Performance: If we have many variables with the same name (let's say i which is often used in for loops), the lookup might take linear time. Also, we unnecessarily create new instances of SimDebugVariable in the new angr/state_plugins/debug_variables.py which could be cached.
  • CFA computation (needed for the address of local variables) is not correct everywhere: it computes a wrong address at the very beginning of a function (but at that point all local variables are uninitialized anyway).
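
On the CFA point, one plausible explanation (an assumption; the post does not say how the fork computes the CFA) is that the CFA is derived from rbp. With the usual x86-64 prologue (push rbp; mov rbp, rsp), the CFA equals rsp + 8 at function entry and rbp + 16 only once the prologue has run. The register values below are made up.

```python
def cfa_from_rbp(regs):
    # Valid only after `push rbp; mov rbp, rsp` has executed.
    return regs["rbp"] + 16


def cfa_at_entry(regs):
    # Valid at the first instruction: only the return address is on the stack.
    return regs["rsp"] + 8


# Hypothetical register state at the first instruction of a function;
# rbp still points into the caller's frame.
entry = {"rsp": 0x7FFE00000F98, "rbp": 0x7FFE00000FC0}

# State after the prologue: push lowers rsp by 8, then rbp takes rsp's value.
after_prologue = {"rsp": entry["rsp"] - 8, "rbp": entry["rsp"] - 8}

true_cfa = cfa_at_entry(entry)
print(hex(true_cfa))
print(cfa_from_rbp(after_prologue) == true_cfa)  # same frame, same CFA
print(cfa_from_rbp(entry) == true_cfa)           # rbp-based rule fails at entry
```

A full fix would evaluate the call frame information for the current pc instead of hard-coding either rule.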

We have not prepared a pull request yet but it would be nice to get some feedback. Thanks in advance!

Edit: Renamed s.nvariables into s.dvars as suggested by @ltfish below, thanks!

lks9 avatar Nov 18 '22 14:11 lks9

Some feedback:

  • what does the n in nvariables stand for?
  • you could probably make nvariables work the same as mem - accessing it returns a "live" object a la simmemoryview, and you can either assign a value to it or say .resolved to get the bitvector or access whatever metadata you want from it.
  • feel free to make changes to the simmemoryview interface - it's not that complicated
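
A rough sketch of what the second suggestion could look like, using a plain bytearray in place of angr's memory: indexing returns a view that reads memory only when .resolved is accessed, and assignment writes through to memory. LiveVar, DebugVars, and the layout table are hypothetical; this does not show how simmemoryview or the fork is actually written.

```python
class LiveVar:
    """Hypothetical view of one variable; reads memory only when asked."""

    def __init__(self, memory, addr, size):
        self.memory, self.addr, self.size = memory, addr, size

    @property
    def resolved(self):
        raw = self.memory[self.addr:self.addr + self.size]
        return int.from_bytes(raw, "little")


class DebugVars:
    """Hypothetical name-indexed front end over a flat memory buffer."""

    def __init__(self, memory, layout):
        self.memory = memory
        self.layout = layout  # name -> (address, size in bytes)

    def __getitem__(self, name):
        addr, size = self.layout[name]
        return LiveVar(self.memory, addr, size)

    def __setitem__(self, name, value):
        addr, size = self.layout[name]
        self.memory[addr:addr + size] = value.to_bytes(size, "little")


memory = bytearray(16)
dvars = DebugVars(memory, {"i": (0, 4), "c": (4, 1)})

view = dvars["i"]       # taken before the write below
dvars["i"] = 15
dvars["c"] = ord("!")

print(view.resolved, dvars["c"].resolved)
```

Because the view holds a reference to the memory rather than a copy of the value, a view obtained before a write still reports the new value afterwards.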

The code, at least on the CLE side, looks nice.

rhelmot avatar Nov 18 '22 15:11 rhelmot

Thanks for the quick reply!

  • We just named it nvariables, for "new", because p.kb.variables already existed. However, p.kb.variables did not seem to support overlapping variable visibility scopes. To get it done quickly, we created the new knowledge plugin nvariables, and we gave the state plugin the same name.
  • Yes thanks for the suggestion, we will look into it.
  • OK, I will open a pull request for .with_type. I am sorry, but I have to admit that simmemoryview is a bit hard to read. There seem to be a lot of nasty Python specifics in it, and the code is almost undocumented and uncommented. Nevertheless, the cleanest way would probably be to merge the state plugin nvariables into simmemoryview. But this seems a bit too ambitious for now, especially because simmemoryview wants angr types but we have DWARF types from cle.

lks9 avatar Nov 18 '22 17:11 lks9

@lks9 I suppose you will send a separate PR based on https://github.com/lks9/angr. Do you intend to implement variable name matching for angr decompiler?

ltfish avatar Dec 11 '22 09:12 ltfish

Also I would prefer a name that is more descriptive than nvariables. Even dvariables (debug variables) is better than new variables.

How about renaming the plugin to DebugVariableManager and allowing access to it via kb.dvars (.dvariables is too long, and I avoided kb.vars for VariableManager due to vars being a builtin)?

ltfish avatar Dec 11 '22 09:12 ltfish

Thank you, I was a bit afraid and thought it would be much harder to get it merged. So I am really happy now.

Do you intend to implement variable name matching for angr decompiler?

Not at all, but I highly respect the work at the decompiler.

We wanted the debugging info in order to check assertions on recorded program traces, so we are using angr for its symbolic execution and constraint solving. The approach is roughly the opposite of run-time assertion checking: there, the assertions are fixed before program execution, whereas we can check assertions after execution. It then turned out that we needed the variable locations from DWARF, and that is where we are.

How about renaming the plugin to DebugVariableManager and allowing access to it via kb.dvars (.dvariables is too long, and I avoided kb.vars for VariableManager due to vars being a builtin)?

Thank you very much for the suggestion, we will do that.

@ducorduck Could you do the refactoring? IDEs often have a "refactor/rename" option; alternatively, on Linux you can use sed -i "s|nvariables|dvars|g" FILENAME. Renaming the files with git mv nvariables/__init__.py debug_variables.py also seems like a good idea.

lks9 avatar Dec 11 '22 16:12 lks9

Not at all, but I highly respect the work at the decompiler.

Awesome. I'll add support for using debug variable names in angr decompiler then.

ltfish avatar Dec 16 '22 19:12 ltfish