ex/mc/graph: RISC-V decompiler truncates binary instruction addresses to 32 bits

Open • mbrcknl opened this issue 5 years ago • 1 comment

This issue concerns the RISC-V version of the binary-to-graph decompiler in examples/machine-code/graph.

This is more of a feature request than a bug.

When parsing addresses of instructions, the binary-to-graph decompiler truncates the address to the least significant 32 bits. The truncated addresses are used for (at least) two purposes:

  • generating node names for the decompiled graph output.
  • generating PC-relative constants for instructions like auipc.
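To make the second point concrete, here is a minimal Python sketch of how auipc computes its result on RV64 and what changes when the PC is first truncated to 32 bits. The instruction address and immediate are hypothetical values chosen for illustration, and the helper names are not taken from the decompiler.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def auipc(pc, imm20):
    """RV64 auipc: rd = pc + sign_extend(imm20 << 12), modulo 2**64."""
    offset = sign_extend((imm20 & 0xFFFFF) << 12, 32)
    return (pc + offset) & MASK64


# Hypothetical instruction address in the upper part of the 64-bit space.
full_pc = 0xFFFFFFFF84001000
# What the decompiler is described as keeping: the low 32 bits only.
truncated_pc = full_pc & MASK32

full_target = auipc(full_pc, 0x10)
truncated_target = auipc(truncated_pc, 0x10)

print(hex(full_target))       # 0xffffffff84011000
print(hex(truncated_target))  # 0x84011000
```

In this example the two results agree in their low 32 bits, but only the first can be used directly as a key into a table indexed by full addresses.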

As far as the graph semantics are concerned, node names are just names, so consistency with the original instruction addresses is not relevant to the decompiler's correctness. Furthermore, if the binary code is position-independent and fits entirely within one 32-bit segment, then perhaps even the truncation of constants loaded by auipc is benign.

However, the graph-refine component of the seL4 binary correctness toolchain makes some assumptions about how the decompiler handles these addresses. In particular, for access to read-only data, graph-refine uses the address loaded by the auipc instruction to locate the relevant entry in kernel.elf.rodata, which requires the original full address. We have three options to deal with this:

  1. Modify the decompiler to make use of the full address from the ELF binary.
  2. Modify graph-refine to truncate addresses when parsing kernel.elf.rodata.
  3. Post-process the decompiler output to update constants loaded by auipc. This is probably the worst option, but it's the one we've taken so far. :-)
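As a rough illustration of what option 2 would involve, the following Python sketch re-keys a read-only data table by the low 32 bits of each address. The table layout, the function name, and the sample addresses are all assumptions made for this example; they are not graph-refine's actual data structures.

```python
MASK32 = (1 << 32) - 1


def truncate_rodata_keys(rodata):
    """Re-key a {full_address: word} table by the low 32 bits of each address.

    Raises ValueError if two distinct full addresses share the same low
    32 bits, since lookups by truncated address would then be ambiguous.
    """
    truncated = {}
    for address, word in rodata.items():
        key = address & MASK32
        if key in truncated:
            raise ValueError(f"addresses collide after truncation: {key:#x}")
        truncated[key] = word
    return truncated


# Hypothetical rodata entries, keyed by full 64-bit address.
rodata = {
    0xFFFFFFFF84011000: 0xDEADBEEF,
    0xFFFFFFFF84011008: 0x12345678,
}

by_low_bits = truncate_rodata_keys(rodata)
# Look up using a truncated address such as the decompiler currently emits.
looked_up = by_low_bits[0x84011000]
```

The collision check reflects the caveat above: truncation is only harmless while everything of interest fits within one 32-bit segment.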

The first option would be the ideal solution, but it's probably also the most effort. I've made some partial progress, and would be happy for this issue to be assigned to me.

mbrcknl • Jul 13 '20 08:07

Yuck.

It's perhaps worth considering that graph-refine might want to support some kind of relocation between the read-only data as addressed in kernel.elf.rodata and as it appears in memory. That would probably happen in the target code, somewhere between parsing the ELF and setting up the various pieces of global configuration. That's a bit like option 2 or 3 above, but maybe a bit more principled.

In any case, it's certainly not helping that information is going missing.

talsewell • Jul 26 '21 13:07