retdec
Compiling back LLVM IR
Hi there and congratulations on open-sourcing such a useful tool. :)
As an experiment, I tried to compile the generated *.bc file back, and got many undefined reference errors on _asm_smth() and pseudo_smth() calls, plus many unknown_smth symbols. Is there any future possibility to recompile binaries?
Hi. Good question :-). If I understood you correctly, you would like to decompile a binary file, take the LLVM IR representation of it that RetDec generates (.ll/.bc), and compile that LLVM IR back into a binary file via e.g. Clang?
Unfortunately, I think that the answer is that this is generally not possible, mainly because the front-end part of RetDec (bin2llvmir) may generate a lot of function declarations without definitions. This happens e.g.
- when there is no body of the function in the input binary file (= the function comes from a shared object or DLL),
- when we remove the body through our statically linked code detection (= it is a function from a known static library),
- or we even generate many internal function declarations without definitions ourselves to circumvent some LLVM optimizations to e.g. prevent the removal of variables that we want to keep no matter what.
You can try to insert the bodies of missing functions manually, but I am not sure whether this will work as expected because LLVM's whole-program optimizations may optimize calls to such empty functions away.
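To illustrate the suggestion of inserting missing bodies by hand, here is a minimal C sketch. The name unknown_helper, its signature, and the call counter are hypothetical and are not something RetDec is known to emit; the counter is only there so the stub has an observable side effect.

```c
#include <assert.h>

/* Declaration only, as it might appear in decompiler output (hypothetical
 * name and signature). Without a body somewhere, the linker reports an
 * undefined reference to unknown_helper. */
int unknown_helper(int value);

/* Counts how often the stub was reached (an assumption of this sketch). */
int unknown_helper_calls = 0;

/* Hand-written stub body: passes the value through and records the call. */
int unknown_helper(int value)
{
    unknown_helper_calls++;
    return value;
}

/* Example caller that would otherwise fail to link. */
int use_helper(int value)
{
    int result = unknown_helper(value) + 1;
    assert(unknown_helper_calls > 0); /* the stub really ran */
    return result;
}
```

Whether a stub like this behaves as the original function did is a separate question; the sketch only shows how the link error goes away.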
Frankly, we never thought of recompiling LLVM IR as the LLVM IR that we generate serves only as a middle layer in the whole decompilation process (binary to C). As for recompilation of the generated C source code, this is a whole other topic :-). In short, producing recompilable C source code is very hard and it is not our main goal. Nevertheless, in some cases, the C source code that RetDec produces is recompilable, which we sometimes utilize in our tests (e.g. this one, which checks that a program doing FPU computations is correctly decompiled, then recompiled, and produces correct output when run).
We hope to provide more information on the whole decompilation process in the future.
Thanks @s3rvac for providing additional details on this topic. I'd also be very interested in having the LLVM IR being self contained to include all necessary information for other tools to make use of (one related issue #38). It would be really cool to use it in combination with symbolic execution tools (e.g. klee), plug it into other static analysis tools, etc.
Specifically for the function declarations missing function bodies, such that linking fails, a few of the bullets you mentioned could be resolved at link time, and others by bin2llvmir.
- when there is no body of the function in the input binary file (= the function comes from a shared object or DLL),
For these cases it should be possible to link against said library.
- when we remove the body through our statically linked code detection (= it is a function from a known static library),
Similar to the above bullet, it should be possible to link the binary dynamically against the original library for which functions had been statically linked (e.g. clang -o foo foo.c -lc).
As a side note, my friends and I have played with this specifically, to remove statically linked libraries (e.g. libc) from existing binaries and relink them dynamically (by injecting entries into the .idata section of the binary) to the corresponding shared libraries. This approach has proved successful, so it is definitely possible :)
- or we even generate many internal function declarations without definitions ourselves to circumvent some LLVM optimizations to e.g. prevent the removal of variables that we want to keep no matter what.
This is the only case where bin2llvmir would need to be updated, to support later linking of the LLVM IR, and that would be done by allowing the function bodies of these internal functions to be emitted; perhaps guarded by a command line flag to bin2llvmir or something along those lines.
@a1batross Thanks for submitting this issue, I think it is a great aspirational goal for bin2llvmir to become usable in a wider sense than specifically as a front-end to the retdec decompiler pipeline. With this in mind, bin2llvmir and capstone2llvmir could be further developed by other members of the open source community (to include more input file formats, architectures and instruction sets, etc), as more people would be using bin2llvmir when it supports further use cases.
It should be noted that MC-Semantics, aka mcsema, has the capability of recompiling the LLVM IR to fully functional binary executables, so the idea is not entirely crazy, just a bit out there!
For those who haven't seen it yet, this is a really great presentation of MC-Semantics, which showcases some of its capabilities: https://www.youtube.com/watch?v=nW9bE5tUVYg
I think both bin2llvmir (with capstone2llvmir as backend) and mcsema (with remill as backend) have a lot they may learn from one another. It is very exciting for the reverse engineering and static analysis community to have these great tools now being open source!
Cheerful regards, /u
@s3rvac @mewmew Thanks for such a detailed answer! :)
Yes, I meant this. My goal is to disassemble into LLVM IR (or decompile to C code) to bring well-written but abandoned code to a newer platform. It's mostly Windows x86 libraries compiled by MSVC6, and I want them on Linux x86 or even on Linux ARM, as x86 and ARM have pretty comparable data type sizes. And it's kinda possible, which reminds me of notaz's work on porting StarCraft from Windows x86 to Linux ARM.
Most of this code just depends on standard C and C++ functions, so removed static library code is a problem solved by linking libc and libm.
I just thought that working with LLVM IR should be easier, but it seems that generated C code is valid. Nice work here! :)
Yes, I heard about mcsema, but have not tried it yet. If I have any success with recompiling IR, I will post an update here maybe.
I tried to compile the C code and have two questions. It seems that __pseudo_call(x) is just calling a function through its pointer in x, and that __pseudo_cond_branch(x, y) is an if( x ) jmp y;. They are created by the decompiler; am I right about their purpose?
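The reading proposed in this question can be written out as a small C sketch. This only spells out the poster's interpretation and is not confirmed RetDec semantics; pseudo_call_sketch, pseudo_cond_branch_sketch, func_ptr, and add_one are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function pointer type used only in this sketch. */
typedef int (*func_ptr)(int);

/* Hypothetical call target. */
int add_one(int n)
{
    return n + 1;
}

/* The reading of __pseudo_call(x): an indirect call through the
 * function pointer stored in x. */
int pseudo_call_sketch(func_ptr x, int arg)
{
    assert(x != NULL);
    return x(arg);
}

/* The reading of __pseudo_cond_branch(x, y): "if (x) jmp y".
 * The label `taken` plays the role of the jump target y. */
int pseudo_cond_branch_sketch(int x)
{
    if (x)
        goto taken;
    return 0; /* fall-through path */
taken:
    return 1; /* branch target */
}
```

Under this reading, pseudo_call_sketch(add_one, 41) reaches add_one through the pointer, and pseudo_cond_branch_sketch takes the jump only for a nonzero condition.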
My two cents ...
As @s3rvac wrote, so far, it has never been our goal to produce LLVM IR or C that could be recompiled. The goal was to produce something, that could be analyzed by a human. State at the moment:
- RetDec's output quality is nowhere near good enough to even think about working recompilation.
- As @mewmew wrote, even if quality was not an issue, bin2llvmir does not currently produce the kind of output that could be recompiled.
I think the second point could be decently solved in the future. We would certainly like to make it possible to hook the output to klee (or other LLVM tools) and try to do something like Ponce on LLVM IR.
The first point will be an ongoing issue forever. Yes, we will be continually improving the output quality. No, I'm not sure static analysis will ever be good enough to really reliably recompile reasonably complex real world programs. But who knows.
@a1batross, definitely check out the already mentioned mcsema. rev.ng might be even better for your purpose -- I would be interested how/if it worked, so if you try it out, please write your feedback here or to my email.
Sadly, mcsema depends on IDA or Binary Ninja. These two tools are paid and I don't want to talk about piracy, at least publicly. If I understood it right, mcsema just needs a real disassembler. I don't know why the developers haven't tried radare2; maybe it just doesn't fit their needs.
Thanks for pointing me to rev.ng, I will try it out in my free time.
MC-Semantics had implemented their own recursive descent disassembler that could be used in place of the IDA plugin. However, bin_descend was removed as they could not allocate enough resources to keep up with both implementations (for reference, it was removed in commit https://github.com/trailofbits/mcsema/commit/4864d9c999744f0eb447d4aea78848aafb66d078).
That being said, the control flow graph is stored in a Protocol Buffers format, and it would be easy to output the disassembly as protobuffers, if you already have a working recursive descent disassembler.
Oh, and rev.ng is definitely worth giving a shot!
Hi. Is there a tool or reference for translating ARM assembly code to LLVM IR? Best regards.
@chlizheng A list of tools translating binaries to LLVM IR has been summarized at https://github.com/decomp/decomp/blob/master/front-end.md
You may want to look at llvm-mctoll, rev.ng, and bin2llvm, all of which translate ARM to LLVM IR.
There are also two tools that have explicitly stated this as a future goal. Have not looked at them in a while, so may be worth checking out.
- dagger has stated translating ARM to LLVM IR as a future goal.
- OpenREIL translates ARM to REIL. They have stated translating REIL to LLVM IR as a future goal.
Edit: Oh, forgot to list RetDec. RetDec also has support for ARM (32-bit).
Cheers! Robin
@mewmew Thanks very much. I'll check them out. One more thing: actually, I want to translate ARM to RISC-V, but I haven't found anything about this, so I am preparing to do it in two steps. Has ARM to RISC-V been tried yet?
This issue tracks precisely this: translating a binary to LLVM IR, and then back again to a binary.
So:
- ARM -> LLVM IR
  - use any of the tools listed above for lifting ARM to LLVM IR
- LLVM IR -> RISC-V
  - use Clang to compile from LLVM IR to RISC-V. I think Clang targets RISC-V, but not sure.
@mewmew Thanks. If I want to translate ARM to LLVM IR by myself as an experiment, is it possible? Is this job easy, and is there some reference for it? So far I have only learned a little from the LLVM IR language reference.
@chlizheng When you write "translate arm to llvm-ir by myself", do you mean by hand? If so, that would take ages. The tools listed above are written specifically for this task, and especially to take corner cases into consideration (e.g. side effects such as the setting of flags, etc).
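As one example of the flag side effects mentioned above, here is a minimal C sketch of what a lifter has to model for a single 32-bit ARM ADDS instruction: the sum plus the N, Z, C, and V condition flags. The names arm_flags and arm_adds are hypothetical and are not taken from any of the listed tools.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four ARM condition flags. */
typedef struct {
    bool n; /* negative: bit 31 of the result */
    bool z; /* zero: result is 0 */
    bool c; /* carry: unsigned overflow out of bit 31 */
    bool v; /* overflow: signed overflow */
} arm_flags;

/* Models ADDS Rd, Rn, Rm: returns the 32-bit sum and updates the flags. */
uint32_t arm_adds(uint32_t a, uint32_t b, arm_flags *flags)
{
    assert(flags != NULL);
    uint32_t result = a + b; /* wraps modulo 2^32 */

    flags->n = (result >> 31) != 0;
    flags->z = (result == 0);
    flags->c = (result < a); /* the sum wrapped around */
    /* Signed overflow: operands share a sign, the result does not. */
    flags->v = ((~(a ^ b) & (a ^ result)) >> 31) != 0;
    return result;
}
```

A plain add in LLVM IR carries none of these flags, so each one becomes extra IR that the lifter has to emit explicitly.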
If you are asking whether the above tools are easy to use, I would have to leave that answer up to someone else. My personal experience with these kinds of tools is that they are great when they work, and quite annoying when they don't. You can spend hours (or days/weeks/months) tracking through aspects of the code, trying to implement those parts that are not yet implemented. This is more of a heads up than a warning. If you want a quick easy fix to a problem, this is not it. If you want to learn more of how things work underneath, then by all means, you are warmly invited. Down here, there is joy enough to share :)
Cheers, Robin
@mewmew Hi. I tried to use the clang and LLVM tools. I used 'clang -S -emit-llvm hello.c' to get hello.ll, and then 'clang -c -emit-llvm hello.ll' to get hello.bc. Finally, I used 'llc -march=arm -mcpu=cortex-m3 -mattr=v7' to translate it to ARM assembly code.
But llc fails with an error: "llc : hello.bc : error:Invalid type for value". Is there a problem with my method or with the tools? Do you know about this?
Best regards.
Now it's possible to compile it back, but not link.
- RetDec fails to decompile pointers to functions. For example, I have a global structure filled with pointers to functions, just a simple interface in C, but RetDec leaves them as unknown_%i.
- It seems that RetDec's disassembler, or some other part of RetDec, doesn't work with statically allocated data. I don't think it's even possible to know how much memory should be allocated, so it will not work. For example, some global buffers like T buf[size] are just left as pointers. They are not initialized, so you cannot work with them.
- If you are recompiling something that was compiled by MSVC on Linux, you may find that Linux linkers don't like MSVC mangling. However, you can just replace all '@' with something else, like '_', and it will work.
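The '@' replacement described in the last point can be sketched in C as an in-place rewrite of a symbol name. The function name replace_at_signs is hypothetical, and how the renamed symbols get written back into the IR or object file is outside this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replaces every '@' in a NUL-terminated symbol name with `replacement`,
 * in place. Returns the number of characters that were replaced. */
size_t replace_at_signs(char *symbol, char replacement)
{
    size_t count = 0;

    assert(symbol != NULL);
    assert(replacement != '@' && replacement != '\0');

    /* strchr finds the next '@'; overwrite it and keep scanning. */
    for (char *p = strchr(symbol, '@'); p != NULL; p = strchr(p + 1, '@')) {
        *p = replacement;
        count++;
    }
    return count;
}
```

For a made-up name in the style of an MSVC stdcall decoration, such as _DrawFrame@8, calling replace_at_signs(name, '_') leaves _DrawFrame_8 in the buffer.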
The LLVM code emitted by retdec is incomplete and doesn't reflect the asm code; it often skips whole sequences of instructions.