edb-debugger
Preserve Analysis through reboot
Analysis can be slow on some very large binaries. It would be nice if we could save the results of the analysis to disk.
To avoid loading stale analysis when a binary changes, we can compare the analysis timestamp against the binary's modification time at load time, and discard the analysis if the binary is newer.
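A minimal sketch of that load-time check, written in Python for brevity (edb itself is C++). The function name `analysis_is_stale` and the idea of keeping the analysis in a separate file next to the binary are assumptions for illustration, not anything edb provides.

```python
import os

def analysis_is_stale(binary_path, analysis_path):
    """Return True if the saved analysis should be discarded."""
    if not os.path.exists(analysis_path):
        return True  # nothing has been saved yet
    # Discard the analysis when the binary was modified after it was written.
    return os.path.getmtime(binary_path) > os.path.getmtime(analysis_path)
```

Note that this only catches changes that update the file's modification time; a binary restored with its old timestamp would pass the check.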
A hash sum seems more robust than a timestamp.
You're right, but it could be expensive on binaries with lots of embedded resources. Maybe it should be an opt-in.
The analyzer already does an MD5 of every region it analyzes (in particular to detect changes), so that could be used directly. Fortunately, it hasn't proven to be particularly time consuming yet.
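Roughly, reusing a per-region digest to validate saved analysis could look like the sketch below. It is in Python rather than edb's C++, and `region_digest` and `cached_analysis_valid` are hypothetical names; how edb actually stores its MD5 is not shown here.

```python
import hashlib

def region_digest(region_bytes):
    """MD5 of the raw bytes of one analyzed region, as a hex string."""
    return hashlib.md5(region_bytes).hexdigest()

def cached_analysis_valid(saved_digest, region_bytes):
    """Saved analysis is reusable only if the region's bytes are unchanged."""
    return saved_digest == region_digest(region_bytes)
```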
I think the first step would be to make the analysis data store addresses relative to the module/region base instead of absolute, as it does currently. That would make saving and restoring much simpler when ASLR is involved.
We could do that, but it would probably be better to just save the base address alongside the absolute addresses so that we can do corrections at load-time.
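A small sketch of that load-time correction, in Python for brevity. The function name and the flat list of addresses are assumptions; the real analysis data is more structured.

```python
def rebase_addresses(saved_base, saved_addresses, current_base):
    """Shift absolute addresses saved under one base to the current base."""
    delta = current_base - saved_base
    return [addr + delta for addr in saved_addresses]

# Example: a module analyzed at 0x400000 is now loaded at 0x7f0000000000.
fixed = rebase_addresses(0x400000, [0x401000, 0x401234], 0x7f0000000000)
```

Either representation carries the same information: storing relative addresses just moves the subtraction to save time instead of load time.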
Sure, that would work equally well.
Using the md5 sum to determine whether analysis is still relevant might be a problem for binaries that use relocation tables. Relocation tables will cause a relocated binary to have a different hash each time.
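To illustrate the problem, here is a deliberately simplified model in Python. It treats a relocation as "add the load base to a 64-bit little-endian slot", which is an assumption for the demo and not a real ELF or PE relocation format.

```python
import hashlib
import struct

def apply_relocations(image, reloc_offsets, load_base):
    """Patch each 64-bit slot listed in reloc_offsets by adding load_base."""
    patched = bytearray(image)
    for off in reloc_offsets:
        (value,) = struct.unpack_from("<Q", patched, off)
        new_value = (value + load_base) & 0xFFFFFFFFFFFFFFFF
        struct.pack_into("<Q", patched, off, new_value)
    return bytes(patched)

# One pointer slot at offset 8 holding the module-relative value 0x1000.
image = bytes(8) + struct.pack("<Q", 0x1000)
first_run = apply_relocations(image, [8], 0x400000)
second_run = apply_relocations(image, [8], 0x500000)

# Same file on disk, different bytes in memory, so different digests.
digests_differ = (hashlib.md5(first_run).digest()
                  != hashlib.md5(second_run).digest())
```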
Well, I think that will generally be a problem for any solution that is based on "did the data in this region change". I am of course open to alternatives.
BTW, do you know if my push fixed #528?
Any suggestions on what we should serialize? Serializing everything looks impractical since these function objects have vectors of basicblocks which have vectors of instructions. Literally writing all of the instructions into a file sounds pretty redundant and probably slower than real analysis.
It seems like the most benefit would come from skipping the most expensive parts of analysis. As far as I can tell, the fuzzy analysis and basic block steps essentially save every function's start address, end address and reference count (expensive), and then disassemble and save all of those functions' instructions (cheap-ish). If we serialized only that expensive information, we might be able to recreate the Function and BasicBlock objects and their disassembled instructions more quickly than a full analysis would.
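Something like the following is what I have in mind, sketched in Python rather than edb's C++. The JSON layout, the `version` field, and the dictionary keys `start`, `end` and `refs` are all assumptions for illustration; they do not mirror the real Function class.

```python
import json

FORMAT_VERSION = 1  # hypothetical; bump whenever the layout changes

def dump_functions(functions):
    """Serialize only the expensive-to-compute summary of each function."""
    records = [
        {"start": f["start"], "end": f["end"], "refs": f["refs"]}
        for f in functions
    ]
    return json.dumps({"version": FORMAT_VERSION, "functions": records})

def load_functions(text):
    """Return the saved summaries, or an empty list if the format is unknown."""
    data = json.loads(text)
    if data.get("version") != FORMAT_VERSION:
        return []
    return data["functions"]
```

On load, each record's `start` and `end` would drive a fresh disassembly to rebuild the basic blocks and instructions, so none of the instructions are written to disk.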
Does this seem like a reasonable approach? I don't want to get too far off the deep end before I confirm this is reasonable.
I'll take a look at it and get back to you. But we should probably lean towards storing too much rather than risk storing too little.