Store symbolic expressions to database and offload memory usage
When processing a trace with millions of instructions, memory is quickly used up storing symbolic expressions. It would be nice to have a feature that offloads these expressions to on-disk storage, such as a database backend. It could also save processing progress so that iterative analysis can reuse the saved data.
You can use some of the optimization modes:
setMode(ALIGNED_MEMORY, True)
setMode(CONSTANT_FOLDING, True)
setMode(ONLY_ON_SYMBOLIZED, True)
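For context, here is a minimal, self-contained sketch of how these modes are enabled through Triton's Python bindings (assuming an x86-64 context; adapt the architecture to your trace):

from triton import TritonContext, ARCH, MODE

ctx = TritonContext(ARCH.X86_64)

# Merge memory accesses on aligned boundaries into single expressions.
ctx.setMode(MODE.ALIGNED_MEMORY, True)
# Fold constant sub-expressions instead of keeping them in the AST.
ctx.setMode(MODE.CONSTANT_FOLDING, True)
# Only build symbolic expressions when the state is actually symbolized.
ctx.setMode(MODE.ONLY_ON_SYMBOLIZED, True)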
Yes, I'm already using these modes, but still, when there are hundreds of millions of instructions to process, memory use can quickly go out of control. An 8-hour run over 30 million instructions led to over 100 GB of memory use.
Yeah, everything depends on what you symbolize. The more you symbolize, the more memory you consume, and over millions of instructions you pay that cost. The most I've done is 300 million instructions, but by symbolizing only what I'm looking for I can keep my RAM usage below 16 GB. May I know what you are analyzing?
Let me experiment more with the ONLY_ON_SYMBOLIZED mode first. I'm currently hitting some bug; not sure if it's my script, Triton, or my patches to Triton... I need to find some time to debug it, but I will update here when I know more.
I think I still need to solve this problem in some way, especially now that we have the symbolic memory model, which means runtime memory consumption could be much higher. Moreover, running over hundreds of millions of instructions can take days. If anything goes wrong in the process, I don't want my data to sit only in memory, where it would be lost and force me to start over.
Having the symbolic expressions and concrete values stored in a DB would not only enable recoverable processing but also the ability to run a lot of offline analyses without running the whole symbolization process again. For example, I could explore different slicings of the program with respect to different origins without doing a full symbolization run every time.
Of course, we don't want Triton to integrate directly with any particular database backend, but we should have a generic symbolic expression storage interface that lets users implement their own database backends.
Basically, the idea is to do symbol lookups through a general interface: you pass in a symbol name or an associated register/memory address, and the interface returns the symbolic expression object; it can also set a symbolic expression on the store. The same applies to concrete values.
The default in-memory implementation of that storage interface would simply keep everything in memory, as Triton does today. An adapter that wants to offload memory usage to a database could serialize symbolic expressions into the database, evict less frequently used data from memory, and reload it from the database only when requested. A sketch of such an interface follows.
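Something along these lines (a hypothetical Python sketch; ExpressionStore and its method names are illustrative, not an existing Triton API):

import abc

class ExpressionStore(abc.ABC):
    # Hypothetical storage interface for symbolic expressions.

    @abc.abstractmethod
    def get_expression(self, ref_id):
        # Return the symbolic expression for an SSA reference ID.
        ...

    @abc.abstractmethod
    def set_expression(self, ref_id, expr):
        # Persist (or cache) the symbolic expression under ref_id.
        ...

class InMemoryStore(ExpressionStore):
    # Default backend: keeps everything in a dict, like Triton today.
    def __init__(self):
        self._exprs = {}

    def get_expression(self, ref_id):
        return self._exprs[ref_id]

    def set_expression(self, ref_id, expr):
        self._exprs[ref_id] = expr

A database-backed adapter would implement the same two methods, serializing expressions on set_expression and deserializing them on get_expression.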
With the new symbolic memory model and the optimization modes you recommended above, memory usage quickly goes above 200 GB for fewer than 1M instructions processed in an hour. My machine then black-screened and rebooted because it ran out of memory. So yes, some persistent storage backend is quite necessary when running large-scale analysis.
Having the symbolic expressions and concrete values stored in a DB would not only enable recoverable processing but also the ability to run a lot of offline analyses without running the whole symbolization process again. For example, I could explore different slicings of the program with respect to different origins without doing a full symbolization run every time.
I totally agree and I would love++ having this. I thought a lot about this feature, but unfortunately I do not see a way to serialize and deserialize complex C++ objects (shared_ptr, weak_ptr, unique_ptr, AST structures, etc.) to disk and then back into memory =(. If you have ideas to solve this issue, feel free to discuss :)
With the new symbolic memory model and the optimization modes you recommended above, memory usage quickly goes above 200 GB for fewer than 1M instructions processed in an hour.
Yes, currently this is the main problem with using the ABV logic. This is due to our SSA form and the way we have to use the store and load operators from smt2lib. The store operator returns a new memory state, so our memory state is just a recursive chain of store operators, like:
ref1 = store(mem_init, xxx, xxx)
ref2 = store(ref1, xxx, xxx)
ref3 = store(ref2, xxx, xxx)
ref4 = store(ref3, xxx, xxx)
ref5 = store(ref4, xxx, xxx)
...
The problem with this design is that we have to keep all memory expressions alive, so almost no expression is ever freed as no longer used. I think, as a first step, we will finish implementing the array logic, and then, once it's clear in our minds, we can start thinking about how to optimize memory expressions.
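To see why nothing gets freed, here is a tiny Python analogy (not Triton code): each new memory state holds a reference to the previous one, so the entire chain stays reachable for as long as the latest state is alive.

# Each store wraps the previous state, mirroring the SSA chain above.
mem = ("mem_init",)
for i in range(5):
    mem = ("store", mem, f"addr{i}", f"val{i}")
# 'mem' now transitively references every earlier state; none of the
# intermediate states can be collected while 'mem' is reachable.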
Yes, I figured the large memory use is caused by keeping all of the memory SSA history around. In fact, this history is essential if I want to do backward slicing, so I don't think we need to change that; instead, we could add an option to free it for users who don't care about that kind of backtracking analysis.
As for the serialization/deserialization problem: since every SSA reference already has a unique ID (currently an increasing counter), we can use the ID in place of pointers. We could also use a pooled allocator to speed that up. This gives us the ability to offload SSA expressions and restore them later. A rough sketch of the idea is below. What do you think?
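As an illustration of the ID-based idea (hypothetical; the sqlite3 schema and function names below are assumptions, not a worked-out design):

import sqlite3

db = sqlite3.connect("exprs.db")
db.execute("CREATE TABLE IF NOT EXISTS expr ("
           "id INTEGER PRIMARY KEY, op TEXT, children TEXT)")

def offload(ref_id, op, child_ids):
    # Store a node by its SSA ID; child pointers become child IDs.
    db.execute("INSERT OR REPLACE INTO expr VALUES (?, ?, ?)",
               (ref_id, op, ",".join(map(str, child_ids))))
    db.commit()

def reload(ref_id):
    # Rebuild a node lazily; children are fetched by ID on demand.
    op, children = db.execute(
        "SELECT op, children FROM expr WHERE id = ?",
        (ref_id,)).fetchone()
    child_ids = [int(c) for c in children.split(",")] if children else []
    return op, child_ids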