
Store symbolic expressions to database and offload memory usage

Open hexpell opened this issue 2 years ago • 3 comments

When processing a trace with millions of instructions, memory is quickly used up storing symbolic expressions. It would be nice to have a feature to offload these expressions to on-disk storage, such as a database backend. It could also save processing progress so that iterative analysis can reuse the saved data.

hexpell avatar Aug 08 '22 06:08 hexpell

You can use some optimization modes:

    setMode(ALIGNED_MEMORY, True)
    setMode(CONSTANT_FOLDING, True)
    setMode(ONLY_ON_SYMBOLIZED, True)

JonathanSalwan avatar Aug 08 '22 09:08 JonathanSalwan

Yes, I'm already using these modes, but still, when there are hundreds of millions of instructions to process, memory use can quickly go out of control. An 8-hour run of 30 million instructions led to over 100GB of memory use.

hexpell avatar Aug 08 '22 17:08 hexpell

Yeah, everything depends on what you symbolize. The more you symbolize, the more memory you will consume, and at millions of instructions you pay the cost in RAM. The most I have done is 300 million instructions, but if I symbolize only what I'm looking for, I can keep my RAM below 16GB. May I know what you are analyzing?

JonathanSalwan avatar Aug 08 '22 19:08 JonathanSalwan

Let me experiment more with the ONLY_ON_SYMBOLIZED mode first. I'm currently hitting some bug; I'm not sure if it's my script, Triton, or my patches to Triton... I need to find some time to debug. But I will update here when I know more.

hexpell avatar Aug 13 '22 17:08 hexpell

I think I still need to solve this problem in some way, especially now that we have the symbolic memory model, which means runtime memory consumption could be much higher. Moreover, running over hundreds of millions of instructions can take days. If anything goes wrong in the process, I don't want my data to sit only in memory, where it will be lost and I'll have to start over.

Having the symbolic expressions and concrete values stored in a DB would not only enable recoverable processing but also the ability to run a lot of offline analyses without rerunning the symbolization process. For example, I could explore different slicings of the program with respect to different origins without doing a full run of the symbolization process every time.

Of course, we don't want Triton to integrate directly with any particular database backend, but we should have a generic symbolic expression storage interface that lets users implement their own database backends.

Basically, the idea is to do symbol lookup through a general interface: the caller passes in a symbol name or an associated register/memory address, and the interface returns the symbolic expression object; it can also set a symbolic expression on the store. The same applies to concrete values.

The default in-memory implementation of that storage interface would save everything in memory, as Triton does today. But an adapter that wants to offload memory usage to a database could serialize symbolic expressions and insert them into the database, evict less frequently used data from memory, and reload it from the database only when requested.
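To make the idea concrete, here is a minimal sketch of such a pluggable storage interface. All names (`ExpressionStore`, `InMemoryStore`, `SqliteStore`) are hypothetical, not part of the Triton API, and the JSON serialization is just a stand-in for whatever real expression encoding would be used:

```python
import abc
import json
import sqlite3

class ExpressionStore(abc.ABC):
    """Generic get/put interface for symbolic expressions keyed by SSA id."""

    @abc.abstractmethod
    def get(self, expr_id):
        ...

    @abc.abstractmethod
    def put(self, expr_id, expr):
        ...

class InMemoryStore(ExpressionStore):
    """Default behaviour: keep everything in RAM, as Triton does today."""
    def __init__(self):
        self._exprs = {}
    def get(self, expr_id):
        return self._exprs[expr_id]
    def put(self, expr_id, expr):
        self._exprs[expr_id] = expr

class SqliteStore(ExpressionStore):
    """Offloading adapter: serialize expressions into a database, reload on demand."""
    def __init__(self, path=":memory:"):  # a real deployment would pass an on-disk file
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS exprs (id INTEGER PRIMARY KEY, body TEXT)")
    def get(self, expr_id):
        row = self._db.execute(
            "SELECT body FROM exprs WHERE id = ?", (expr_id,)).fetchone()
        return json.loads(row[0])
    def put(self, expr_id, expr):
        self._db.execute(
            "INSERT OR REPLACE INTO exprs VALUES (?, ?)", (expr_id, json.dumps(expr)))
        self._db.commit()
```

Because the engine only talks to `ExpressionStore`, swapping the in-memory dict for the SQLite adapter (or any other backend) would not touch the symbolic execution code itself.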

hexpell avatar Sep 30 '22 02:09 hexpell


With the new symbolic memory model and the optimization modes you recommended above, memory usage quickly goes above 200GB for less than 1M instructions processed in an hour. My machine then black-screened and rebooted because it ran out of memory. So yes, some persistent storage backend is quite necessary for running large-scale analyses.

hexpell avatar Sep 30 '22 06:09 hexpell

> Having the symbolized expressions and concrete values stored in a DB will not only enable recoverable processing, and also the ability to run a lot of offline analysis without running the symbolization process all again. For example I can explore different slicing of the program respect to different origins without doing a full run of the symbolization process everytime.

I totally agree and I would love++ having this. I have thought a lot about this feature, but unfortunately I do not see a way to serialize and deserialize complex C++ objects (shared_ptr, weak_ptr, unique_ptr, AST structures, etc.) to disk and then back into memory =(. If you have ideas to solve this issue, feel free to discuss :)

> With the new symbolic memory model and those optimization modes you recommended above, its memory usage quickly goes above 200GB for less than 1M instructions processed in an hour.

Yes, currently this is the main problem with using the ABV logic. It comes from our SSA form and the way we have to use the store and load operators from smt2lib. The store operator returns a new memory state, so our memory state is just a recursive chain of store operators, like:

    ref1 = store(mem_init, xxx, xxx)
    ref2 = store(ref1, xxx, xxx)
    ref3 = store(ref2, xxx, xxx)
    ref4 = store(ref3, xxx, xxx)
    ref5 = store(ref4, xxx, xxx)
    ...

The problem with this design is that we have to keep all memory expressions, so almost no expression is ever freed as no longer used. I think, as a first step, we will finish implementing the array logic, and then, once it's clear in our minds, we can start thinking about how to optimize memory expressions.
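A toy model (not Triton code) makes the retention problem visible: because each new memory state holds a reference to the previous one, the entire store history stays reachable from the latest state and nothing can be garbage collected.

```python
class Store:
    """One smt2lib-style `store` node: references the previous memory state."""
    def __init__(self, prev, addr, value):
        self.prev = prev      # previous memory state (None models mem_init)
        self.addr = addr
        self.value = value

def chain_length(state):
    """Count how many store expressions the current state keeps alive."""
    n = 0
    while state is not None:
        n += 1
        state = state.prev
    return n

# Build ref1..ref5 from the listing above: after 5 writes, all 5 store
# expressions are still reachable from the newest state.
mem = None
for i in range(5):
    mem = Store(mem, 0x1000 + i, i)
```

After N memory writes, the newest state transitively pins all N store nodes, which is why memory use grows linearly with the trace length under the ABV logic.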

JonathanSalwan avatar Sep 30 '22 07:09 JonathanSalwan

Yes, I figured the large memory use is caused by keeping the entire memory SSA history around. In fact, this history is essentially necessary if I want to do backward slicing. So I don't think we need to change that, but we could probably add an option to free it for users who don't care about that kind of backtracking analysis.

As for the serialization/deserialization problem: since every SSA reference has a unique ID (currently an increasing counter), we can use the ID in place of pointers. We could also use a pooled allocator to speed that up. This gives us the ability to offload SSA expressions and restore them later. What do you think?
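A small sketch of this id-based idea, with all names hypothetical (not Triton's): each AST node is flattened to a row keyed by its SSA id, with children recorded as tagged `("ref", id)` or `("const", value)` entries instead of pointers, so the table can be written to and read back from any storage backend.

```python
def flatten(node, table, counter):
    """Store tuple-AST `node` = (op, child, ...) into `table`; return its id."""
    kids = []
    for c in node[1:]:
        if isinstance(c, tuple):
            # sub-expression: recurse and keep only its integer id
            kids.append(("ref", flatten(c, table, counter)))
        else:
            # leaf constant: store the value inline
            kids.append(("const", c))
    node_id = counter[0]
    counter[0] += 1
    table[node_id] = (node[0], kids)
    return node_id

def rebuild(node_id, table):
    """Reconstruct the tuple-AST from its flat, pointer-free representation."""
    op, kids = table[node_id]
    children = [rebuild(v, table) if kind == "ref" else v for kind, v in kids]
    return (op, *children)

table, counter = {}, [0]
root = flatten(("bvadd", ("bvmul", 2, 3), 4), table, counter)
```

Since `table` contains only integers, strings, and tagged pairs, each row can be serialized independently (e.g. to the database adapter discussed earlier) and reloaded on demand, without ever persisting raw pointers.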

hexpell avatar Sep 30 '22 16:09 hexpell