anvill icon indicating copy to clipboard operation
anvill copied to clipboard

Upstream and downstream type propagation

Open pgoodman opened this issue 3 years ago • 2 comments

Anvill does a limited amount of what could be charitably described as type propagation. It is mostly centred around the GetPointer function, and what it invokes. The merging of #62 means that, besides the address operands to memory read/write intrinsics, we have another jumping off point for type information.

The goals of upstream and downstream type propagation is to lift low-level operations into a slightly higher level. Here are some examples:

Problem examples

Downstream rewriting

Suppose we have used hinted type information to recover the following:

%y_ptr = __anvill_type_...(%y_orig)
%y = ptrtoint %y_ptr
%x = add %y, 8

Then our goal is to produce the following:

%y_ptr = __anvill_type_...(%y_orig)
%x_ptr = getelementptr %y_ptr, 0, N
%x = ptrtoint %x_ptr  ; Add to work list for processing

What is gained by this transformation, what is N, and how is this an improvement? This depends on the pointer type returned by __anvill_type_.... If it's a pointer to a structure, then N might be the index of one of the elements of the structure, so long as there is an element in that structure at byte offset 8. If it is a pointer to pointer, then N might be 1 on a 64-bit architecture, or 2 on a 32-bit architecture, i.e. 8 / sizeof(void *), telling us what index of an array we're accessing. For non-pointer types we can do similar arithmetic to try to convert the byte offset 8 into an array index. When this isn't possible, we can bitcast to a type and convert to GEPs, if we can get some better downstream type information.

Up-and-around rewriting

In x86 especially, the following pattern is common in function prologues:

push eax
push ebx
...
push ...

In bitcode, this might manifest as the following:

%esp = load %state, 0, ...  ; state->gpr.rsp.dword
%eax = load %state, 0, ... ; state->gpr.rax.dword
%ebx = load %state, 0, ... ; state->gpr.rbx.dword
...
%esp_less_4 = sub %esp, 4
%memory.0 = __remill_write_memory_32(%memory, %esp_less_4, %eax)
%esp_less_8 = sub %esp, 8
%memory.1 = __remill_write_memory_32(%memory.0, %esp_less_8, %ebx)
...
store %esp_less_N, ... ; state->gpr.rsp.dword -= N

From the perspective of GetPointer, we can observe that %esp_less_4 and %esp_less_8both need to bei32 *. These two values share a common data dependency at their root, and a common base type. Ideally, we'd like be be able to treat %espas ai32 *so that we can usegetelementptr` with negative indices to index into the stack, and then treat the prologue like writing into elements of an array. This can be done locally with an aggressive algorithm, or perhaps via some more global means.

Other examples

In practice, I have just looked at bitcode, looked for integer arithmetic that eventually leads to pointer casts, and tried to re-imagine how they could be lifted to a slightly higher level of abstraction.

What is needed

What is needed to work toward a solution to this problem? The current state of the codebase is just a mishmash of complex, recursive functions that look for fixed patterns -- it's not extensible. I don't know what the right solution is just yet. In theory, if we have c++ codegen for dr. lojekyll, then maaaaybe that could be useful, but that might be trying to fit a square peg into a round hole. If we had MLIR as an intermediate representation then I'd say we should use its transform infrastructure. I think we should also look into how people have tried to automate instcombine rules. There's probably a whole lot of work here.

Thus, we have two problems:

  • How can we effectively do these transforms
  • What are the transforms that we want to do

pgoodman avatar Jan 22 '21 20:01 pgoodman

I think we should also look into how people have tried to automate instcombine rules. There's probably a whole lot of work here.

Yeah, there's defintiely a ton of work done in llvm::InstCombinePass and adjacent classes / passes. I think at heart the pass is a subclass of llvm::InstVisitor. The rules are often hardcoded into the Visit* methods, but some also make use of llvm::PatternMatch to factor out some tree-based pattern matching. The pattern matching and rewriting is probably not as robust as MLIR's DAG-to-DAG rewriting, but I think it could serve our purposes pretty well.

I probably wouldn't recommend doing too deep of a dive into llvm::InstCombine, since it's huge and can be hard to navigate, but other subclasses of llvm::InstVisitor illustrate working with pretty well.

Finally there's llvm::AggressiveInstCombinePass which uses a bit more llvm::PatternMatch and tries to combine more complex instruction patterns. The drawback is that it's more expensive than regular llvm::InstCombinePass and shouldn't be run as often, or in loops.

surovic avatar Feb 02 '21 17:02 surovic

@carsonharmon I think some progress has been made on these? Curious if its still relevant.

artemdinaburg avatar Feb 19 '21 16:02 artemdinaburg