
Do binary instrumentation of allocation functions

Open stephenrkell opened this issue 6 years ago • 5 comments

Rather than doing a lot of hairy link-time stuff (see tools/allocscompilerwrapper.py) to interpose on allocation functions, it would be better to do it at run time. This should be less fragile, and may benefit from access to run-time type information. It will also avoid the need to relink the target binary whenever we change our list of allocation functions, making the system more convenient to use. It also opens up the possibility of inferring a "good guess" at the allocation functions themselves, by looking at the dynamic call tree, taking away some of the developer effort currently needed.

Allocation functions that are accessed via call/return are the easy case, so I propose to investigate a solution for those first. (The harder cases are alloca and inlined allocation functions.)

The idea is to trampoline-rewrite the entry and exit paths of allocation functions, to call out into stubs. These stubs should simply do the same things that our current ones do (as generated by stubgen.h). Since we have xed in our dependencies anyway, it may be possible to hand-roll a solution. Most allocation functions have at least five bytes' worth of prologue into which there are no inward jumps. In that case, all we really need to do is identify a 5-bytes-or-more "launch pad" at the start of the prologue, displace those instructions elsewhere (re-relocating them as necessary), replace them with a jump to our stub, and append a return instruction to the displaced copy. To handle the return path, online rewriting of the on-stack return address should be sufficient.
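A minimal sketch of the entry-path patch, assuming x86-64 and a rel32 jmp; patch_entry is hypothetical, and the displaced launch-pad instructions are assumed to have already been copied and re-relocated into the stub:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical sketch: overwrite the first 5 bytes of func's prologue
 * with "jmp rel32" to stub. Assumes (checked with xed) that the first
 * >= 5 bytes decode to whole instructions with no inward jump targets,
 * and that stub is within +/- 2GB of func. */
static int patch_entry(void *func, void *stub)
{
    uintptr_t page = (uintptr_t) func & ~0xfffUL;
    /* Two pages, in case the launch pad straddles a page boundary. */
    if (mprotect((void *) page, 0x2000, PROT_READ|PROT_WRITE|PROT_EXEC) != 0)
        return -1;
    int32_t rel = (int32_t) ((char *) stub - ((char *) func + 5));
    unsigned char jmp[5] = { 0xe9 };     /* jmp rel32 */
    memcpy(jmp + 1, &rel, sizeof rel);
    memcpy(func, jmp, sizeof jmp);       /* NB: not atomic w.r.t. other threads */
    mprotect((void *) page, 0x2000, PROT_READ|PROT_EXEC);
    return 0;
}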

If it gets hairier (e.g. if we have to get involved with diverting branches back into the 5-byte displaced chunk), DynInst (https://dyninst.org/) will start to look more worthwhile. Importing DynInst as a dependency has pros and cons; there is quite a lot of overlap with what we do. My preference, if possible, would be to build it under contrib/ as a static archive, from which we can pick out just the few functions we need and hopefully not pull in too much code.

stephenrkell commented on Mar 12 '19

Of course, generating trampolines at run time compromises debuggability, or more generally meta-completeness, unless we can generate debug info / metadata for our trampolines. See #16. Ideally we would create the trampolines in a libdlbind-allocated object, and write into it some DWARF info for them. Generating this DWARF may prove a bit tedious, but we could probably hard-code a template and generate from that. Or arguably, our trampolines are just like PLT entries and don't really need their own DWARF.

stephenrkell commented on Mar 19 '19

Perhaps an appealing way to do the hooking would be return-address hooking. But then we need to worry about breaking stack walkers, both the debugger's and our own. In our own code we can cope with it (make introspective clients see the "illusion" of the real stack), but breaking the debugger would be annoying. I wonder whether DynInst has a solution to this.
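To make that concrete, here is a minimal sketch of return-address hooking on x86-64 (GCC/clang syntax). Everything here is hypothetical, and it is deliberately not reentrancy- or thread-safe: a real version would keep a per-thread stack of saved return addresses.

/* Not thread-safe: real code needs a per-thread stack of these. */
static void *saved_retaddr;

void note_allocation(void);  /* C helper: records size/site metadata */

/* Called on entry to a hooked allocation function, passed a pointer
 * to the stack slot holding the caller's return address. */
void divert_return(void **retaddr_slot)
{
    extern void post_alloc_stub(void);
    saved_retaddr = *retaddr_slot;
    *retaddr_slot = (void *) &post_alloc_stub;
}

/* The return-path stub: preserve the returned pointer in %rax, call
 * the C hook, then jump to the real return address. While the diverted
 * frame is live, a naive stack walker sees post_alloc_stub here. */
__asm__(
    ".globl post_alloc_stub\n"
    "post_alloc_stub:\n"
    "   push %rax\n"                   /* save the returned pointer    */
    "   sub $8, %rsp\n"                /* keep 16-byte stack alignment */
    "   call note_allocation\n"
    "   add $8, %rsp\n"
    "   pop %rax\n"
    "   jmp *saved_retaddr(%rip)\n"    /* resume at the real caller    */
);

For our own walker, divert_return would also be the natural place to record the "illusion" mapping from post_alloc_stub back to saved_retaddr.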

stephenrkell commented on Mar 19 '19

I am leaning towards thinking that we should get rid of caller-side instrumentation altogether. That might work as follows.

  • only do callee-side instrumentation

  • insert type info by walking up the stack and finding a call to a classified alloc func, i.e. one classified by both the source-level (dumpallocs) and binary-level (objdumpallocs) analyses. That supposes we are still requiring users to identify the allocation functions; getting away without that is more a matter for #20, although some kind of bootstrapping based on instrumented built-in allocators (brk, mmap, stack) might just about work.

  • In general we want the outermost hit, to catch wrapper cases. Or do we? In a wrapper, we expect the in-wrapper site not to be classified. And if there can be allocations during allocation, the innermost classified call seems to be the right one. E.g. SPEC CPU2006's sphinx3 has

void **__ckd_calloc_2d__ (int32 d1, int32 d2, int32 elemsize,
                         const char *caller_file, int32 caller_line)
{
   char **ref, *mem;
   int32 i, offset;

   mem = (char *) __ckd_calloc__(d1*d2, elemsize, caller_file, caller_line);
   ref = (char **) __ckd_malloc__(d1 * sizeof(void *), caller_file, caller_line);

   for (i = 0, offset = 0; i < d1; i++, offset += d2*elemsize)
       ref[i] = mem + offset;

   return ((void **) ref);
}

But if we do this, caller-side instrumentation may need to come back for performance reasons. It basically short-circuits the stack walk: by maintaining __current_allocsite we don't need to walk the stack at all, and can zoom straight to the data we want. Maybe some clever caching can get us something almost as good.

Or maybe we can heuristically insert the caller-side instrumentation, at the binary level, after the first time we discover an allocation function by stack walking. We could patch the PLT, or maybe the call itself (for not-via-PLT cases). Indirect calls would still need to go through the PLT; indirect and not-via-PLT is a problem, and might have to fall back to always walking the stack. Or we could instead do our binary instrumentation at the called entry point. This is less clean because it will CoW the text pages every time.
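As a sketch, the lookup might become a fast path plus a slow path. __current_allocsite is real in liballocs (its declaration is approximated here); the walk uses libunwind purely for illustration, since liballocs has its own walker, and is_classified_site stands for a lookup against the dumpallocs/objdumpallocs data:

#define UNW_LOCAL_ONLY
#include <libunwind.h>

extern __thread void *__current_allocsite;   /* maintained by caller-side stubs */

_Bool is_classified_site(void *ip);          /* hypothetical classification lookup */

void *find_allocsite(void)
{
    /* Fast path: caller-side instrumentation already recorded the site. */
    if (__current_allocsite) return __current_allocsite;

    /* Slow path: walk up and take the innermost classified call site. */
    unw_context_t ctx; unw_cursor_t c; unw_word_t ip;
    unw_getcontext(&ctx);
    unw_init_local(&c, &ctx);
    while (unw_step(&c) > 0)
    {
        unw_get_reg(&c, UNW_REG_IP, &ip);
        if (is_classified_site((void *) ip)) return (void *) ip;
    }
    return NULL;  /* no classified site: treat as an unclassified allocation */
}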

stephenrkell commented on Oct 29 '21

Perhaps we should first rename __current_allocsite to __outermost_allocsite and use it to terminate the stack walk...

... then always choose the innermost (first) classified call site that we find on the walk?

This means we are sticking with caller-side instrumentation, which we had wanted to eliminate. I guess the first step is to do the above (which fixes sphinx3) and then measure the performance drop from skipping the __outermost_allocsite early termination.
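Revising the earlier sketch accordingly (the frame-walk helpers are hypothetical; liballocs has its own walker):

extern __thread void *__outermost_allocsite; /* the renamed __current_allocsite */

/* Hypothetical frame-walk helpers. */
struct frame;
struct frame *current_frame(void);
struct frame *caller_frame(struct frame *);
void *return_address(struct frame *);
_Bool is_classified_site(void *ip);

void *find_allocsite(void)
{
    for (struct frame *f = current_frame(); f; f = caller_frame(f))
    {
        void *site = return_address(f);
        if (is_classified_site(site)) return site;  /* innermost classified wins */
        if (site == __outermost_allocsite) break;   /* terminate: don't walk past it */
    }
    return NULL;
}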

stephenrkell commented on Mar 31 '22

... then, if the performance hit is a problem, replace caller-side wrapping with binary instrumentation of such callers, performed lazily (on the first stack walk that traverses a frame of the named allocation function). Otherwise just scrap it.
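A sketch of the lazy step (all names hypothetical; patch_call_site might reuse something like the patch_entry idea sketched earlier, or patch the PLT/GOT):

_Bool is_alloc_function(void *callee);    /* lookup against the named alloc funcs */
_Bool already_patched(void *call_site);
void patch_call_site(void *call_site);    /* insert caller-side instrumentation */

/* Called for each frame visited during a slow-path stack walk: the
 * first walk that traverses a call to a known allocation function
 * patches that call site, so later allocations take the fast path. */
void maybe_instrument_caller(void *call_site, void *callee)
{
    if (is_alloc_function(callee) && !already_patched(call_site))
        patch_call_site(call_site);
}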

That's still assuming we know the names of allocation functions, or at least of 'functions receiving a value with sizeofness'. If we do a more extensive sizeofness analysis, maybe at link time, we can perhaps eliminate that. It's a pretty standard call graph analysis, but has the usual problem of imprecision around indirect calls (for which we'd probably want to assume 'calls any type-compatible callee').

For the call-graph analysis, we'd want a per-function summary of sizeofness sourced and sizeofness sunk, but also sizeofness 'passed through': for any integer argument (or integer value read!) that turns out to have sizeofness, which callee argument (or value written!) would receive that sizeofness?
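For instance, the per-function summary might look something like this (entirely hypothetical, and covering only arguments; a full version would also track values read and written through memory):

#include <stdint.h>

/* Hypothetical per-function summary for the sizeofness analysis.
 * Bit i of each mask refers to integer argument i. */
struct sizeofness_summary
{
    uint32_t args_sourced;       /* args whose sizeofness originates here */
    uint32_t args_sunk;          /* args consumed as an allocation size   */
    struct {
        const char *callee;      /* callee symbol name, possibly UND      */
        uint8_t from_arg;        /* our argument index...                 */
        uint8_t to_arg;          /* ...flows to this callee argument      */
    } passed_through[8];         /* fixed cap, just for the sketch        */
    unsigned n_passed_through;
};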

A tricky case would be a cross-DSO call into a malloc wrapper. Though at link time we would detect that a given UND symbol receives some sizeofness. Maybe that is enough. It seems to need a further run-time step, accounting for the link map....

stephenrkell commented on Mar 31 '22