Provide a way to convert frame.f_lasti into an instruction

Open eric-wieser opened this issue 7 years ago • 8 comments

Here's the situation - you want to find the bytecode that was just executed in a frame, possibly in a trace func. Here are a pair of things you could want:

  • The current ConcreteInstr object corresponding to f_lasti:

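    def get_concrete_index(concrete_bc, code_index):
        # Walk the instructions, tracking the offset at which each one ends.
        at = 0
        concrete_index = 0
        for c in concrete_bc:
            at += c.size
            # Every instruction ending at or before code_index precedes the target.
            if at <= code_index:
                concrete_index += 1
        return concrete_index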
    

    Which can be used as:

    concrete_bc = ConcreteBytecode.from_code(frame.f_code)
    ci = get_concrete_index(concrete_bc, frame.f_lasti)
    concrete_instr = concrete_bc[ci]
    
  • The current Instr object corresponding to ci or f_lasti:

    def promote_concrete_index(bc, concrete_index):
        index = None
        at = 0
        for i, b in enumerate(bc):
            if at == concrete_index:
                index = i
            if isinstance(b, bytecode.instr.BaseInstr):
                at += 1
        return index
    

    Used as

    bc = concrete_bc.to_bytecode()
    i = promote_concrete_index(bc, ci)
    instr = bc[i]
    

  1. Does this code look correct for all cases?
  2. Does this make sense as a library addition? If so, how are these operations best exposed in the API?

eric-wieser avatar Nov 24 '16 23:11 eric-wieser
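
For comparison, the first lookup can also be sketched with only the standard library's `dis` module instead of this library's `ConcreteBytecode`. The helper name `instruction_at` is hypothetical, and choosing the last instruction that starts at or before the offset (rather than requiring an exact match) is an assumption made so that an offset falling inside an instruction still resolves to something.

```python
import dis
import sys


def instruction_at(code, offset):
    """Return (index, instruction) for the instruction covering `offset`.

    Hypothetical helper: walks dis.get_instructions(code) and keeps the
    last instruction whose start offset is <= `offset`. Returns None if
    no instruction starts at or before `offset`.
    """
    found = None
    for index, instr in enumerate(dis.get_instructions(code)):
        if instr.offset > offset:
            break
        found = (index, instr)
    return found


def demo():
    frame = sys._getframe()   # the frame of demo() itself
    lasti = frame.f_lasti     # offset of the last attempted instruction
    return lasti, instruction_at(frame.f_code, lasti)
```

The returned index refers to the list produced by `dis.get_instructions`, so it is suitable for slicing that list but is not an index into a `Bytecode` object.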

Context: writing withhacks.frame_utils.extract_code

eric-wieser avatar Nov 24 '16 23:11 eric-wieser

Hi, bytecode provides two very different kinds of instructions: abstract instructions (Instr) and concrete instructions (ConcreteInstr). Which one do you want?

Building a bytecode object from the same code is trivial, as you showed: ConcreteBytecode.from_code(frame.f_code)

The problem is for abstract bytecode: an abstract instruction is strongly linked to an abstract bytecode object, for example jumps use label objects stored in the bytecode object. frame.f_lasti is hard to compute from an abstract bytecode, because you have to assemble the abstract bytecode to concrete bytecode, resolve jumps, etc.

So maybe we should limit ourselves to concrete bytecode. I suggest that you add a method to ConcreteBytecode to get an instruction by its offset: return None if the offset is not exactly the start of an instruction, raise an IndexError if the offset is negative or out of the code.

Why not return the instruction directly from your promote_concrete_index() function?

vstinner avatar Nov 25 '16 08:11 vstinner
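
The lookup semantics proposed above could be sketched as follows. This is a standalone illustration over a plain list of instruction sizes, not the library's implementation; the name `index_at_offset` is hypothetical, and the sketch returns the index rather than the instruction object.

```python
def index_at_offset(sizes, offset):
    """Map a byte offset to an instruction index.

    `sizes` is a list of instruction sizes in bytes, standing in for the
    `size` of each concrete instruction. Returns None when `offset` falls
    inside an instruction, and raises IndexError when it is negative or
    past the end of the code.
    """
    if offset < 0:
        raise IndexError("offset is negative")
    start = 0
    for index, size in enumerate(sizes):
        if offset == start:
            return index
        if offset < start + size:
            return None  # inside an instruction, not at its start
        start += size
    raise IndexError("offset is past the end of the code")
```

Under these rules an offset equal to the total code length raises IndexError; a variant meant for slicing might instead return `len(sizes)` in that case.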

Which one do you want?

Ideally, I want to convert frame.f_lasti into an index for Bytecode.from_code(frame.f_code), but the easiest way to do that seems to be to go via an index into ConcreteBytecode.from_code(frame.f_code) first

frame.f_lasti is hard to compute from an abstract bytecode

I agree. But in this case, we have the intermediate concrete bytecode to work with too. What I'm looking for in my second bullet point is a mapping between ConcreteInstr and Instr objects - am I correct in assuming that the number of instructions in b and c, where b = c.to_bytecode(), is always the same?

suggest that you add a method to ConcreteBytecode to get an instruction by its offset: return None if the offset is not exactly the start of an instruction, raise an IndexError if the offset is negative or out of the code.

Sounds good, other than...

Why not return the instruction directly from your promote_concrete_index() function?

Because I want to analyze frame.f_code[frame.f_lasti:], so need the index for slicing.

eric-wieser avatar Nov 25 '16 10:11 eric-wieser

am I correct in assuming that the number of instructions in b and c, where b = c.to_bytecode(), is always the same?

Conversion from concrete to abstract bytecode should keep the same number of ConcreteInstr/Instr, but it adds Label objects. So a naive bytecode[index] doesn't work.

Because I want to analyze frame.f_code[frame.f_lasti:], so need the index for slicing.

Oh ok, so it makes sense to add a get_instr_index(offset) method. Getting the instruction object is as simple as bytecode[index] anyway.

vstinner avatar Nov 25 '16 10:11 vstinner
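
Because the conversion keeps the instruction count but interleaves Label objects, an index into the concrete list has to be translated by counting only instruction entries. A minimal sketch of that counting follows, using stand-in classes: `FakeInstr` and `FakeLabel` are hypothetical placeholders rather than the library's types, and the sketch assumes the concrete index counts instructions only.

```python
class FakeInstr:
    """Stand-in for an abstract instruction."""

    def __init__(self, name):
        self.name = name


class FakeLabel:
    """Stand-in for a label pseudo-instruction."""


def abstract_index(abstract_items, concrete_index):
    """Return the position in `abstract_items` of instruction number
    `concrete_index`, skipping entries that are not instructions.

    Returns len(abstract_items) when `concrete_index` equals the number
    of instructions, so the result can still be used as a slice bound.
    """
    seen = 0
    for i, item in enumerate(abstract_items):
        if isinstance(item, FakeInstr):
            if seen == concrete_index:
                return i
            seen += 1
    if seen == concrete_index:
        return len(abstract_items)
    raise IndexError("concrete_index is out of range")
```

For example, with the items `[FakeInstr, FakeLabel, FakeInstr]`, concrete index 1 maps to position 2 because the label is skipped.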

Getting the instruction object is as simple as bytecode[index] anyway.

Not quite, because bytecode[index] might be a SetLineno, right? Also, index might equal len(self), which is OK for slicing but not for lookup.

eric-wieser avatar Nov 25 '16 10:11 eric-wieser
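
The difference between slicing and lookup at an index equal to the length is ordinary Python sequence behaviour. It is shown here on a plain list standing in for a bytecode object; the instruction names are only illustrative.

```python
instrs = ["LOAD_FAST", "RETURN_VALUE"]  # stand-in for a list of instructions
end = len(instrs)

tail = instrs[end:]  # slicing at len() is valid and yields an empty list

try:
    instrs[end]      # lookup at len() is out of range
except IndexError:
    lookup_failed = True
else:
    lookup_failed = False
```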

get_instr_index(offset) is the most explicit.

index(value=None, code_offset=None), which overloads Sequence.index

Hum, what does Bytecode.index(instr) currently?

vstinner avatar Nov 25 '16 10:11 vstinner

Hum, what does Bytecode.index(instr) currently [do]?

Return the index such that Bytecode[i] == instr

get_instr_index(offset) is the most explicit.

I worry that there needs to be some explicit mention of the code object in the function or argument name

eric-wieser avatar Nov 25 '16 11:11 eric-wieser

@eric-wieser Although I confess that I find this meandering thread hard to follow, I have a general sense that this overall kind of thing has been done through various combinations of programs I've written. At a high level, the Python trepan debuggers have a disassemble command which shows the disassembly at a given frame offset. So in a sense they can retrieve instructions starting from that point.

Furthermore, the debuggers can show you a deparse of the code around that point, using the deparse command. Underneath, you have a list of bytecode instructions for the disassemble command, or a parse tree whose leaves are instructions (as namedtuples) for the deparse command.

The debuggers rely on the libraries uncompyle6 and xdis. But there is a library called xasm that corresponds more closely to bytecode, although in some respects it is more primitive.

Ideally, my projects would align more with this library.

rocky avatar Mar 01 '18 15:03 rocky