ScratchABlock icon indicating copy to clipboard operation
ScratchABlock copied to clipboard

Structuring switch statements

Open maximumspatium opened this issue 6 years ago • 9 comments

Hello Paul,

I recently picked up another real-world code snippet that makes use of the switch control operator. You'll find its Pseudo-C code here.

Recovering switch statements is another big task, so I decided to start somewhere and to conduct a couple of experiments.

Compiler usually employ at least the following two strategies in order to convert (aka lower) high-level switch blocks into assembly:

  • balanced binary search trees if case values are sparse or contain large gaps
  • jump table (if case values are dense)

The above mentioned snippet utilizes the former strategy. Although SABl does currently emit the correct C-output (what's the good thing you can be proud of!), humans will have hard time understanding that if-else spaghetti.

Because I wasn't able to find any suitable algorithm in the (de)compilation literature I tried to code up a recognition of such structures. Here is what I got so far.

Although I'm now getting all switch-related CFG nodes recognized and classified properly, I have no success trying to collapse all these nodes into a generic switch control class. I replace the header node with Switch(BBlock), remove all related nodes from CFG and then connect the header node with the switch successor node. This unfortunately seems to screw up the graph.

It looks like I'd need to implement another collapsing strategy because switch cases can contain other nested control structures to be structured first.

I'm thinking about removing all search nodes (i.e. the nodes required for binary search) and construct a n-way node branching to the case nodes. That would be probably the result of converting a jump table into CFG.

Do you have any suggestions on how to proceed?

maximumspatium avatar Jan 04 '18 02:01 maximumspatium

Some quick replies:

Recovering switch statements is another big task

My attitude towards that was: "well, I'm ok with having just if/if else chains" ;-).

balanced binary search trees if case values are sparse or contain large gaps

I believe I read up on that somewhere, and it makes good sense. But in all fairness, I don't remember ever seeing such a code - for balanced binary tree. Well, maybe because it's indeed spaghetti to human eye, and I didn't recognize an optimize search tree in cases I've seen.

I replace the header node with Switch(BBlock), remove all related nodes from CFG and then connect the header node with the switch successor node. This unfortunately seems to screw up the graph.

The reason for that would be that you're doing something wrong ;-). And the usual way to resolve it would be: 1) make a sample file with by minifying number of branches to process in it; 2) remove other unimportant code from it; 3) NEW! (of the last month) contribute as a testcase; 4) patiently work towards resolving issues with it.

Re: 3), I've added https://github.com/pfalcon/ScratchABlock/commit/f15358bcd48bffcc18c1760bc298afbdef7a59f6 which might address some concerns with contributing real-world code as testcases. It's quite adhoc, and btw, I recommend to generate PseudoC files with addresses prefixed to each instruction anyway (or it's pretty hard to review stages of processing).

It looks like I'd need to implement another collapsing strategy because switch cases can contain other nested control structures to be structured first.

Yes, what should be structured first, should be structured first. SABl currently does it in a typical monte-carlo way, which is neither optimal nor ideal. But that "allows" one to tighten up matching patterns (like a case you witnessed, where While matcher matched completely unrelated node and created broken graph, I bet you have a similar case). Refactoring that to a proper post-order matching is in my todo list, which requires first growing test corpus.

pfalcon avatar Jan 05 '18 23:01 pfalcon

I'm thinking about removing all search nodes (i.e. the nodes required for binary search) and construct a n-way node branching to the case nodes.

Sure, that's how it should be done. No idea if you can find such nodes all at once, and then convert all at once, or would collapse one by one into Switch node. Majority of current structuring passes are such incremental (I believe I can put it that way).

I'd also suggest to start with handling just "proper structural switch", without C hacks like missing break's.

pfalcon avatar Jan 05 '18 23:01 pfalcon

Btw, I played once with handling jumptables, yeah (was quite adhoc).

pfalcon avatar Jan 05 '18 23:01 pfalcon

make a sample file with by minifying number of branches to process in it contribute as a test case

Here is a testcase...

I suppose that the output diff should contain the desired switch statement despite the test breakage, right?

maximumspatium avatar Jan 06 '18 01:01 maximumspatium

Here is a testcase...

Looks simple enough to be grokkable by a human. Please submit as a PR.

I suppose that the output diff should contain the desired switch statement despite the test breakage, right?

No-no, the idea is that testsuite always passes and enables fear-free refactoring. So, initial submission should contain whatever master produces, and it should be updated as master updates (hopefully with improvements to output, not regressions).

pfalcon avatar Jan 06 '18 08:01 pfalcon

Btw, @maximumspatium, didn't you want to improve/elaborate .dot CFG output? ;-) (just trying to shove more work items from my queue to yours ;-) ).

pfalcon avatar Jan 06 '18 08:01 pfalcon

@maximumspatium : Continuing from https://github.com/pfalcon/ScratchABlock/pull/19#issuecomment-355748354, did you see: https://github.com/pfalcon/ScratchABlock/blob/master/decomp.py#L176 ?

That should look familiar, and can be summarized as: SABl already contains passes to structure sequences of if statements as a switch like statement ("well, I'm ok with having just if/if else chains"). Exception they work with "linear" sequences, not trees.

New pass(es) you write would thus subsume that processing (error in https://github.com/pfalcon/ScratchABlock/pull/19#issuecomment-355748354 hints about that). Of course, in my list, first my passes should be "recovered" to work as expected (they did once upon a time). The testcase for that is https://github.com/pfalcon/ScratchABlock/tree/master/tests/decomp/030-fun_bfffd0eb

pfalcon avatar Jan 06 '18 16:01 pfalcon

@maximumspatium, When dealing with if's, pay attention to abnormal selection patterns. (Google "abnormal selection" structure if in doubt. The general idea is that all branches of if/else should be dominated by the if header. If not, it's abnormal selection.) Currently, SABl doesn't test for abnormal selection thoroughly, which leads to incorrect decompilation, e.g. https://github.com/pfalcon/ScratchABlock/blob/master/tests/decomp/_xtos_set_interrupt_handler_arg/_xtos_set_interrupt_handler_arg.lst.exp.clean

pfalcon avatar Jan 19 '18 23:01 pfalcon

Generally, there's a gap with true structuredness and C-structuredness, which is more expressive. It may seem that it applies to loops, but it turns out it applies even to if too. "True" structuredness is when each structure element has one entry and one exit, up to a complete function. C-structuredness allow multiple exits (but not entries).

For example, following nice-looking C if is only nice-looking thanks to 2 returns:

if (cond1) {
  // do sth
  if (cond2) {
    // do sth
    return 1;
  }
}
return 0;

That however contains abnormal selection. Converted to fully structured shape, it's more verbose and less nicely looking.

So, the point is that SABl so far follows "fully structured" way, which means boring splitting nodes to resolve abnormal selection. But to get nice concise C-structuredness, that will later need to be reverted ;-I.

Or maybe not, maybe we can detect ladder-style abnormal selection (when one branch is dominated by header, and another - not) and apply multiple-return transformation. It's hard to decide what to do when and even harder to actually do and prove that it's right.

So, apparently stick to fully structured way for now. At least that makes it MISRA-compliant decompiler, LOL.

pfalcon avatar Jan 20 '18 00:01 pfalcon