
Some issues while analyzing an MSDOS binary

Open pulkomandy opened this issue 7 months ago • 2 comments

Hello,

I am attempting to reverse engineer an MSDOS executable: http://pulkomandy.tk/drop/CDISA80.EXE

I am using reko 0.12.0.0 (last release at the time of writing).

I have found some problems:

  • There is a segment called "code" but it seems empty. The functions are in another segment called 0810
  • 3 functions generate an error during analysis: "An error occured while renaming variables. The given key 'l0810_1441' was not present in the dictionary."
  • "Reconstruct data types" step gives two warnings for "Non-integral switch expression"
  • A function at 0810:B3BF is not disassembled. It shows as hex dump in the disassembly view and as <anonymous> <unnamed> = <code>; in the decompiled view
  • "Structure editor" view in the GUI lists no structures. I don't know if this is supposed to be filled automatically, if the previous errors are related to it being empty, or what is going on. The decompiled code uses various structures (named like Eq_52712) but I don't see where these are defined. In the files output of the disassembly they are in a .h file, so is it just missing support in the GUI currently? How would I go about renaming some structures if that's possible?
  • I think the analysis process identified several constants in the 5C09 segment, and I can see them in the generated .h file. But I do not see them in the GUI (the segment is visible as a monolithic hex dump and there is no way to organize it or name things as far as I can see).

Some improvement suggestions:

  • I can edit prototypes for methods to define the type and name of the parameters. However, when doing so, the parameter names are shown in the prototype of the function, but not in the decompiled body. So it is hard to keep track of which parameter is which after doing that, and it makes the code further from compiling.
  • There seems to be no way to rename segments. In this case it's not too bad because there are only two segments in use, I can remember that one is the code and one is the constants data (the others are for variables, but they are not relevant for analysis of the code as they all contain 00).
  • The C standard functions are not identified automatically (in this case I think they are from the Microsoft C compiler, judging from some strings in the executable). In this specific case I think a lot of the code in the executable is actually from the standard library: string handling, probably a printf implementation, memory allocation, file I/O. If there is a way to automatically recognize these, it would make my job investigating the other parts of the program a lot easier. Likewise for the startup code before the main function (I saw this done automatically for Borland runtime in some other exe I have tested with?)

More generally I'm not sure how to set my expectations. In the past I have used disassemblers (not decompilers) where the process was iterative: run the tool, look at the output, modify a script file to hint the tool at what it got wrong, re-run the tool, and repeat (for example I do this with d52: https://github.com/jblang/d52, where I can not only name functions, but also add comments and annotations as I understand things). Is the workflow for reko somewhat similar to this? Or is it more a one-shot thing?

pulkomandy, May 05 '25 11:05

Thanks for taking the time to write such a detailed issue report! I'll address your points in order.

* There is a segment called "code" but it seems empty. The functions are in another segment called 0810

Your program is packed with EXEPACK. When Reko first loads an MS-DOS binary, it detects the packer and executes the unpacker script. The unpacked executable code is put in a fresh ImageSegment arbitrarily called "code". Then, the "code" segment is chopped up into sub-segments once all segment selectors are discovered. Once you have the capability of renaming segments, this becomes less of an issue I think.

* 3 functions generate an error during analysis: "An error occured while renaming variables. The given key 'l0810_1441' was not present in the dictionary."

I will investigate these errors.

* "Reconstruct data types" step gives two warnings for "Non-integral switch expression"

It is quite likely that those warnings are caused by other errors in the decompilation. In my experience with Reko, it's often been the case that errors early in the decompilation cause other errors to cascade forth, especially type errors. Once these are addressed, these errors tend to go away.

* A function at 0810:B3BF is not disassembled. It shows as hex dump in the disassembly view and as `<anonymous> <unnamed> = <code>;` in the decompiled view

This is likely caused by a segmented pointer to 0810:B3BF discovered during type inference. This happens after Reko's code-discovery phase ("scanning"), and Reko will not make an attempt to traverse the discovered pointer during this run. There should be a mechanism to remember all discovered pointers, and allow the operator to review them, discard some if necessary, and use them as "roots" for the next iteration of the decompiler.

Another approach that might give better results is to turn on the Shingle scan heuristic. When first opening an executable, select it in the Project browser and then run the Edit > Properties menu command. When the property sheet dialog appears, select the Scanning tab and then Shingle heuristic. This causes Reko to become very optimistic about what is code, and may discover the function at 0810:B3BF without further assistance.
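For readers unfamiliar with segmented pointers such as 0810:B3BF, here is a minimal Python sketch of how a real-mode 8086 segment:offset pair maps to a linear address (segment times 16 plus offset, wrapped to 20 bits). The helper names `parse_seg_off` and `to_linear` are made up for this illustration and are not part of Reko; the segment value is the one Reko displays for this binary, so the linear address is relative to however the image was loaded.

```python
def parse_seg_off(text: str) -> tuple[int, int]:
    """Split a 'SSSS:OOOO' hex string into (segment, offset) integers."""
    seg_text, off_text = text.split(":")
    return int(seg_text, 16), int(off_text, 16)


def to_linear(segment: int, offset: int) -> int:
    """Real-mode 8086 translation: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


seg, off = parse_seg_off("0810:B3BF")
print(f"{seg:04X}:{off:04X} -> linear {to_linear(seg, off):05X}")
# prints: 0810:B3BF -> linear 134BF
```

A pointer like this only becomes a useful scanning "root" once something confirms that the bytes at the target address are code, which is why reviewing discovered pointers before the next run matters.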

* "Structure editor" view in the GUI lists no structures. I don't know if this is supposed to be filled automatically, if the previous errors are related to it being empty, or what is going on. The decompiled code uses various structures (named like Eq_52712) but I don't see where these are defined. In the files output of the disassembly they are in a .h file, so is it just missing support in the GUI currently? How would I go about renaming some structures if that's possible?

Indeed, the structure editor is not completed yet. I've been meaning to give more love to the Reko GUIs but I'm a bit swamped with the other parts of Reko (+ "real life"). It would be very useful to hear your thoughts on what the structure editor user interface should look like.

* I think the analysis process identified several constants in the 5C09 segment, and I can see them in the generated .h file. But I do not see them in the GUI (the segment is visible as a monolithic hex dump and there is no way to organize it or name things as far as I can see).

That's a shortcoming that should be addressed.

Some improvement suggestions:

* I can edit prototypes for methods to define the type and name of the parameters. However, when doing so, the parameter names are shown in the prototype of the function, but not in the decompiled body. So it is hard to keep track of which parameter is which after doing that, and it makes the code further from compiling.

Good point. This should naturally be addressed.

* There seems to be no way to rename segments. In this case it's not too bad because there are only two segments in use, I can remember that one is the code and one is the constants data (the others are for variables, but they are not relevant for analysis of the code as they all contain 00).

Indeed there is currently no way of doing this. This is a good feature suggestion, and should be folded into the work on the segment editor (which is quite similar to the structure editor in many ways).

* The C standard functions are not identified automatically (in this case I think they are from the Microsoft C compiler, judging from some strings in the executable). In this specific case I think a lot of the code in the executable is actually from the standard library: string handling, probably a printf implementation, memory allocation, file I/O. If there is a way to automatically recognize these, it would make my job investigating the other parts of the program a lot easier. Likewise for the startup code before the main function (I saw this done automatically for Borland runtime in some other exe I have tested with?)

This is a frequently requested feature. It's challenging to find a solution to matching instructions that isn't too tied to a particular CPU architecture. Some tricks for matching instructions work great on x86, but not as well on RISC architectures and vice versa. I think initially it would be useful for a user to be able to point at a function and say: "I know this is malloc, so remember that in my personal settings, and in the future, when you see this pattern of bytes / basic blocks / opcodes, just call it malloc". Naturally those saved "fingerprints" should be exportable, and it should be possible to generate them automatically by loading a .lib or .a file. Then issues arise about how to disambiguate functions whose "fingerprints" are identical, and so on. What would you suggest as a minimum viable feature for "function fingerprinting"?
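As a rough illustration of the "remember this pattern of bytes" idea, here is a minimal Python sketch. Everything in it is hypothetical: the `FingerprintDb` class, its JSON layout, and the sample bytes are invented for the example, and none of it is an existing Reko feature or API. It hashes a function body after zeroing the bytes expected to differ between binaries (here, a far-call operand), and `match` returns a list so that identical fingerprints show up as an ambiguity instead of being silently resolved.

```python
import hashlib
import json


def _digest(code: bytes, wildcards) -> str:
    """SHA-256 of the code with every wildcard offset zeroed out."""
    skip = set(wildcards)
    masked = bytes(0 if i in skip else b for i, b in enumerate(code))
    return hashlib.sha256(masked).hexdigest()


class FingerprintDb:
    """Hypothetical store of user-confirmed function fingerprints."""

    def __init__(self):
        self.entries = []  # dicts with name, length, wildcards, digest

    def remember(self, name: str, code: bytes, wildcards=()) -> None:
        """Record `code` under `name`, ignoring the bytes at `wildcards`."""
        offsets = sorted(set(wildcards))
        self.entries.append({
            "name": name,
            "length": len(code),
            "wildcards": offsets,
            "digest": _digest(code, offsets),
        })

    def match(self, image: bytes, start: int) -> list:
        """Return every remembered name whose fingerprint matches at `start`."""
        names = []
        for e in self.entries:
            candidate = image[start:start + e["length"]]
            if (len(candidate) == e["length"]
                    and _digest(candidate, e["wildcards"]) == e["digest"]):
                names.append(e["name"])
        return names

    def to_json(self) -> str:
        """Export the fingerprints so they can be shared."""
        return json.dumps(self.entries, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "FingerprintDb":
        db = cls()
        db.entries = json.loads(text)
        return db


# Made-up 16-bit x86 bytes: push bp; mov bp,sp; call far 5678:1234; pop bp; retf
known = bytes.fromhex("558BEC9A341278565DCB")
db = FingerprintDb()
db.remember("my_helper", known, wildcards=range(4, 8))  # mask the call operand

# The same code in another image, preceded by two NOPs, with a different target.
image = bytes.fromhex("9090") + bytes.fromhex("558BEC9AEFBEADDE5DCB")
print(db.match(image, 2))  # ['my_helper']
print(db.match(image, 0))  # []
```

Plain byte hashing like this is tied to one instruction encoding, which is exactly the architecture-dependence concern raised above; matching on basic blocks or opcodes would need a different fingerprint function but could keep the same store-and-export shape.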

In your binary, as a first step, you could mark the code at address 0810:B564 as being alloca, with the following C signature:

[[reko::returns(register,"ax")]] void __near * __near alloca([[reko::arg(register, "ax")]] int bytes);

That will greatly improve the output of the decompilation.

More generally I'm not sure how to set my expectations. In the past I have used disassemblers (not decompilers) where the process was iterative: run the tool, look at the output, modify a script file to hint the tool at what it got wrong, re-run the tool, and repeat (for example I do this with d52: https://github.com/jblang/d52, where I can not only name functions, but also add comments and annotations as I understand things). Is the workflow for reko somewhat similar to this? Or is it more a one-shot thing?

Reko's intended workflow is similar to the one you're describing. You load and unpack the code, scan it to find some procedures, add comments, rename procedures, and in particular, let Reko know about types. Then you iterate, by restarting the decompiler, but now armed with more information.

uxmal, May 06 '25 00:05

Your program is packed with EXEPACK. When Reko first loads an MS-DOS binary, it detects the packer and executes the unpacker script. The unpacked executable code is put in a fresh ImageSegment arbitrarily called "code". Then, the "code" segment is chopped up into sub-segments once all segment selectors are discovered. Once you have the capability of renaming segments, this becomes less of an issue I think.

Not really an issue, but it is a bit confusing in the UI. If these are sub-segments of "code", maybe they should show a level under it in the project tree? Or maybe the remaining empty "code" segment should be hidden, as it is useless.

This is likely caused by a segmented pointer to 0810:B3BF discovered during type inference. This happens after Reko's code-discovery phase ("scanning"), and Reko will not make an attempt to traverse the discovered pointer during this run.

I have continued analyzing the binary, and this seems to be the runtime error handler from the C runtime (called in case of stack overflows, no FPU present, etc). Maybe it does not follow the standard ABI.

It would be very useful to hear your thoughts on what the structure editor user interface should look like.

I don't really know, having just started with this. I guess there should be a list of structures, and I should be able to see and edit their content (field names and types)? A text editor where I can change the structure, maybe with a C-like syntax, should be fine for a start?

This is a frequently requested feature. It's challenging to find a solution to matching instructions that isn't too tied to a particular CPU architecture. Some tricks for matching instructions work great on x86, but not as well on RISC architectures and vice versa.

I don't really expect a fully generic automatic matching. The checksum matching and exportable "templates" as you describe would be fine. Maybe it can be done with some scripting already? When identifying a specific runtime library (in this case, some old version of Microsoft C), maybe some structures can also be identified (things like the FILE structure used internally by fopen) to help with guessing datatypes in other parts of the code?

I'm happy to share my reko project for this executable, where I have identified several functions. But I have not written many function prototype annotations yet.

In your binary, as a first step, you could mark the code at address 0810:B564 as being alloca, with the following C signature:

I continued analyzing the code (also with IDA) and I came to the conclusion that this is not the alloca function. It is used in the function prologue of many functions to allocate space for the local variables. I don't know if reko has understanding of this and uses it to reconstruct local variables.

And, unfortunately, I am unable to re-run the analysis after reloading my project. Some functions give me "Multiple different values of stack delta in procedure XXX when processing RET instruction", then "An internal error occured while analyzing CDISA80.EXE. Object reference not set to an instance of an object." Further errors follow in the next steps of the analysis ("Phi functions will be ignored"), and the functions are indeed not fully analyzed anymore.

Speaking of that, it would be great to have a way to copy these errors from the Diagnostics tab to the clipboard so I can share them here.
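For anyone reading along, here is one way to picture the "Multiple different values of stack delta" message, assuming it means that different paths through a procedure reach a RET with different net changes to the stack pointer. This Python sketch is only an illustration of that general idea on a toy control-flow graph; the `ret_deltas` function and the block tables are made up and say nothing about how Reko actually implements the check.

```python
def ret_deltas(blocks: dict, entry: str, limit: int = 1024) -> set:
    """Collect the net stack-pointer change seen at every reachable RET.

    `blocks` maps a label to (sp_change, successors); a block with no
    successors is treated as ending in RET. `limit` stops runaway loops
    whose body keeps changing the stack pointer.
    """
    seen = set()          # (label, delta on entry) pairs already explored
    results = set()
    work = [(entry, 0)]
    while work:
        label, delta = work.pop()
        if (label, delta) in seen or abs(delta) > limit:
            continue
        seen.add((label, delta))
        change, successors = blocks[label]
        delta += change
        if not successors:
            results.add(delta)   # reached a RET with this net delta
        for succ in successors:
            work.append((succ, delta))
    return results


# Both branches leave the stack as they found it: one consistent delta.
balanced = {
    "entry": (-2, ["then", "else"]),   # push bp
    "then": (0, ["exit"]),
    "else": (0, ["exit"]),
    "exit": (+2, []),                  # pop bp; ret
}
# The "then" branch pushes a word that is never popped.
unbalanced = dict(balanced, then=(-2, ["exit"]))

print(ret_deltas(balanced, "entry"))            # {0}
print(sorted(ret_deltas(unbalanced, "entry")))  # [-2, 0]
```

When the second case occurs, an analysis cannot pick a single frame layout for the procedure, which would fit the follow-on errors described above.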

pulkomandy, May 06 '25 07:05