
Average Compiler Entropy Measuring

Open jcampbell05 opened this issue 10 months ago • 4 comments

I was watching the YouTube video that explained the issues this project has been having with compiler entropy, which makes it really hard to get an accurate measurement of how close the decompilation is.

From what I remember, the compiler can embed things like timestamps into the build, which can shift the memory layout a bit. If you want a more accurate measurement, I would suggest building the project multiple times and averaging the accuracy.

James

jcampbell05 avatar May 08 '25 16:05 jcampbell05

Thank you for the suggestion. We are already doing what we call "entropy builds" (multiple builds with different entropy) and aggregating the data to get a better measurement of accuracy: https://github.com/isledecomp/isle/releases/tag/continuous-accuracy
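
To illustrate the aggregation idea in the abstract (this is a hypothetical sketch, not the project's actual tooling; the function names and data shape are made up for illustration): given per-function match ratios from several entropy builds, averaging smooths out layout-induced noise, and the spread hints at whether a mismatch is entropy or real code differences.

```python
# Hypothetical sketch of aggregating per-function accuracy across
# several "entropy builds". All names and data are illustrative.
from statistics import mean, stdev

def aggregate_accuracy(builds):
    """builds: list of {function_name: match_ratio} dicts, one per build."""
    names = set().union(*builds)  # union of all function names seen
    summary = {}
    for name in names:
        scores = [b[name] for b in builds if name in b]
        summary[name] = {
            "mean": mean(scores),
            # Zero spread: the function matches identically in every build.
            # High spread: the difference is likely entropy, not the code.
            "spread": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary

builds = [
    {"Isle::Tick": 0.98, "Score::Paint": 0.75},
    {"Isle::Tick": 0.98, "Score::Paint": 0.81},
    {"Isle::Tick": 0.98, "Score::Paint": 0.78},
]
summary = aggregate_accuracy(builds)
```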

However, the progress SVGs in the README.md don't reflect these numbers yet. CC @disinvite

foxtacles avatar May 08 '25 19:05 foxtacles

Maybe also to clear up some confusion: The same code always produces the same binary (maybe up to insignificant timestamps). The actual problem is that code changes in one place influence unchanged code elsewhere in the compilation unit, so changing a single function can impact hundreds of others.

jonschz avatar May 09 '25 06:05 jonschz

> Maybe also to clear up some confusion: The same code always produces the same binary (maybe up to insignificant timestamps). The actual problem is that code changes in one place influence unchanged code elsewhere in the compilation unit, so changing a single function can impact hundreds of others.

Thanks for the clarification. Sounds like it could be page alignment, which is tricky business.

jcampbell05 avatar May 09 '25 09:05 jcampbell05

So I've been thinking about this a little more, and I want to try out an idea I had.

From what I understand, you have non-matching reverse engineering, which makes it easier to write the code but hard to verify it works the same way.

And byte matching, which is harder to write but easier to verify using tools.

Instruction matching is somewhere in the middle, but even that's difficult with compiler entropy changing instructions when optimising.
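
As a toy sketch of why instruction-level matching tolerates some entropy (this is not the project's actual tooling, just an illustration): compare mnemonics and operand shapes while masking absolute addresses, which shift whenever the layout changes.

```python
# Toy sketch: two instruction streams count as matching if they are
# identical after masking absolute hex addresses, which move around
# between builds even when the code is unchanged.
import re

ADDR = re.compile(r"0x[0-9a-fA-F]+")

def normalize(instructions):
    return [ADDR.sub("<addr>", ins) for ins in instructions]

def instructions_match(a, b):
    return normalize(a) == normalize(b)

original = ["mov eax, [0x100DF4A0]", "call 0x100151E0", "ret"]
# Same code, but the layout shifted, so every address is different:
rebuilt = ["mov eax, [0x100DF7B0]", "call 0x10015300", "ret"]
```

Real tools have to be smarter (relocations, register allocation, reordered blocks), but the masking idea is the core of it.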

I was wondering if we could instead do read/write matching, to get the best of the higher-level instruction/non-matching approach but with lower-level verification.

Since we know the location of each function in the binary, my idea would be to write a simple hook into that function (or a debugger) which can record the values passed into and out of that function during a standard play-through.

Once we have a database of the inputs we expect each function to be fed, and a database of the values we expect it to generate, we could in theory create a kind of unit test: feed the reverse-engineered function those recorded inputs and check that it writes out the recorded outputs.
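
A minimal sketch of this record/replay idea, assuming pure functions for simplicity (real game functions also touch global state, which the recording would have to capture too). All names here are hypothetical, not part of any existing tool:

```python
# Record/replay sketch: hook a function to log (args, result) pairs,
# then replay the log as a unit test against a reimplementation.
import functools

RECORDINGS = []

def record(fn):
    """Hook that logs every (name, args, result) tuple during a play-through."""
    @functools.wraps(fn)
    def wrapper(*args):
        result = fn(*args)
        RECORDINGS.append((fn.__name__, args, result))
        return result
    return wrapper

def replay(reimplementation, recordings, name):
    """Check the reimplementation against every recorded input/output pair."""
    for fname, args, expected in recordings:
        if fname == name and reimplementation(*args) != expected:
            return False
    return True

# 1. Record the "original" function during play (a stand-in example):
@record
def original_damage(hp, hit):
    return max(hp - hit, 0)

original_damage(100, 30)
original_damage(10, 30)

# 2. Verify a decompiled candidate against the recordings:
def decompiled_damage(hp, hit):
    return max(hp - hit, 0)

ok = replay(decompiled_damage, RECORDINGS, "original_damage")
```

A candidate that gets an edge case wrong (say, forgetting the clamp to zero) would fail the replay even though it looks close at the source level.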

So we could have the ease of writing non-matching / instruction-level code, but with the ease of verifying and measuring accuracy that byte-level matching gives, applied to the results of these functions.

Aka, if we know the memory state looked like X after running the function in the original game, then we should be able to validate the same with the decompiled function.

In theory, with the ability to compile our own LEGO Island, we could inject the original functions to gather this data more easily than with the original binary, which might have needed DLL injection.

jcampbell05 avatar May 11 '25 15:05 jcampbell05