[Feature request] Generating IL that can be recompiled back after being decompiled to C# by DotPeek

Open claunia opened this issue 5 years ago • 53 comments

I don't know if it can be done easily, but, well, it would make it effectively a good, and free, C to C# converter.

claunia avatar May 20 '20 00:05 claunia

This is quite literally impossible because of how C++/CLI works. A lot of C constructs don't exist in C#, you might get bindings or some attempts at generators, but a full TRANSPILER is a lot of effort for minimalistic gains in this area.

chuggafan avatar May 20 '20 02:05 chuggafan

Just generate the DLL or EXE using the occil and use a .net decompiler to generate the C# code...

https://www.jetbrains.com/decompiler/

bencz avatar May 20 '20 02:05 bencz

That only works if the entire output is managed code instead of mixed-managed code, which is an entirely possible thing that can happen. I don't think jet-brains works if it's mixed-managed.

chuggafan avatar May 20 '20 02:05 chuggafan

@chuggafan this has NOTHING to do with C++/CLI. This is a C compiler that generates IL and assembles PURE .NET; there's no mixed-managed code here (besides P/Invokes, but those are managed).

@bencz there are IL constructs that are not legal C#, and while JetBrains happily gives you C# for them, you cannot compile it back.

claunia avatar May 20 '20 02:05 claunia

Well, why not put in the work to get the C# to be re-compilable? Effectively would be the same thing, and remaking the compiler into a (decent) transpiler is an incredible amount of work because you'd have to change the frontend, middle-end, and backend all of which would have to be modified to effectively produce C# code.

Is there anything that's reasonable within the scope of the current compiler to achieve this or would be effective in helping us reach this feature easier? This would also require a lot of help because there's effectively 1 person (I barely contribute atm) doing all of the work on this. Even though I love to program in C# I don't see how this fits in with the already large list of milestones that exist or any of the "theoretical" goals of this compiler set.

chuggafan avatar May 20 '20 02:05 chuggafan

Well, you say put in the work to get the C# to be recompilable, to which I ask: why not add a flag to emit only IL that is legal C#? There's already a flag to encase it in a class.

And this wouldn't require rewriting the whole compiler, that's like saying changing a compiler to stop emitting i686 instructions and emit only i586 requires rewriting the whole compiler. And this is not true.

As to how it fits, there is a lot of old C code, and newer C code, that doesn't have a proper C# equivalent, cannot be easily ported (mostly because of the preprocessor), and cannot be just P/Invoked, as some environments do not allow this (like WebAssembly).

claunia avatar May 20 '20 04:05 claunia

while I agree with you @claunia it could be done with a rewrite to the backend (or perhaps an entirely new backend dedicated to this purpose) and maybe some minor alterations to the front end to pass the names of enumerations through, I also think this is kinda beyond the scope of what I'm trying to do here; also even though I might do it anyway there is a lot of work piled up ahead of it, currently I have five milestones planned and none of them are trivial to do. The results might also be sub-optimal, a single file with an entire program, possibly expressions being done naively (as I don't know how much effort it would be to take the expressions as found in the backend and make complicated statements out of them), a lot of pinvoke declarations for msvcrt.dll, and so forth. Also strings in C can not be considered the same as strings in C# because they have to be available to the C runtime, so there would have to be some way to allow for that. Right now the IL just makes byte arrays for them but it might get trickier to do it right for pure C# code...

While it could be done, I just don't have the time at the moment to do it...

LADSoft avatar May 20 '20 12:05 LADSoft

@LADSoft What do you agree to?

@claunia said one thing that I agree may be relatively trivial, and then useful for later post-processing outside of OrangeC:

put a flag to emit only IL that is legal C#

Is this possible?

GitMensch avatar May 20 '20 12:05 GitMensch

@LADSoft I am not in any hurry

About strings, P/Invoke has a way to automarshal strings to char*. The problem is detecting where char* IS a string and where it is a byte array. Easiest way, just let the user fix that.
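To illustrate why a converter cannot tell the two uses apart from the type alone, here is a minimal C sketch. The helper names (`text_length`, `buffer_sum`, `char_pointer_demo`) are made up for this example and are not part of szip or OrangeC. Both helpers take the same `const char *`, but only one of them may rely on a terminating NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: treats the pointer as a C string (stops at the first NUL). */
size_t text_length(const char *s)
{
    return strlen(s);
}

/* Hypothetical helper: treats the pointer as raw bytes, so it needs an
   explicit length because zero bytes are valid data. */
unsigned buffer_sum(const char *buf, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum += (unsigned char)buf[i];
    return sum;
}

/* Same parameter type, two different meanings. */
void char_pointer_demo(void)
{
    const char text[] = "szip";          /* 5 bytes: 's','z','i','p','\0' */
    const char raw[4] = { 1, 0, 2, 3 };  /* binary data with an embedded zero */

    assert(sizeof text == 5);
    assert(text_length(text) == 4);
    assert(buffer_sum(raw, sizeof raw) == 6);
    /* Read as a string, raw would appear to end at index 1. */
    assert(memchr(raw, 0, sizeof raw) == raw + 1);
}
```

Calling `char_pointer_demo()` from a `main` runs the checks. Nothing in the signatures distinguishes the two cases, which is why leaving that decision to the user seems like the pragmatic choice.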

In my opinion, if the IL can be passed to a decompiler, pasted as C#, and compiled again, most of the work is done, and whatever is left to fix manually is still a very minor fraction of what manual code conversion involves.

Again I'm not in any hurry.

claunia avatar May 20 '20 12:05 claunia

@claunia, the main problem with this compiler is it has got to keep things in a format that can be passed to unmanaged code, that is the primary complication for strings. But with some effort a lot could be done with conversions to make at least string constants more C# friendly.

There was issue #28 that asked basically for a 'safe' version of the runtime (compile the runtime as managed code). But a lot more needs to be done to make that a reality.

@GitMensch, @claunia I'm not sure what 'legal C#' means. If you mean code the C# compiler can compile when not in unsafe mode, that is not a trivial task because it requires the C runtime to be compiled as managed code. The problem, as I recall, is that some of the constructs needed to support the PInvokes down to unmanaged code are considered 'unsafe' by the C# compiler (the unmanaged code typically being msvcrt.dll at this point).

But maybe you mean something other than 'safe' code. If so let me know and I'll see if it is possible in the short term.

Anyway since you don't mind waiting I'll find a future milestone to put this in. I'm fine with doing it just cannot do it immediately!

LADSoft avatar May 20 '20 16:05 LADSoft

@LADSoft I mean something different than "unsafe". I got szip and beha to compile (I had to use stubs for unix functions, but I don't care, that can be postprocessed), then passed the output to JetBrains, which added comments about some things. When I copied the result back to Rider, it complained that it was not valid C#. I will paste the decompiled output that fails later today.

claunia avatar May 20 '20 16:05 claunia

So this is the output from Jetbrains DotPeek:

  public static unsafe void ac_out(ushort low, ushort high, ushort tot)
  {
    uint num = (uint) Module.L_0_d21cc7f7 - (uint) Module.L_1_d21cc7f7 + 1U;
    Module.L_0_d21cc7f7 = (ushort) ((uint) (ushort) (num * (uint) high / (uint) tot - 1U) + (uint) Module.L_1_d21cc7f7);
    Module.L_1_d21cc7f7 += (ushort) (num * (uint) low / (uint) tot);
    if ((((int) Module.L_0_d21cc7f7 ^ (int) Module.L_1_d21cc7f7) & 32768) == 0)
    {
      Module.L_5_d21cc7f7 <<= 1;
      if (((int) Module.L_1_d21cc7f7 & 32768) != 0)
        Module.L_5_d21cc7f7 |= (short) 1;
      if (((int) Module.L_5_d21cc7f7 & 256) != 0)
      {
        // ISSUE: cast to a reference type
        // ISSUE: explicit reference operation
        ^(sbyte&) ((IntPtr) &Module.ob + Module.obl) = (sbyte) (byte) ((uint) Module.L_5_d21cc7f7 & (uint) byte.MaxValue);
        ++Module.obl;
        if (Module.obl == 8192)
          Module.bwrite();
        Module.L_5_d21cc7f7 = (short) 1;
      }
      while (Module.L_3_d21cc7f7 != (short) 0)
      {
        --Module.L_3_d21cc7f7;
        Module.L_5_d21cc7f7 <<= 1;
        if (((int) ~Module.L_1_d21cc7f7 & 32768) != 0)
          Module.L_5_d21cc7f7 |= (short) 1;
        if (((int) Module.L_5_d21cc7f7 & 256) != 0)
        {
          // ISSUE: cast to a reference type
          // ISSUE: explicit reference operation
          ^(sbyte&) ((IntPtr) &Module.ob + Module.obl) = (sbyte) (byte) ((uint) Module.L_5_d21cc7f7 & (uint) byte.MaxValue);
          ++Module.obl;
          if (Module.obl == 8192)
            Module.bwrite();
          Module.L_5_d21cc7f7 = (short) 1;
        }
      }
      Module.L_1_d21cc7f7 <<= 1;
      Module.L_0_d21cc7f7 <<= 1;
      for (Module.L_0_d21cc7f7 |= (ushort) 1; (((int) Module.L_0_d21cc7f7 ^ (int) Module.L_1_d21cc7f7) & 32768) == 0; Module.L_0_d21cc7f7 |= (ushort) 1)
      {
        Module.L_5_d21cc7f7 <<= 1;
        if (((int) Module.L_1_d21cc7f7 & 32768) != 0)
          Module.L_5_d21cc7f7 |= (short) 1;
        if (((int) Module.L_5_d21cc7f7 & 256) != 0)
        {
          // ISSUE: cast to a reference type
          // ISSUE: explicit reference operation
          ^(sbyte&) ((IntPtr) &Module.ob + Module.obl) = (sbyte) (byte) ((uint) Module.L_5_d21cc7f7 & (uint) byte.MaxValue);
          ++Module.obl;
          if (Module.obl == 8192)
            Module.bwrite();
          Module.L_5_d21cc7f7 = (short) 1;
        }
        Module.L_1_d21cc7f7 <<= 1;
        Module.L_0_d21cc7f7 <<= 1;
      }
    }
    for (; ((int) Module.L_1_d21cc7f7 & 16384) != 0 && ((int) Module.L_0_d21cc7f7 & 16384) == 0; Module.L_0_d21cc7f7 |= (ushort) 32769)
    {
      ++Module.L_3_d21cc7f7;
      Module.L_1_d21cc7f7 <<= 1;
      Module.L_1_d21cc7f7 &= (ushort) short.MaxValue;
      Module.L_0_d21cc7f7 <<= 1;
    }
  }

and this is the original code

static U16B h,l,v;
static S16B s;
static S16B gpat,ppat;

#define BLOCKLEN 	8192

#define putbyte(c) {ob[obl++]=(c);if(obl==BLOCKLEN)bwrite();}

#define putbit(b) 	{ ppat<<=1;				\
			  if (b) ppat|=1;			\
			  if (ppat&0x100) {			\
				putbyte(ppat&0xff);		\
				ppat=1;				\
			  }					\
			}

void ac_out(U16B low, U16B high, U16B tot) {
    
    register U32B r;
    
    r=(U32B)(h-l)+1;
    h=(U16B)(r*high/tot-1)+l;
    l+=(U16B)(r*low/tot);
    if (!((h^l)&0x8000)) {
	putbit(l&0x8000);
	while(s) {
	    --s;
	    putbit(~l&0x8000);
	}
	l<<=1;
	h<<=1;
	h|=1;
	while (!((h^l)&0x8000)) {
	    putbit(l&0x8000);
	    l<<=1;
	    h<<=1;
	    h|=1;
	}
    }
    while ((l&0x4000)&&!(h&0x4000)) {
	++s;
	l<<=1;
	l&=0x7fff;
	h<<=1;
	h|=0x8001;
    }
}

Comments were added by DotPeek itself, and those lines are precisely the ones that cannot be compiled back. This is what matters; everything else is nice to have, but can be fixed in preprocessing.

Interestingly enough it always happens in the defined macros.
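For reference, the two macros can already be written as ordinary functions on the C side. This is only a sketch under assumptions: the type of `ob`, the initial value of `ppat`, and the stand-in `bwrite()` (which here just counts flushes and resets `obl`) are guesses for illustration, not taken from the szip source. `flushes` and `putbit_demo` are hypothetical names.

```c
#include <assert.h>

#define BLOCKLEN 8192

static unsigned char ob[BLOCKLEN]; /* output buffer (element type assumed) */
static int obl;                    /* number of bytes currently in ob */
static short ppat = 1;             /* bit accumulator; the 1 is a sentinel bit (assumed start value) */
static int flushes;                /* hypothetical counter used by the stand-in bwrite() */

/* Stand-in for the real bwrite(): counts the flush and empties the buffer. */
void bwrite(void)
{
    ++flushes;
    obl = 0;
}

/* Function form of the putbyte(c) macro. */
void putbyte(int c)
{
    ob[obl++] = (unsigned char)c;
    if (obl == BLOCKLEN)
        bwrite();
}

/* Function form of the putbit(b) macro: shift one bit in and emit a
   byte once the sentinel reaches bit 8. */
void putbit(int b)
{
    ppat <<= 1;
    if (b)
        ppat |= 1;
    if (ppat & 0x100) {
        putbyte(ppat & 0xff);
        ppat = 1;
    }
}

/* Pushes the bit pattern 10101010 and checks that exactly one byte comes out. */
void putbit_demo(void)
{
    obl = 0;
    ppat = 1;
    flushes = 0;
    for (int i = 0; i < 8; ++i)
        putbit(i % 2 == 0);
    assert(obl == 1);
    assert(ob[0] == 0xAA);
    assert(ppat == 1);
    assert(flushes == 0);
}
```

With functions instead of macros, the buffer store appears once in the generated IL instead of being repeated at every expansion site, so there would be fewer places for a decompiler to stumble over. Whether that alone avoids the two flagged lines is not something this sketch can show.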

Sidenote: in the future it would be nice to generate a PDB file; this would allow DotPeek to show proper variable names instead of L_1_whatever. Even comments can be recovered from PDB files.

claunia avatar May 20 '20 18:05 claunia

Oh, full source code is at http://www.claunia.com/files/beha10s.zip It expects to be compiled on BeOS so some POSIX functions need to be stubbed to link.

claunia avatar May 20 '20 18:05 claunia

This is a manual conversion from C to C# of that function (I do not expect the output to be the same, I'm not crazy, but I think it is an example of how code macros could maybe just be converted into methods, if that's at all possible).

        static ushort h, l, v;
        static short s;
        static short gpat, ppat;

        const int BLOCKLEN = 8192;

        static byte[] ob;
        static int obl;

        static void putbyte(ushort c)
        {
            ob[obl++] = (byte) (c);
            if (obl == BLOCKLEN) bwrite();
        }

        static void putbit(ushort b)
        {
            ppat <<= 1;
            if (b != 0) ppat |= 1;
            if ((ppat & 0x100) != 0)
            {
                putbyte((ushort) (ppat & 0xff));
                ppat = 1;
            }
        }

        void ac_out(ushort low, ushort high, ushort tot)
        {
            uint r;

            r = (uint) (h - l) + 1;
            h = (ushort) ((r * high / tot - 1) + l);
            l += (ushort) (r * low / tot);
            if (!(((h ^ l) & 0x8000) != 0))
            {
                putbit((ushort) (l & 0x8000));
                while (s != 0)
                {
                    --s;
                    putbit((ushort) (~l & 0x8000));
                }

                l <<= 1;
                h <<= 1;
                h |= 1;
                while (!(((h ^ l) & 0x8000) != 0))
                {
                    putbit((ushort) (l & 0x8000));
                    l <<= 1;
                    h <<= 1;
                    h |= 1;
                }
            }

            while (((l & 0x4000) != 0) && !((h & 0x4000) != 0))
            {
                ++s;
                l <<= 1;
                l &= 0x7fff;
                h <<= 1;
                h |= 0x8001;
            }
        }

claunia avatar May 20 '20 18:05 claunia

@claunia ok let me see what I can do about the problems DotPeek references. Probably there is another way to do what is being done there... I'll also see what I can do about using more reasonable names than the L_XXX stuff occil is doing. When I designed this I wasn't thinking in terms of human consumption but rather code generation.

I don't think I can turn macros into functions in the short term, but can look at that in the longer term if you would like...

LADSoft avatar May 20 '20 22:05 LADSoft

> I don't think I can turn macros into functions in the short term, but can look at that in the longer term if you would like...

I'd like that, but "obviously" you'd have to create one for each different argument type (given a sample that uses int and char...). Please create a FR issue for this that may be done "sometime in the future, possibly 2021 or later".

"Using more reasonable names" would be nice, too, and is something that may be done in addition to the PDB generation which @claunia asked for and which is a general issue that already exists.

In any case I suggest that @claunia or @LADSoft renames this issue as it isn't about generating C# instead of IL any more, is it?

GitMensch avatar May 20 '20 22:05 GitMensch

@gitmensch thanks I forgot to comment on the PDB issue. We have an issue for PDBs already #252, which is to get done in milestone5. I had offered to fix the names in the il code though, so that dotpeek will at least give readable text.

I'm undecided whether to change the name of this issue, since I'm planning on doing what the title says at a later date we'll still need a similar one. I might just make two new issues for the things we talked about here and update the PDB issue with the new info as well.

LADSoft avatar May 21 '20 00:05 LADSoft

@claunia I had to bracket the new types in types.h with #define MY_UNSIGNED_TYPES ... new types #endif

because it broke the build sigh.

Also you can make szip compile with occil by adding #define setmode _setmode and removing your stub.

LADSoft avatar May 21 '20 00:05 LADSoft

So now with the new issue title the one problem I see with performing this change is that even for regularly compiled C# dotPeek has major problems with creating re-compilable binaries (particularly with extension methods IIRC). Even though dotPeek is the best C# decompiler I know of, it still struggles with not outputting clear names for everything. So I suggest that we get it "GOOD ENOUGH TO READ" (e.g. macros are replaced with constexpr -> get compiled to runtime functions).

While I agree that it does make sense to make it so that dotPeek can decompile our code via these methods, I do not think (all-in-all considering how dotPeek is closed source and we won't be able to figure out exactly why certain names are just unicode spaces), it is worth it to spend a great deal of time on the name issue.

My 2c on my previous usages of dotPeek from back when the C# runtime was closed source instead of open and I was mucking around trying to figure out what went on internally everywhere.

chuggafan avatar May 21 '20 12:05 chuggafan

@chuggafan, Based on the data available in the IL I don't know if it would be possible to fix function argument names but it probably wouldn't be possible to fix function local variable names. At least unless DotPeek handles reading the PDB. The PDB is being left for another day though... But I think it may be worth it to at least shoot for fixing the static names... the ones that start L_ and have a long string of numbers on the end were probably generated by the compiler so I have control over them. That at least gives something more than what we've got.

LADSoft avatar May 21 '20 16:05 LADSoft

Yeah, that's a reasonable goal. The problem I was remembering happened with the actual function names themselves, not even their arguments. I think it had to do with constructors or something of the sort, but it's been about 3 years since I've used dotPeek heavily and a lot could have changed in the meantime.

chuggafan avatar May 21 '20 17:05 chuggafan

Yeah, the static and dynamic constructors are named weirdly so the IL reader can find them; DotPeek would have to figure out what class they go with to figure out the real names. The static constructors are mostly invisible in C#, so they don't even have a name. DotPeek would have to do real work to associate variables with their static construction.

LADSoft avatar May 21 '20 19:05 LADSoft

DotPeek certainly can read the PDBs.

claunia avatar May 23 '20 02:05 claunia

cool.

LADSoft avatar May 24 '20 01:05 LADSoft

I've got something workable, my working version of occil generates code csc can recompile. But a small amount of search and replace has to be used on the dotpeek output to get rid of IntPtr and UIntPtr casts... I don't know a way to get dotpeek to elide them.

The big change was pointers sometimes have to be pinned... but there are a variety of smaller changes that were made to clean up code generation.

Right now I'm trying to get szip to run properly when recompiled with csc. I think it is almost there though...

LADSoft avatar May 24 '20 21:05 LADSoft

had to rework since not enough stuff was being accessed with 'fixed' modifier. There are problems with passing things down via pinvoke at the moment... which got me thinking maybe I should digress again and at least fix string constants to be readable...

Meanwhile there are a couple of bugs in DotPeek that I can't work around.... will probably try to find a way to report them when I get this done...

LADSoft avatar May 27 '20 13:05 LADSoft

Before wasting time on a proprietary decompiler, I suggest going with something like https://github.com/icsharpcode/ILSpy

GitMensch avatar May 27 '20 14:05 GitMensch

@GitMensch I should clarify the issues with DotPeek are quite minor; a missing cast in one place (even when I explicitly put the cast into the IL code stream) and in one place it messes up the end of a 'fixed' statement by saying there is a misplaced 'unpin' instruction. The latter I verified by compiling the decompiled code with CSC then decompiling it again... same results. In one case add the cast and in the other remove the offending line of code; both are flagged as invalid by CSC in any case...

My problem right now is figuring out why fprintf doesn't work when compiled from CSC and I don't have any reason to believe that is also a DotPeek bug... So I don't have much problem continuing with DotPeek.

I'll run it through ILSPY though too to see what happens :)

LADSoft avatar May 27 '20 16:05 LADSoft

I got the compiler to generate code that wouldn't inspire DotPeek to generate IntPtr references, and figured out that I need to specify /platform:x86 when compiling the decompiled code with CSC. szip now works halfway: it compresses but doesn't decompress. Tonight I will try this with ILSpy to see if that works any better.

LADSoft avatar May 28 '20 17:05 LADSoft

the decompress problem was due to another bug in DotPeek (which coincidentally is also in ILSPY) but I worked around it by pulling out a compiler optimization that isn't really needed for MSIL. OCC does things that CSharp compilers would never do, apparently...

ILSPY also has problems, the two immediately obvious are it doesn't support 'fixed' statements, which are essential for this. And it doesn't coalesce array references back into array references either, which is another source of IntPtr references...
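As background on the array-reference point: in C, indexing and base-plus-offset pointer arithmetic are the same operation, so a backend may emit either shape and the decompiler has to recognise the second one to print it as `a[i]` again. A minimal sketch with hypothetical function names (this is not occil's actual code generation):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store written with array indexing, as in the C source. */
void store_indexed(char *buf, size_t i, char value)
{
    buf[i] = value;
}

/* The same store written as base address plus offset, then a dereference.
   This is roughly the shape that showed up in the decompiled output. */
void store_pointer(char *buf, size_t i, char value)
{
    *(buf + i) = value;
}

/* Both forms must leave the buffers in the same state. */
void indexing_demo(void)
{
    char a[4] = { 0 };
    char b[4] = { 0 };

    store_indexed(a, 2, 'x');
    store_pointer(b, 2, 'x');

    assert(memcmp(a, b, sizeof a) == 0);
    assert(&a[2] == a + 2);
}
```

Since the two forms are interchangeable in C, recovering the indexed form is purely a pattern-matching job for the decompiler.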

I reported bugs both to the DotPeek team and the ILSPY team.

Tomorrow I'm going to see about adding support for using MSIL strings to make the code more readable then we can close this.

LADSoft avatar May 29 '20 00:05 LADSoft