cfr On identifier renaming

CFR tries to rename some identifiers during decompilation. This is nice because something has to be done, but ends up breaking the code in some ways.

i maintain a dex bytecode patching tool. the tool applies patches to bytecode. a patch is supposed to be written entirely in java, then compiled and dexed, and fed to my tool along with the original bytecode to patch.

many times the patch has to define items that match the name of items in the original bytecode (when you call an original method from your code, when you replace or wrap an original method with your code, etc).

unfortunately javac has more restricted naming than bytecode, so in general obfuscated bytecode cannot be patched in this way.

for example: class a in package a and package a.a clash by name in javac (a.a is both a class and a package), a method named 42 cannot be defined or invoked, etc.
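to make the gap concrete, here is a minimal sketch that asks the JDK whether a bytecode-level name would be accepted by javac as an identifier. `JavacNameCheck` and `isLegalForJavac` are made-up names for illustration; the `SourceVersion` calls are standard JDK API. it only covers single names, not the class/package clash mentioned above.

```java
import javax.lang.model.SourceVersion;

// Hypothetical helper, not part of any tool mentioned in this thread.
final class JavacNameCheck {

    // A name is usable in Java source only if it is syntactically an
    // identifier and is not a reserved word or literal.
    static boolean isLegalForJavac(String name) {
        return SourceVersion.isIdentifier(name) && !SourceVersion.isKeyword(name);
    }

    public static void main(String[] args) {
        // All four are legal member names in bytecode.
        for (String name : new String[] {"a", "42", "class", "my-field"}) {
            System.out.println(name + " -> " + isLegalForJavac(name));
        }
    }
}
```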

i've implemented orthogonal solutions for this. some are:

use a map file to rename the original bytecode identifiers you want to hook or use (packages, classes, members) to something that makes sense in your world. map the original bytecode, apply your patch, then unmap the result.
use the same map file to reverse-map your patch file before applying it. (this produces the same bytecode but outputs obfuscated patching diagnostics; it is only recommended for efficiency when applying released patches on small computers such as cell phones.)
chain map files so that AI-generated deobfuscation maps can be used as your manual map's input. this way when a new version of the original bytecode is released (with totally different obfuscation), you can rerun the AI and most of your manual map file would not need to be touched.
use 'identifier codes' when creating your patch in java and decode them before applying the patch.
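the map/unmap round trip from the first approach in the list above can be sketched roughly as follows. this is not the tool's map file format or code: `RenameMap` and its methods are hypothetical, and names are treated as plain strings rather than as a bytecode object model.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical one-to-one rename map: original (obfuscated) name <-> readable name.
final class RenameMap {
    private final Map<String, String> forward = new HashMap<>();
    private final Map<String, String> inverse = new HashMap<>();

    void put(String original, String readable) {
        // Unmapping is only well defined if the mapping is one-to-one.
        if (forward.containsKey(original) || inverse.containsKey(readable)) {
            throw new IllegalArgumentException(
                    "mapping must be one-to-one: " + original + " -> " + readable);
        }
        forward.put(original, readable);
        inverse.put(readable, original);
    }

    // Applied to the original bytecode before patching.
    String map(String name) {
        return forward.getOrDefault(name, name);
    }

    // Applied to the patched result to restore the original names.
    String unmap(String name) {
        return inverse.getOrDefault(name, name);
    }
}
```

names that are not in the map pass through unchanged in both directions, which is what lets the patch refer to framework classes without listing them.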

the last point is relevant to CFR. identifier codes are simply encoded bytecode identifiers that can be decoded on a bytecode-to-bytecode pass AFTER javac got the crap compiled.

identifier codes are composed of two parts 1) an optional 'label' part that is useful for the developer to document what the hell the identifier represents and gets discarded during decode, and 2) a required escaped part that, when decoded, will become the identifier.

identifier codes are described here: https://github.com/DexPatcher/dexpatcher-tool/blob/v1.8.0-beta1/test/patch/src/main/java/test/Main.java#L662-L735
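a minimal decoder sketch for the two-part structure described above. it assumes a simplified shape `__<label>_$$_<name>__`, inferred from the examples later in this post, and it ignores escape sequences entirely, so it is not the real syntax from the linked file. `IdentifierCodeSketch` and `decode` are hypothetical names.

```java
// Simplified, assumption-laden sketch; see the linked Main.java for the real rules.
final class IdentifierCodeSketch {
    static final String PREFIX = "__";
    static final String SUFFIX = "__";
    static final String MARKER = "_$$_";

    // Returns the identifier part of a code, or the input unchanged
    // if it does not look like a code under the assumed shape.
    static String decode(String s) {
        int marker = s.indexOf(MARKER);
        if (marker < PREFIX.length() || !s.startsWith(PREFIX) || !s.endsWith(SUFFIX)) {
            return s;
        }
        int start = marker + MARKER.length();
        int end = s.length() - SUFFIX.length();
        if (end < start) {
            return s; // marker overlaps the suffix: not a code
        }
        // Everything before the marker is the label and is discarded.
        return s.substring(start, end);
    }
}
```

the label never affects the result, which is why it can carry free-form documentation for the developer.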

i think maybe CFR should not rename during decompilation. instead a bytecode-2-bytecode pass would encode whatever needs to be encoded into legal java identifiers before the bytecode is fed to CFR. that doesn't mean that there should be a different tool or intermediate files: the mapping could be done in memory and maybe on demand (the way it is done in my tool, instead of cached) on the bytecode object model, before the object model is passed to the decompiler. the point is that they are distinct independent operations (if not passes). or the object model can be output to bytecode again. (...sigh... but you lack bytecode writing...)

the decompiled files can then be fed to javac and then the produced bytecode could be run through the identifier code decoder to recover the original naming of everything. (you could use the same bytecode transform codebase... if you had used ASM. but somebody else could write the decoder, the decoder is trivial anyway.)

this decouples renaming from decompiling. and the fact is, renaming is an operation that will always need to be tailored to the user, because many times if renaming is required, deobfuscation is required too. it is not enough to make it javac compatible: if you are renaming, then you want it to make as much sense as possible.

the decoder is trivial. what about the encoder (ie: the renamer)?

well it sure needs to escape sequences of its own 'code marker'. so if the code marker is _$$_ and an identifier is called __$$_iHateYou__ then yes, you need to encode/escape that identifier, that is obvious.

what else needs/could be done? a lot. to dissuade you from integrating this ever growing target into CFR, i'll show you the list of options i provide in my VERY FIRST version of my tool that supports identifier encoding. even if the CFR executable includes everything, i think the codebases should be separate, both ideally operating on a bytecode object model in memory (and the transformed code should ideally be writable back to bytecode too).

the release info is in the link, but i'll copy the relevant parts here and comment: https://github.com/DexPatcher/dexpatcher-tool/releases/tag/v1.8.0-beta1

identifier encode options:

--encode-source                encode identifiers in source

=> enable this transform. this includes encoding identifiers in which the code marker already occurs.

--encode-map <file>            encode map file (repeatable option)

=> if you have an identifier map file, instead of just mapping the idents to the new names, you can encode the old names with a label that includes the mapped name. this way you can use potentially erroneous AI deobfuscation so that decompilation shows inferred info, but recompiling and decoding yields the original idents. changes are only temporary for code analysis, and the analyst can simultaneously see inferred and obfuscated names.

--invert-encode-map            use inverse of encode map file

--escape-non-ascii             escape non-ASCII characters

=> some obfuscators produce any UTF-16 crap. get that back to useful chars.

--escape-non-latin             escape non-ASCII/Latin-1 characters

=> same but for latin1 users.

--no-ascii-escapes             do not output ASCII escapes
--no-code-point-escapes        do not output code point escapes

=> which is your preferred escaping syntax?
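as a rough illustration of what escaping non-ASCII characters means, here is a one-way sketch that replaces every code point above 0x7F with a fixed-width hex escape. the `$p` + 6 hex digits form is invented for this example and is not the tool's escape syntax; `NonAsciiEscaper` is a hypothetical name. a real encoder would also have to escape its own escape character so that decoding is unambiguous.

```java
// Hypothetical escape syntax: "$p" followed by the code point as 6 hex digits.
final class NonAsciiEscaper {

    static String escapeNonAscii(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i); // handles surrogate pairs
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("$p").append(String.format("%06X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

the Latin-1 variant would differ only in the threshold (0x100 instead of 0x80).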

--obfuscated-types <ptrn>      pattern for binary type names
                               (form: '[<pkg>/...][<cls>$...]<cls>')

=> which types are obfuscated? some obfuscators put all obfuscated classes in a single package, so you can configure that here.

--obfuscated-packages <ptrn>   pattern for non-qualified package names
--obfuscated-classes <ptrn>    pattern for non-qualified class names
                               (form: '[<cls>$...]<cls>')
--obfuscated-members <ptrn>    pattern for member names

=> otherwise you can match on the non-qualified names of stuff, including nested classes.

but in general, the tool can "detect" when a class is obfuscated without help: dex files are monolithic: everything is there except for the android framework. so all names defined in the dex are considered obfuscated, while undefined names must come from the framework, so those are not.
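the default detection rule above boils down to a set membership test. a minimal sketch, assuming type names are available as strings; `ObfuscationDetector` is a hypothetical name and this is not the tool's implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical detector: a type counts as obfuscated iff the dex defines it.
final class ObfuscationDetector {
    private final Set<String> definedInDex;

    ObfuscationDetector(Set<String> definedInDex) {
        this.definedInDex = new HashSet<>(definedInDex);
    }

    boolean isObfuscated(String typeName) {
        // Anything referenced but not defined must come from the framework.
        return definedInDex.contains(typeName);
    }
}
```

the pattern options then only need to refine this default, for example by excluding a library that is bundled in the dex but not obfuscated.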

--encode-all-classes           encode all class names

=> you might want to encode all classes. that way they don't clash with package names, etc.

--encode-obfuscated-packages   encode obfuscated package names
--encode-obfuscated-classes    encode obfuscated class names
--encode-obfuscated-members    encode obfuscated member names

=> obfuscated idents (say by default, all internally defined classes) are not encoded by default. they are simply detected (say by using pattern matching) for use in other decisions (see below). but you can choose to encode everything obfuscated.

--encode-reserved-chars        encode names with reserved characters
--encode-reserved-words        encode names matching reserved words

=> make these idents legal for javac. (refers to JLS reserved words.)

--encode-class-hints           encode type hints in classes

=> class hints use the inheritance tree to infer info of a class. you've got class b, but that doesn't say much. it extends class a which extends framework class Activity. so class hints would (at least) encode b to something like __C_Activity_$$_b__, with type C meaning an outer class, and this is of course visible anywhere in code where b is referenced. this would happen only if b and a are obfuscated and Activity isn't. this happens by default if no pattern matching is provided. it also explains why obfuscation detection is separate from encoding: by default encoding is done only if there is something interesting to say about the ident. patterns can refine this. regarding hints, if the inheritance tree ends with Object, then no extra info is gathered. but the interface implementation graph is tried next, which could lead to multiple hints: each interface type is similarly resolved, then the valid hints are ordered and concatenated into the encoded ident.
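the superclass part of class hinting can be sketched like this. it uses simple class names, skips the interface fallback and nested classes, and assumes the inheritance map has no cycles; `ClassHintSketch` and `encodeWithHint` are hypothetical names, and only the output shape `__C_Activity_$$_b__` is taken from the text above.

```java
import java.util.Map;
import java.util.Set;

// Simplified sketch of superclass-based hints; not the tool's implementation.
final class ClassHintSketch {

    // superOf maps a class to its direct superclass; obfuscated holds the
    // classes considered obfuscated. Returns cls unchanged when no hint applies.
    static String encodeWithHint(String cls, Map<String, String> superOf, Set<String> obfuscated) {
        if (!obfuscated.contains(cls)) {
            return cls;
        }
        // Walk up until the first superclass that is not obfuscated.
        String hint = superOf.get(cls);
        while (hint != null && obfuscated.contains(hint)) {
            hint = superOf.get(hint);
        }
        // Reaching Object (or running out of data) says nothing useful.
        if (hint == null || hint.equals("Object")) {
            return cls;
        }
        return "__C_" + hint + "_$$_" + cls + "__";
    }
}
```

with b extending a, a extending Activity, and only a and b obfuscated, `encodeWithHint("b", ...)` gives `__C_Activity_$$_b__`, while a class that extends Object directly is left alone.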

--encode-member-hints          encode type hints in members

=> same for members, but different. field types or method return types are used to decorate obfuscated member names. so field x becomes __f_OutputStream_$$_x__. of course the type of a member could be obfuscated, so resolution happens as above, and the encoding is applied only if a hint is produced.

--encode-member-type           encode member type in members

=> encode the unhinted, raw, possibly native, type of a member in the member name. this option solves the JLS clashing of bytecode members such as int a; float a; int a(); float a();.

--no-identifier-type           do not encode identifier type

=> don't include the type (p, C, f, m, nested class info) in the encoding. if you use this option, java name clashes might happen.

--no-multiple-hints            only allow unique type hints

=> don't hint if superclass hinting fails and interface hinting produces more than one result.

--no-nested-classes            disable nested class processing

=> pretend nested classes are just outer classes with weird names. most obfuscators never produce nested classes anyway.

--ignored-hint-type <type>     fully qualified name of type
                               (use '-' to remove defaults)
                               (repeatable option)
--ignored-hint-types <ptrn>    pattern for binary type names
                               (form: '[<pkg>/...][<cls>$...]<cls>')

=> so yeah, we don't want to hint on stuff like Cloneable, Serializable, etc. some of these come predefined by default.

--encode-compilable            allow recompile of obfuscated code

=> and yessss..... this is you: this crap enables other crap so that, supposedly, the transformed bytecode can be decompiled in a way that javac likes it. and.... after compilation a round of ident code decoding will get the code back into its original shape. what crap gets enabled? i don't remember, but there's the source code there.

takeaways...

decompiling is TOUGH
but renaming is not trivial either
so it is best to separate the codebases: do one thing and do it well
even if both codebases are shipped together as one tool, as i do
not being able to write class files will byte you back every now and then
if you ever plan to do something like this, consider using an existing syntax such as mine (the only one i'm aware of). you can also use my code except it is for dexlib2, but the bulk of it is easy to port. it deals with horrible details such as precalculating the lengths of resulting strings because i'm too much of an efficiency freak. the model is: transforms apply dynamically as you walk the tree and don't consume memory. only diagnostics on bad transforms are cached so that diagnostics aren't repeated. some walkers can check the whole tree on demand at various times to produce early and separate diagnostics during patch development at the cost of performance, if transforms need to be debugged. my code is GPL but i can re-license what you need of it, if anything.
don't reinvent the wheel! if you ever embark on writing class files, give ASM (or whatever) a chance and use an adaptation layer over ASM rather than rolling your own class writer (and reuse next time! :) )

May 13 '20 17:05 Lanchon

...but don't give this too much thought, maybe. this is all a hack, over a hack, over...

cause next when i've some time i'll try to argue that you should stop development of CFR.

lol, yes, it's true, i'm not kidding... i'll post an issue soon.

thanks! :)

May 13 '20 18:05 Lanchon

Ok, so there's a Loooooot to unpack here, and I've had a crazy long day so I won't respond to this now....

But one thing does leap out ;) Please take the below in the humour it's intended.

@leibnitz27 's manifesto for new wheels.

Please remember - this sort of thing is done for fun. And reinventing wheels is fun. If you're anything like me, you spend your day being responsible and not reinventing wheels.

When you reinvent a wheel, you gain understanding of that wheel. The whole reason I did this entire project was because I didn't know java, so rather than read yet another crappy book, I wrote a decompiler.

And there are little upsides too (alongside the above). Consider https://github.com/leibnitz27/cfr/issues/75 - I can't imagine that would have been anywhere near as easy to handle if I didn't have my own lovely wheels ;)

May 13 '20 18:05 leibnitz27

well CFR is an amazing wheel and i think u'll like the ideas in my post simply because they are interesting. i'm not saying you should rewrite CFR to use them, i'm just discussing alternative ways in which things could have been done, because we all learn from other wheels.

on the contrary, i'll try to argue next that you should stop CFR because there are better, newer, more fun wheels to invent :)

and i'll argue it to you because (because of CFR) you are now an absolute expert on java, bytecode, and compilers, and you are just the man for the job. i want you to obsolete the software i've been working on for the last 5 years, and CFR, in a single move :)

i'll write soon!

May 13 '20 19:05 Lanchon

lol it took some time but here it is: https://github.com/leibnitz27/cfr/issues/186

if you decide to go that route, the info in this post becomes obsolete. if you don't, let me know what your thoughts are on this. thanks!

Jun 16 '20 21:06 Lanchon
