
The Unicode normalization step of the Python interpreter can be abused

Open wasi-master opened this issue 2 years ago • 2 comments

This is basically the suggestion in this Reddit comment.

From this article:

Python always applies NFKC normalization to characters. Therefore, two distinct characters may actually produce the same variable name. For example:

>>> ª = 1 # FEMININE ORDINAL INDICATOR
>>> a # LATIN SMALL LETTER A (i.e., ASCII lowercase 'a')
1
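To check the quoted behaviour without typing the look-alike character by hand, here is a minimal sketch using only the standard library. The `namespace` dict is just an illustration device; the point is that the compiler stores the name under its NFKC-normalized form.

```python
import unicodedata

# U+00AA FEMININE ORDINAL INDICATOR folds to ASCII 'a' under NFKC.
folded = unicodedata.normalize("NFKC", "\u00aa")

namespace = {}
# Assign through the look-alike name; the compiler normalizes the identifier.
exec("\u00aa = 1", namespace)

# The value is reachable through plain ASCII 'a', and the original
# spelling never appears as a key.
print(folded, namespace["a"], "\u00aa" in namespace)
```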

I've generated a mapping of these characters from this URL.
The mapping can be found here, but beware that some characters may not work in Python, because I haven't tested every one of them.
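One way to weed out untested entries is to build the mapping directly from Python's own Unicode tables. The sketch below is an assumption about how that could be done, not the script used for the linked mapping; `build_nfkc_mapping` is a hypothetical helper name. It uses `str.isidentifier()` as a stand-in for what the tokenizer accepts inside an identifier.

```python
import sys
import unicodedata
from collections import defaultdict


def build_nfkc_mapping(targets):
    """Map each target character to the non-ASCII code points NFKC folds onto it.

    Hypothetical helper for illustration; not part of any project in this thread.
    """
    mapping = defaultdict(list)
    for code_point in range(0x80, sys.maxunicode + 1):
        char = chr(code_point)
        folded = unicodedata.normalize("NFKC", char)
        if folded not in targets:
            continue
        # Prefix with '_' so the check covers any position after the first
        # character; symbols such as circled letters are rejected here.
        if ("_" + char).isidentifier():
            mapping[folded].append(char)
    return dict(mapping)


mapping = build_nfkc_mapping(set("abc"))
print(len(mapping["a"]), "look-alikes found for 'a'")
```

The scan covers every code point, so it takes a moment to run. To find characters that are valid as the first character of a name, check `char.isidentifier()` instead.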

I suggest adding an additional flag to enable this behaviour.

I would have done it myself and opened a PR, but I am too busy at the moment.

wasi-master, Dec 30 '21 05:12

That sounds very promising! I like it. I am not sure I will find the time to implement it, but I am open to PRs.

LeviBorodenko, Dec 30 '21 09:12

I actually implemented this in uglier, which is pretty much a copy of this project. In addition to abusing Unicode normalization, it also uses Cyrillic characters (which look a lot like Latin characters) to make all variables look like they have the same identifier.

This:

def add_values(n1, n2):
    return n1 + n2


def add_10_to_string(n):
    return str(add_values(int(n), 10))


num = add_10_to_string("10")
print(num)

turns into:

def ADDVALUES(хxxх, хxхх):


    return хxxх + хxхх



def ADDTOSTRING(НННН):
    return st𝓇(𝕬𝔇𝔇𝔙𝕬𝕷𝓤𝔈𝔖(𝒾𝕟𝑡(НННН), 10))

НННH = 𝕬𝕯𝕯𝕿𝕺𝔖𝕿𝕽𝕴𝕹𝔊('10')
𝓅𝓇𝒾𝕟𝑡(НННH)

(notice it also abuses the normalization for built-ins, using something like 𝒾𝕟𝑡 for the built-in int function)
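The two tricks rely on opposite properties, which a short check makes visible. This is a sketch using only the standard library, with the styled `int` spelling copied from the output above: the styled letters fold to ASCII under NFKC, while a Cyrillic look-alike does not fold at all and so stays a separate identifier.

```python
import unicodedata

fancy_int = "𝒾𝕟𝑡"  # styled spelling copied from the uglier output above
# Each styled letter folds to its ASCII counterpart under NFKC.
folded = unicodedata.normalize("NFKC", fancy_int)

namespace = {}
# The compiler normalizes the name, so the lookup resolves to the built-in int.
exec("result = " + fancy_int + "('10') + 10", namespace)

# Cyrillic 'х' (U+0445) has no compatibility mapping to Latin 'x',
# so names that differ only in that letter remain distinct.
cyrillic_ha = "\u0445"
unchanged = unicodedata.normalize("NFKC", cyrillic_ha) == cyrillic_ha

print(folded, namespace["result"], unchanged)
```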

MonliH, Dec 30 '21 21:12