Unable to read any UTF-16 string

Open knenkne opened this issue 1 year ago • 9 comments

Hey! I've been using pymem lately and found that I can't read any UTF-16 strings. Let's take a look at the byte representation of the string Hello:

UTF-8

0x48 0x65 0x6C 0x6C 0x6F

UTF-16-LE

0x48 0x00 0x65 0x00 0x6C 0x00 0x6C 0x00 0x6F 0x00

As you can see, UTF-16 strings contain extra zero bytes, and that's how the encoding works, but the current solution probably breaks because it truncates the buffer at the first zero byte it finds:

def read_string(handle, address, byte=50, encoding='UTF-8'):
    buff = read_bytes(handle, address, byte)
    i = buff.find(b'\x00')
    if i != -1:
        buff = buff[:i]
    buff = buff.decode(encoding)
    return buff

Maybe we can skip this truncation for UTF-16. What do you guys think?
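
The failure can be reproduced without attaching to a process. The sketch below applies the same truncation logic to an in-memory buffer; `truncate_and_decode` is a hypothetical stand-in for the body of read_string, not a pymem function.

```python
def truncate_and_decode(buff: bytes, encoding: str = "UTF-8") -> str:
    # Same logic as the read_string body quoted above, minus the memory read.
    i = buff.find(b"\x00")
    if i != -1:
        buff = buff[:i]
    return buff.decode(encoding)


utf8 = "Hello".encode("utf-8") + b"\x00"
utf16 = "Hello".encode("utf-16-le") + b"\x00\x00"

print(truncate_and_decode(utf8))  # Hello

try:
    truncate_and_decode(utf16, "utf-16-le")
except UnicodeDecodeError as exc:
    # The buffer was cut down to b"H", which is only half a UTF-16 code unit.
    print("decode failed:", exc.reason)
```

For the UTF-16 buffer, the first zero byte sits at offset 1, so only a single byte is left to decode.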

knenkne avatar Dec 06 '24 16:12 knenkne

hmm perhaps we could do something like

def read_string(handle, address, byte=50, encoding='UTF-8', search_bytes=b'\x00'):
    buff = read_bytes(handle, address, byte)
    i = buff.find(search_bytes)
    ...

I don't see many null terminated wide strings but I assume they usually end with \x00\x00?
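
A null-terminated UTF-16 string does end with a zero code unit, i.e. \x00\x00, but a plain find for those two bytes can match across a character boundary. The sketch below compares it with an aligned search; `find_terminator` and its `unit` parameter are hypothetical names used for illustration, not part of pymem.

```python
def find_terminator(buff: bytes, unit: int = 1) -> int:
    """Return the offset of the first all-zero code unit aligned to `unit` bytes, or -1."""
    zero = b"\x00" * unit
    for offset in range(0, len(buff) - unit + 1, unit):
        if buff[offset:offset + unit] == zero:
            return offset
    return -1


# "Hello" in UTF-16-LE, a two-byte terminator, then unrelated bytes.
buff = "Hello".encode("utf-16-le") + b"\x00\x00" + b"\xff\xff"

naive = buff.find(b"\x00\x00")           # matches the high byte of "o" plus half the terminator
aligned = find_terminator(buff, unit=2)  # only checks offsets 0, 2, 4, ...

print(naive, aligned)                      # 9 10
print(buff[:aligned].decode("utf-16-le"))  # Hello
```

Slicing at the naive offset would leave an odd number of bytes, which cannot be decoded as UTF-16.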

StarrFox avatar Dec 07 '24 13:12 StarrFox

Pardon me, but why do we even need this empty byte search?

knenkne avatar Dec 07 '24 22:12 knenkne

to read null-terminated strings

StarrFox avatar Dec 07 '24 22:12 StarrFox

Gotcha. I would suggest not adding an extra argument, since you would need to pass it every time you work with UTF-16 strings, and using rfind instead of find. Do null-terminated strings always have the zero byte at the end? That way we would get better performance, since we wouldn't need to go through all the bytes to the end, and both UTF-16-LE and UTF-16-BE would work if the string is terminated.

UPDATE: To support wide strings, maybe we can first search for \x00\x00 and only then for \x00.

knenkne avatar Dec 07 '24 22:12 knenkne

rfind wouldn't work, unfortunately, since the general strategy we're using here is to read some arbitrary amount of data and then search for the null byte (\x00). For example, given the data \x68\x69\x00\xff\x77\x33\x65\x00\x33\x22, rfind would erroneously return the position of the second \x00, giving us \x68\x69\x00\xff\x77\x33\x65, when what we actually want is \x68\x69, aka "hi".
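
The difference is easy to check on that exact buffer. This is only an illustration of bytes.find versus bytes.rfind on an in-memory value, not pymem code.

```python
data = b"\x68\x69\x00\xff\x77\x33\x65\x00\x33\x22"

first = data.find(b"\x00")   # first terminator: end of the string we want
last = data.rfind(b"\x00")   # last zero byte: belongs to unrelated memory

print(first, data[:first])  # 2 b'hi'
print(last, data[:last])    # 7 b'hi\x00\xffw3e'
```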

StarrFox avatar Dec 08 '24 03:12 StarrFox

But what are the other bytes after \x68\x69? Are they string-related or just random? I think I don't understand the main idea behind read_string. From my POV, it should be a method that accepts the exact address and length of a string and strips the null terminator, but from the example above it looks more like a string search in bytes.

knenkne avatar Dec 08 '24 12:12 knenkne

It's meant to read null-terminated strings whose length we don't know. We could add another method for reading strings whose size we do know, though, something like

def read_sized_string(handle, address, size, encoding='UTF-8'):
    data = read_bytes(handle, address, size)
    return data.decode(encoding)

StarrFox avatar Dec 08 '24 22:12 StarrFox

Yeah, a new method would be a good option. Do you have spare time to implement it, or do you need help?

knenkne avatar Dec 08 '24 22:12 knenkne

This would have the issue that, if the encoding uses more than one byte per character, the size argument would have to be twice (or more) the actual length of the string, which would be unintuitive (but then read_string has this issue as well).
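
To make the byte-versus-character gap concrete, here is a small sketch; `encoded_size` is a hypothetical helper for illustration, not a pymem function.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes `text` occupies in `encoding`, without a terminator."""
    return len(text.encode(encoding))


# Character count versus byte count for the same five-character string.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, len("Hello"), encoded_size("Hello", encoding))
```

A caller who knows the character count would still have to convert it to a byte count before passing it as size, and for UTF-8 that count also depends on which characters the string contains.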

Another potential solution would be to re-write the current function this way:

def read_string(handle, address, byte=50, encoding='UTF-8'):
    buff = read_bytes(handle, address, byte)
    str_ = buff.decode(encoding, errors='backslashreplace')
    return str_.split('\x00', maxsplit=1)[0]

This should work for encodings like UTF-16, but it still has the issue that the number of bytes read may not be sufficient for the length of the string.
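
The decode-then-split idea can be tried on plain buffers without a process handle. In this sketch, `decode_until_null` is a hypothetical name that takes the already-read bytes in place of the read_bytes call.

```python
def decode_until_null(buff: bytes, encoding: str = "UTF-8") -> str:
    # Decode first, so the terminator is matched as a character rather than a byte.
    text = buff.decode(encoding, errors="backslashreplace")
    return text.split("\x00", maxsplit=1)[0]


wide = "Hello".encode("utf-16-le") + b"\x00\x00" + "AB".encode("utf-16-le")
narrow = b"\x68\x69\x00\xff\x77\x33\x65\x00\x33\x22"

print(decode_until_null(wide, "utf-16-le"))  # Hello
print(decode_until_null(narrow))             # hi
```

One trade-off is that the whole buffer gets decoded, including whatever bytes follow the terminator, which is why the error handler is needed.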

monkeyman192 avatar Dec 08 '24 22:12 monkeyman192