Pymem
Unable to read any UTF-16 string
Hey! I've been using pymem lately and found out that I can't read any UTF-16 strings.
Let's look at the byte representation of the string Hello:
UTF-8
0x48 0x65 0x6C 0x6C 0x6F
UTF-16-LE
0x48 0x00 0x65 0x00 0x6C 0x00 0x6C 0x00 0x6F 0x00
As you can see, UTF-16 pads each of these characters with a zero byte. That's how the encoding works, but the current implementation probably breaks because of those zero bytes:
def read_string(handle, address, byte=50, encoding='UTF-8'):
    buff = read_bytes(handle, address, byte)
    i = buff.find(b'\x00')
    if i != -1:
        buff = buff[:i]
    buff = buff.decode(encoding)
    return buff
Maybe we could skip this truncation for UTF-16. What do you guys think?
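To make the failure concrete, here is a minimal sketch that reproduces it without attaching to a process. The helper name `truncate_at_first_null` is hypothetical; it only mirrors the truncation step of the `read_string` quoted above and operates on a fixed buffer instead of memory returned by `read_bytes`.

```python
def truncate_at_first_null(buff: bytes) -> bytes:
    """Mirror the truncation step of the read_string quoted above."""
    i = buff.find(b'\x00')
    return buff[:i] if i != -1 else buff


# "Hello" encoded as UTF-16-LE, followed by a two-byte terminator.
utf16_buff = 'Hello'.encode('utf-16-le') + b'\x00\x00'

# The first zero byte is the high byte of 'H', at index 1.
truncated = truncate_at_first_null(utf16_buff)
print(truncated)  # b'H'

try:
    truncated.decode('utf-16-le')
except UnicodeDecodeError as exc:
    # A single byte is not a complete UTF-16 code unit.
    print('decode failed:', exc.reason)
```

So for ASCII text in UTF-16-LE the buffer is cut after the very first byte, and the decode step then fails on an incomplete code unit.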
hmm perhaps we could do something like
def read_string(handle, address, byte=50, encoding='UTF-8', search_bytes=b'\x00'):
    buff = read_bytes(handle, address, byte)
    i = buff.find(search_bytes)
    ...
I don't see many null terminated wide strings but I assume they usually end with \x00\x00?
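One caveat with passing `b'\x00\x00'` straight to `find`: the match can start at an odd offset, straddling the high byte of the last character and the first byte of the terminator. A possible workaround, sketched below, is to check only offsets aligned to the code-unit size. `find_terminator` and its `width` parameter are hypothetical names for illustration, not part of pymem.

```python
def find_terminator(buff: bytes, width: int = 1) -> int:
    """Return the offset of the first all-zero code unit, or -1.

    width is the code-unit size in bytes: 1 for UTF-8, 2 for UTF-16.
    Only offsets that are multiples of width are checked.
    """
    terminator = b'\x00' * width
    for offset in range(0, len(buff) - width + 1, width):
        if buff[offset:offset + width] == terminator:
            return offset
    return -1


# "Hi" in UTF-16-LE, a two-byte terminator, then unrelated bytes.
buff = b'\x48\x00\x69\x00\x00\x00\xff\x77'

print(buff.find(b'\x00\x00'))        # 3: straddles 'i' and the terminator
print(find_terminator(buff, 2))      # 4: the real terminator
print(buff[:4].decode('utf-16-le'))  # Hi
```

With `width=1` the helper behaves like `buff.find(b'\x00')`, so the same code path could serve single-byte encodings as well.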
Pardon me, but why do we even need this empty byte search?
to read null-terminated strings
Gotcha. I would suggest not adding an extra argument, since you would have to pass it every time you work with UTF-16 strings, and using rfind instead of find. Do null-terminated strings always have the zero byte at the very end? If so, this would perform better, since the search doesn't have to walk through all the bytes to reach the end, and both UTF-16-LE and UTF-16-BE would work as long as the string is terminated.
UPDATE: To support wide strings, maybe we can first search for \x00\x00 and only then for \x00.
rfind wouldn't work, unfortunately, since the general strategy here is to read some arbitrary amount of data and then search for the first null byte (\x00).
For example, given the data \x68\x69\x00\xff\x77\x33\x65\x00\x33\x22, rfind would erroneously return the position of the second \x00, giving us \x68\x69\x00\xff\x77\x33\x65 when what we actually want is \x68\x69, a.k.a. "hi".
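The difference is easy to check on the exact bytes from the example. This is only an illustration on a literal buffer, not a read from process memory:

```python
data = b'\x68\x69\x00\xff\x77\x33\x65\x00\x33\x22'

first = data.find(b'\x00')   # first null byte, the real terminator
last = data.rfind(b'\x00')   # last null byte, inside the trailing garbage

print(first, data[:first])  # 2 b'hi'
print(last, data[:last])    # 7 b'hi\x00\xffw3e'
```

Everything after the first null byte is whatever happens to sit in memory after the string, so only `find` cuts at the right place.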
But what are the other bytes after \x68\x69? Are they related to the string or just random?
I think I don't understand the main idea behind read_string. From my POV it should be a method that accepts the string's exact address and length and strips the null terminator, but from the example above it looks more like a string search in bytes.
It's meant to read null-terminated strings whose length we don't know. We could add another method for reading strings whose size we do know, though, something like:
def read_sized_string(handle, address, size, encoding='UTF-8'):
    data = read_bytes(handle, address, size)
    return data.decode(encoding)
Yeah, a new method would be a good option. Do you have spare time to implement it, or do you need help?
This would have the issue that, for an encoding where each character takes more than one byte, the size argument would have to be two or more times the actual length of the string, which would be unintuitive (but then read_string has this issue as well).
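To illustrate the gap between character count and byte count, here is a small sketch. `byte_length` is a hypothetical helper for this example only; a caller of a size-based reader would need the byte count, not the character count.

```python
def byte_length(text: str, encoding: str) -> int:
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


text = 'Hello'
for encoding in ('utf-8', 'utf-16-le', 'utf-32-le'):
    print(encoding, len(text), byte_length(text, encoding))
# utf-8 5 5
# utf-16-le 5 10
# utf-32-le 5 20

# Non-ASCII text needs more than one byte per character even in UTF-8.
print(byte_length('héllo', 'utf-8'))  # 6
```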
Another potential solution would be to re-write the current function this way:
def read_string(handle, address, byte=50, encoding='UTF-8'):
    buff = read_bytes(handle, address, byte)
    str_ = buff.decode(encoding, errors='backslashreplace')
    return str_.split('\x00', maxsplit=1)[0]
This should work for encodings like utf-16, but it will still have the issue of the number of bytes potentially not being sufficient for the length of the string.
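The decode-then-split approach can be tried out without a live process. In the sketch below, `make_reader` is a hypothetical stand-in for pymem's `read_bytes` backed by a plain bytes object, and `read_string` takes the reader as an extra first argument purely so the example is self-contained; the decoding logic is the same as in the proposal above.

```python
def make_reader(memory: bytes):
    """Return a hypothetical read_bytes stand-in backed by a bytes object."""
    def read_bytes(handle, address, byte):
        return memory[address:address + byte]
    return read_bytes


def read_string(read_bytes, handle, address, byte=50, encoding='UTF-8'):
    # Same logic as the proposal above, with read_bytes passed in for testing.
    buff = read_bytes(handle, address, byte)
    str_ = buff.decode(encoding, errors='backslashreplace')
    return str_.split('\x00', maxsplit=1)[0]


# "Hello" in UTF-16-LE, a two-byte terminator, then unrelated bytes.
memory = 'Hello'.encode('utf-16-le') + b'\x00\x00' + b'\xff\x77\x33\x65'
reader = make_reader(memory)

print(read_string(reader, None, 0, encoding='utf-16-le'))          # Hello
# Too few bytes: the string is silently cut short.
print(read_string(reader, None, 0, byte=6, encoding='utf-16-le'))  # Hel
```

The second call shows the remaining limitation: when `byte` is smaller than the encoded string, the result is truncated without any error.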