javalang
String values don't properly handle unicode escapes
I am using javalang to tokenize files that include Unicode escape sequences. These are correctly tokenized as strings, but item.value is not handled cleanly. Consider the two cases below:
Case 1: builder.append(text, 0, MAX_TEXT).append('\u2026');
Case 2: builder.append(text, 0, MAX_TEXT).append('…');
In both cases, item.value is identical, and I get a UnicodeEncodeError if I try to write item.value to a file. I can catch the error and print successfully using Python like this:
if token_type == 'String':
    try:
        outfile.write(item.value)
    except UnicodeEncodeError:
        outfile.write(item.value.encode('unicode-escape').decode('utf-8'))
However, the Python code above prints the same value for Case 1 and Case 2. I suspect the proper fix is to use raw strings for String token values internally in javalang. Below is an example of raw strings solving the problem.
>>> str1 = '…'
>>> str2 = '\u2026'
>>> print("str1: ",str1," str2:",str2)
str1: … str2: …
>>> str1 == str2
True
>>> str1 = r'…'
>>> str2 = r'\u2026'
>>> print("str1: ",str1," str2:",str2)
str1: … str2: \u2026
>>> str1 == str2
False
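For anyone reproducing this outside the REPL, here is a minimal standalone sketch of the same point. It does not use javalang at all. The helper name to_ascii is made up for illustration, and the in-memory buffer stands in for a real output file. It shows that once the escape has been converted to a character, the unicode-escape fallback produces the same text for both cases, and that writing through an explicit UTF-8 encoder sidesteps the UnicodeEncodeError.

```python
import io

literal = '…'            # character typed directly, as in Case 2
escaped = '\u2026'       # Python resolves the escape when parsing, as in Case 1
raw_escaped = r'\u2026'  # raw string keeps the six source characters

def to_ascii(value):
    """Hypothetical helper: the same fallback as in the report."""
    return value.encode('unicode-escape').decode('ascii')

# After escape processing both forms are the same one-character string,
# so the fallback cannot tell which spelling the source file used.
same_after_decoding = (literal == escaped)
fallback_literal = to_ascii(literal)
fallback_escaped = to_ascii(escaped)

# Writing with an explicit UTF-8 encoding avoids UnicodeEncodeError.
# With a real file this would be open(path, 'w', encoding='utf-8').
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding='utf-8')
writer.write(literal)
writer.flush()
utf8_bytes = buffer.getvalue()
```

Opening the output file with an explicit encoding only fixes the exception. Telling Case 1 from Case 2 still requires the tokenizer to keep the original source text.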
Hi Steve,
Try changing the code at line 534 in tokenizer.py to:
    #self.pre_tokenize()
    self.data = ''.join(self.decode_data())
    self.length = len(self.data)
With that change, the Unicode escape is stored as a raw string and is not converted to a character.
One more benefit is that token positions will also be correct for files containing Unicode escapes. I found this while debugging the position error.
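To illustrate the position issue, here is a rough sketch of why expanding escapes before tokenizing shifts column numbers. This is not javalang's actual code: expand_unicode_escapes is a hypothetical, simplified stand-in for a pre-tokenize pass, and it ignores Java's rule about escaped backslashes.

```python
import re

# Simplified pattern: a backslash, one or more 'u', then four hex digits.
UNICODE_ESCAPE = re.compile(r'\\u+([0-9a-fA-F]{4})')

def expand_unicode_escapes(source):
    """Hypothetical stand-in: replace each escape with its character."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)

# One line of Java source containing a six-character escape sequence.
line = "append('\\u2026').append(x);"
expanded = expand_unicode_escapes(line)

# Column of the second call, measured in the original and expanded text.
column_in_source = line.index(".append(x)")
column_after_expansion = expanded.index(".append(x)")
shift = column_in_source - column_after_expansion
```

Each expanded escape shortens the line by five characters, so any position computed on the expanded text drifts from the position in the file by that amount per escape.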
Good discovery, thanks.
Regards, Steve
From: chenzimin
Sent: Wednesday, February 6, 2019 2:31 AM
To: c2nes/javalang
Cc: Steve Kommrusch; Author
Subject: Re: [c2nes/javalang] String values don't properly handle unicode escapes (#58)
Hi Steve, If you comment out this line, https://github.com/c2nes/javalang/blob/7a4af7f5136dd4f4f4b1846b3872f5688429e5db/javalang/tokenizer.py#L489, the unicode string will be stored as raw string, not converted to characters. And one more benefit is that the position will also be correct for files containing unicode. I found this when I tried to debug the position error.