
String values don't properly handle unicode escapes

Open SteveKommrusch opened this issue 6 years ago • 2 comments

I am using javalang to tokenize files that include Unicode escape sequences. These are correctly tokenized as strings, but item.value is not handled cleanly. Consider the two cases below:

Case 1: builder.append(text, 0, MAX_TEXT).append('\u2026');
Case 2: builder.append(text, 0, MAX_TEXT).append('…');

In both cases, item.value is identical, and I get an exception if I try to write item.value to a file. I can catch the error and print successfully using Python like this:

      if (token_type == 'String'):
          try:
              outfile.write(item.value)
          except UnicodeEncodeError:
              outfile.write(item.value.encode('unicode-escape').decode('utf-8'))

but the Python code above prints the same value for Case 1 and Case 2. I suspect the proper fix is to use raw strings for String token values inside javalang. Below is an example of raw strings solving the problem.

>>> str1 = '…'
>>> str2 = '\u2026'
>>> print("str1: ",str1," str2:",str2)
str1:  …  str2: …
>>> str1 == str2
True
>>> str1 = r'…'
>>> str2 = r'\u2026'
>>> print("str1: ",str1," str2:",str2)
str1:  …  str2: \u2026
>>> str1 == str2
False
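
Note that the `r` prefix only changes how Python parses a literal in source code; it has no effect on a value produced at runtime, such as a token's value. The sketch below uses only the standard library (no javalang calls) to show two things: once an escape has been converted to a character, the original spelling cannot be recovered, and a stream opened with an explicit UTF-8 encoding accepts the character without raising `UnicodeEncodeError`. The helper name `escape_non_ascii` is made up for illustration.

```python
import io

def escape_non_ascii(value):
    """Replace characters outside ASCII with Python-style backslash escapes."""
    return value.encode("ascii", "backslashreplace").decode("ascii")

case1 = r"append('\u2026')"   # escape kept as six separate characters
case2 = "append('\u2026')"    # escape already converted to one character

# The two values differ, but after escaping they are identical, so the
# original spelling of case2 cannot be told apart from case1.
same_after_escaping = escape_non_ascii(case1) == escape_non_ascii(case2)

# A UTF-8 text stream can represent the converted character directly.
buffer = io.BytesIO()
out = io.TextIOWrapper(buffer, encoding="utf-8")
out.write(case2)
out.flush()
written = buffer.getvalue()
```

When writing to a real file, passing `encoding="utf-8"` to `open()` has the same effect as the in-memory stream used here.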

SteveKommrusch, Oct 31 '18 15:10

Hi Steve,

If you change the code at line 534 in tokenizer.py to:

#self.pre_tokenize()
self.data = ''.join(self.decode_data())
self.length = len(self.data)

the Unicode escape will be stored as a raw string rather than being converted to a character.

One more benefit is that token positions will also be correct for files containing Unicode. I found this while debugging the position error.
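
To illustrate why positions can drift when escapes are converted before tokenizing: each `\u2026` occupies six characters in the file but only one after conversion, so everything after it on the line shifts by five. The sketch below mimics such a conversion pass with a regular expression. `translate_unicode_escapes` is a hypothetical stand-in, not javalang's actual code, and it ignores the rule that the backslash must not itself be escaped.

```python
import re

# Hypothetical stand-in for a pre-tokenization pass; NOT javalang's code.
# Java allows one or more 'u' characters after the backslash.
UNICODE_ESCAPE = re.compile(r"\\u+([0-9a-fA-F]{4})")

def translate_unicode_escapes(source):
    """Replace each \\uXXXX escape with the character it denotes."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)

source = r"s.append('\u2026'); done();"
translated = translate_unicode_escapes(source)

# The same identifier sits at different offsets in the two texts, so
# positions computed on the translated text do not match the file.
raw_offset = source.index("done")
translated_offset = translated.index("done")
shift = raw_offset - translated_offset
```

Tokenizing the untranslated text, as the change above does, keeps offsets aligned with what an editor shows for the original file.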

chenzimin, Feb 06 '19 09:02

Good discovery, thanks.

Regards, Steve

From: chenzimin
Sent: Wednesday, February 6, 2019 2:31 AM
To: c2nes/javalang
Cc: Steve Kommrusch; Author
Subject: Re: [c2nes/javalang] String values don't properly handle unicode escapes (#58)

Hi Steve, If you comment out this line, https://github.com/c2nes/javalang/blob/7a4af7f5136dd4f4f4b1846b3872f5688429e5db/javalang/tokenizer.py#L489, the unicode string will be stored as raw string, not converted to characters. And one more benefit is that the position will also be correct for files containing unicode. I found this when I tried to debug the position error.

SteveKommrusch, Feb 07 '19 17:02