code_tokenize
Unexpected token with no matched text at the end of tokenstream
Consider the following code:
text = """private void unlockMap(Player player) {
TowerData towerData = player.getTowerData();
if (!towerData.getClass().equals(TowerData.class)) {
CommandHandler.sendTranslatedMessage(player, "commands.generic.no_permissions");
} else {
if (towerData."""
import code_tokenize as ctok
tokens = ctok.tokenize(text, lang='java', syntax_error='ignore')
assert tokens[-1].type == '.', (tokens[-1].type, tokens[-1].text)
The result of executing the above code is:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[23], line 11
8 import code_tokenize as ctok
9 tokens = ctok.tokenize(text, lang='java', syntax_error='ignore')
---> 11 assert tokens[-1].type == '.', (tokens[-1].type, tokens[-1].text)
AssertionError: ('}', '')
There is an additional token of type '}' at the end of the token stream, and it doesn't match any text. The token stream up to the last token is as expected:
[private,
void,
unlockMap,
,
(,
Player,
player,
),
{,
TowerData,
towerData,
=,
player,
.,
getTowerData,
(,
),
;,
if,
(,
!,
towerData,
.,
getClass,
(,
),
.,
equals,
(,
TowerData,
.,
class,
),
),
{,
CommandHandler,
.,
sendTranslatedMessage,
(,
player,
,,
"commands.generic.no_permissions",
),
;,
},
else,
{,
if,
(,
towerData,
.,
]
I also noticed that the fourth token in the stream above is empty ('') and of type ';', which is unexpected.
Hey! Thank you for pointing this out!
Note that code_tokenize always tries to construct the AST/CST (based on tree-sitter) before tokenization. Since tree-sitter is a best-effort parser, it may inject nodes to make the input match the grammar, and these injected nodes sometimes end up in the token stream.
If you parse a syntactically incorrect program, you can easily filter these fake nodes by removing all tokens that are marked as error nodes:
token.ast_node.has_error # Returns True for error nodes and False otherwise
Thanks a lot for your reply. Apart from your suggested check above, token.ast_node.has_error, is it okay to remove tokens that do not match any text?
[t for t in ctok.tokenize(code_text, lang='java', syntax_error='ignore') if t.text != '']
Hey! I think you should be fine for now, since no real token should match an empty string. However, if you want to be safe, I would still go with checking for error nodes.
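For anyone landing here later, the two checks discussed above can be combined into one filter. This is only a rough sketch: FakeNode and FakeToken are hypothetical stand-ins that model just the attributes mentioned in this thread (text, type, ast_node.has_error), so that the snippet runs without code_tokenize or tree-sitter installed. It also assumes that the injected '}' is flagged by has_error, as the reply above suggests; that has not been verified here against the real library.

```python
from dataclasses import dataclass


@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node; only has_error is modeled."""
    has_error: bool = False


@dataclass
class FakeToken:
    """Hypothetical stand-in for a code_tokenize token."""
    text: str
    type: str
    ast_node: FakeNode


def drop_injected_tokens(tokens):
    """Keep tokens that match some text and are not flagged as error nodes."""
    return [
        t for t in tokens
        if t.text != "" and not t.ast_node.has_error
    ]


# Tail of the stream from the report: a real '.' followed by an injected '}'.
# Marking the injected token with has_error=True is an assumption.
stream = [
    FakeToken("towerData", "identifier", FakeNode()),
    FakeToken(".", ".", FakeNode()),
    FakeToken("", "}", FakeNode(has_error=True)),
]

cleaned = drop_injected_tokens(stream)
print([t.text for t in cleaned])
```

With real tokens, the same function should work on the list returned by ctok.tokenize(..., syntax_error='ignore'), provided the tokens expose text and ast_node as described in this thread.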