code2seq
'\n' mixed in Vocabulary['token']
It seems that the counter in the vocabulary is counting 'token' tokens that end with a newline character as separate entries. For example, in vocabulary.pkl for the java-small dataset, I can find 'return': 6020684 and 'return\n': 33290 separately.
I personally fixed this by stripping path_context in Vocabulary._process_raw_sample, but I'm a little confused about whether this behavior (mixing '\n' into tokens) is intended.
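For reference, the stripping workaround could look roughly like the sketch below. This is not the actual body of Vocabulary._process_raw_sample. The function name count_tokens, the sample line, and the assumed line layout ("label ctx ctx ...", each context being "from_token,path,to_token" with subtokens joined by "|") are illustrative assumptions.

```python
from collections import Counter


def count_tokens(lines):
    """Count start/end tokens of each path context (hypothetical helper)."""
    counter = Counter()
    for line in lines:
        # strip() drops the trailing '\n' before splitting, so the last
        # token on the line is not stored as a separate 'token\n' entry.
        label, *path_contexts = line.strip().split(" ")
        for path_context in path_contexts:
            from_token, _path, to_token = path_context.split(",")
            counter.update(from_token.split("|"))
            counter.update(to_token.split("|"))
    return counter


# Made-up sample line in the assumed format, ending with a newline.
sample = ["get|value return,Ret|Expr,value value,Expr|Ret,return\n"]
counts = count_tokens(sample)
print(counts)
```

With the strip in place, the last token on the line is counted together with the other 'return' occurrences instead of under its own key.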
thank you!
It's interesting, but I'm not sure that this is the same return. The code was tokenized by a parser, so it should handle different indentation. My guess is that there are different sorts of string literals with return\n inside.
I don't understand what "different sorts of string literals with return\n inside" means, but I found lots of '*\n' tokens in vocabulary.pkl.
For example: 'EMPTY\n': 11459, '<STR>\n': 11416, 'if\n': 6900, 'exception\n': 6624, ...
Many of the 'token' entries have a trailing '\n', so I assume the vocabulary parser is reading the end of each line, newline included, as part of the last token.
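A small sketch of how this could happen and how to check for it. The layout of vocabulary.pkl is not shown in this thread, so the check runs on a plain collections.Counter. The helper name tokens_with_trailing_newline, the sample line, and the "from_token,path,to_token" context format are assumptions for illustration.

```python
from collections import Counter


def tokens_with_trailing_newline(token_counter):
    """Return the entries whose key ends with a newline (hypothetical helper)."""
    return {tok: n for tok, n in token_counter.items() if tok.endswith("\n")}


# Made-up raw line as it would come from iterating over a file: the
# trailing '\n' is still attached to the last path context.
raw_line = "get|value return,Ret|Expr,value value,Expr|Ret,return\n"

# Splitting WITHOUT stripping leaves '\n' on the last token of the line.
_label, *contexts = raw_line.split(" ")
counter = Counter()
for ctx in contexts:
    from_token, _path, to_token = ctx.split(",")
    counter.update([from_token, to_token])

suspicious = tokens_with_trailing_newline(counter)
print(suspicious)
```

Only the token that closes the line picks up the newline, which would be consistent with the '*\n' counts being much smaller than the counts of the clean tokens.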
Yeah, that seems strange. I will investigate why the parser extracted tokens with newline characters at the end.