
'\n' mixed in Vocabulary['token']

Open hehehwang opened this issue 2 years ago • 3 comments

It seems that the counter in the vocabulary is counting some 'token' entries with a trailing newline character. For example, in vocabulary.pkl for the java-small dataset, I find 'return': 6020684 and 'return\n': 33290 as separate entries.

I fixed this locally by stripping path_context in Vocabulary._process_raw_sample, but I'm a little confused about whether mixing '\n' into tokens is intended.
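A minimal sketch of the kind of fix I mean. This is not the repository's actual code: the function name `count_tokens`, the `strip` flag, and the line format (`label ctx ctx ...`, each context `from_token,path,to_token`, sub-tokens joined by `|`) are assumptions for illustration. It shows that splitting on an explicit `" "` keeps the trailing newline on the last token of a line, while stripping first removes it.

```python
from collections import Counter


def count_tokens(raw_sample: str, strip: bool) -> Counter:
    """Count token sub-tokens in one raw sample line (hypothetical format)."""
    if strip:
        # Remove the trailing newline (and surrounding whitespace) first.
        raw_sample = raw_sample.strip()
    # split(" ") with an explicit separator does NOT drop a trailing "\n".
    _label, *path_contexts = raw_sample.split(" ")
    counter = Counter()
    for path_context in path_contexts:
        from_token, _path, to_token = path_context.split(",")
        counter.update(from_token.split("|"))
        counter.update(to_token.split("|"))
    return counter


# One line as it would be read from a file, newline included.
line = "get|name x,Name|Return,return a,Name|Return,return\n"
unstripped = count_tokens(line, strip=False)
stripped = count_tokens(line, strip=True)
```

Without stripping, the last context on the line contributes `'return\n'` as a separate key; with stripping, both occurrences are counted under `'return'`.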

thank you!

hehehwang avatar Sep 23 '21 10:09 hehehwang

Interesting. But I'm not sure that this is the same return. The code was tokenized by a parser, so it should handle different indentation. My guess is that there are string literals of various kinds that contain return\n.

SpirinEgor avatar Sep 23 '21 10:09 SpirinEgor

I don't understand what "different sorts of string literals with return\n inside" means, but I found lots of '*\n' tokens in vocabulary.pkl.

For example: 'EMPTY\n': 11459, '<STR>\n': 11416, 'if\n': 6900, 'exception\n': 6624, ...

Many of the entries under 'token' have a trailing '\n', so I assume the vocabulary parser is reading the end-of-line character as part of the last token on each line.
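A small sketch of how the affected entries can be listed and folded back into their clean counterparts. The helper name `merge_newline_tokens` is hypothetical, and the counter here is built by hand from a few of the counts quoted in this thread rather than loaded from vocabulary.pkl, since I don't want to assume the pickle's exact layout.

```python
from collections import Counter


def merge_newline_tokens(counter: Counter) -> Counter:
    """Return a new counter with trailing newlines removed from every key."""
    merged = Counter()
    for token, count in counter.items():
        # 'return\n' and 'return' end up under the same key.
        merged[token.rstrip("\n")] += count
    return merged


# Hand-built stand-in for the 'token' counter, using counts from this thread.
token_counter = Counter({"return": 6020684, "return\n": 33290, "if\n": 6900})

# Entries that carry a trailing newline.
affected = {tok: cnt for tok, cnt in token_counter.items() if tok.endswith("\n")}

merged = merge_newline_tokens(token_counter)
```

This only repairs an already-built counter; stripping each line before it is split avoids the problem at the source.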

hehehwang avatar Oct 03 '21 18:10 hehehwang

Yeah, that seems strange. I will investigate why the parser extracted tokens with newline characters at the end.

SpirinEgor avatar Oct 04 '21 10:10 SpirinEgor