bi-att-flow

process_tokens() in utils.py

Open rubby33 opened this issue 7 years ago • 2 comments

Hi, I have a question about process_tokens(temp_tokens) in utils.py.

After process_tokens() is called, the tokens in temp_tokens are split again on the punctuation in [-−—–/~"\'“’”‘°].

This can leave zero-length items in the result of "xi = [process_tokens(tokens) for tokens in xi]". In other words, some items in xi may be "" (the empty string). I don't think these are necessary.
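A minimal sketch of why the empty strings appear, assuming process_tokens() splits each token with re.split and a capturing character class. The helper name split_token and the exact pattern are assumptions for illustration, not copied from utils.py:

```python
import re

# Assumed pattern: a capturing character class built from the punctuation
# listed above. The capturing group makes re.split keep the delimiters.
PUNCT = "-\u2212\u2014\u2013/~\"'\u201C\u2019\u201D\u2018\u00B0"
PATTERN = re.compile("([" + PUNCT + "])")


def split_token(token):
    """Hypothetical helper: split one token on the punctuation, keeping delimiters."""
    return PATTERN.split(token)


# Punctuation inside a word: no empty pieces.
print(split_token("well-known"))  # ['well', '-', 'known']

# A token that is only punctuation: re.split also returns the empty text
# before and after the delimiter.
print(split_token('"'))  # ['', '"', '']
```

If that assumption holds, any token that starts or ends with one of these characters contributes an empty string on that side.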

From my experiments, if I remove process_tokens(), performance decreases. If I keep process_tokens() and remove the empty strings from xi, performance seems to be almost the same.

Thanks.

rubby33 avatar May 04 '17 02:05 rubby33

Hi @rubby33, I am a bit confused about what the issue is. Could you please clarify? Thanks!

seominjoon avatar Jun 22 '17 21:06 seominjoon

An example: Text: the league emphasized the "golden anniversary" with various gold.... After process_tokens(), the result is: 'the' 'league' 'emphasized' 'the' '' '"' '' 'golden' 'anniversary' '' '"' '' 'with' 'various'

There will be four empty strings '' after the following code: xi = [process_tokens(tokens) for tokens in xi]
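The example can be reproduced, and the empty strings filtered out, with a sketch like the one below. It assumes process_tokens() is a re.split over a capturing character class. The function name process_tokens_sketch, the keep_empty flag, and the pre-tokenized input are hypothetical and are not the code in utils.py:

```python
import re

# Assumed pattern, built from the punctuation set quoted in this issue.
PATTERN = re.compile("([-\u2212\u2014\u2013/~\"'\u201C\u2019\u201D\u2018\u00B0])")


def process_tokens_sketch(temp_tokens, keep_empty=True):
    """Split each token on the punctuation; optionally drop empty pieces."""
    tokens = []
    for token in temp_tokens:
        pieces = PATTERN.split(token)
        if not keep_empty:
            # Drop the '' pieces that re.split returns around a delimiter.
            pieces = [p for p in pieces if p]
        tokens.extend(pieces)
    return tokens


# One sentence, already tokenized (hypothetical input matching the example).
xi = [["the", "league", "emphasized", "the", '"', "golden",
       "anniversary", '"', "with", "various"]]

with_empty = [process_tokens_sketch(tokens) for tokens in xi]
without_empty = [process_tokens_sketch(tokens, keep_empty=False) for tokens in xi]

print(with_empty[0].count(""))  # 4
print(without_empty[0] == xi[0])  # True
```

Filtering inside the function keeps the delimiter tokens and only removes the zero-length ones, so the sentence and token nesting of xi is unchanged.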

rubby33 avatar Jun 23 '17 03:06 rubby33