bi-att-flow
process_tokens() in utils.py
Hi, I have a question about process_tokens(temp_tokens) in utils.py.
After process_tokens() is called, each token in temp_tokens is split again on any punctuation character in [-−—–/~"\'“’”‘°].
As a result, some items produced by "xi = [process_tokens(tokens) for tokens in xi]" have length 0. In other words, some items in xi may be "" (the empty string). I don't think this is necessary.
In my experiments, if I remove process_tokens(), performance decreases. If I keep process_tokens() and remove the empty strings from xi, performance seems to be almost the same.
Thanks.
Hi @rubby33, I am a bit confused about what the issue is. Could you please clarify? Thanks!
An example. Text: the league emphasized the "golden anniversary" with various gold.... After process_tokens(), the result is: 'the' 'league' 'emphasized' 'the' '' '"' '' 'golden' 'anniversary' '' '"' '' 'with' 'various'
So there are four empty strings '' after running the following code: xi = [process_tokens(tokens) for tokens in xi]
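To make the behavior concrete, here is a minimal sketch that reproduces it. It is not the actual code from utils.py: the function names `split_on_punct` and `split_on_punct_no_empty` are hypothetical, and I am assuming the splitting is done with `re.split` and a capturing group over the character class quoted above. With a capturing group, `re.split` returns an empty string on each side of a delimiter that sits at the start or end of a token, which would explain the '' entries.

```python
import re

# Assumption: the delimiter set is the character class quoted in this issue.
# The hyphen comes first so it is treated literally inside the [...] class.
DELIMS = "-\u2212\u2014\u2013/~\"'\u201c\u2019\u201d\u2018\u00b0"
PATTERN = re.compile("([{}])".format(DELIMS))


def split_on_punct(tokens):
    """Hypothetical stand-in for process_tokens(): split each token on DELIMS.

    The capturing group keeps the delimiter itself in the output, but it also
    yields '' whenever the delimiter is at the start or end of a token.
    """
    out = []
    for tok in tokens:
        out.extend(PATTERN.split(tok))
    return out


def split_on_punct_no_empty(tokens):
    """Same split, with the empty strings filtered out afterwards."""
    return [t for t in split_on_punct(tokens) if t]


tokens = ["the", '"', "golden", "anniversary", '"', "with"]
print(split_on_punct(tokens))
# ['the', '', '"', '', 'golden', 'anniversary', '', '"', '', 'with']
print(split_on_punct_no_empty(tokens))
# ['the', '"', 'golden', 'anniversary', '"', 'with']
```

Filtering after the split keeps the punctuation tokens themselves and only drops the empty strings, which is the variant I compared against above.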