nltk
`TreebankWordDetokenizer().detokenize()` introduces unexpected spaces before periods.
Description
The `TreebankWordDetokenizer().detokenize()` method leaves an extra space before a period when the period is a separate token in the input and more words follow it. The issue arises from the spaces added here: https://github.com/nltk/nltk/blob/d7b428daa90b41edc5adaf92755cab7aec7f5df2/nltk/tokenize/treebank.py#L362
These spaces are not properly removed when there are words following the period.
Reproducible code
import nltk
from nltk import pos_tag, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
text = "Lorem ipsum dolor sit amet. consectetur adipiscing elit."
d = TreebankWordDetokenizer()
tagged_words = pos_tag(word_tokenize(text))
words = [word for word, tag in tagged_words]
print(d.detokenize(words))
This code snippet produces the following output:
Lorem ipsum dolor sit amet . consectetur adipiscing elit.
which contains an unexpected space before the first period.
Expected behavior
The expected output from `TreebankWordDetokenizer().detokenize()` is:
Lorem ipsum dolor sit amet. consectetur adipiscing elit.
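Until the detokenizer handles this case, a post-processing pass over its output can serve as a stopgap. The following is a minimal sketch, not a fix for NLTK itself: `strip_space_before_period` is a hypothetical helper that is not part of the library, and it assumes that any whitespace sitting between a word and a standalone period is unwanted.

```python
import re

# Hypothetical helper, not part of NLTK.
# Matches whitespace followed by a single period, where:
#   - the character before the whitespace is neither a space nor a period, and
#   - the period is followed by whitespace or the end of the string.
# This leaves ellipses (" ... ") and decimals (" .5") untouched.
_SPACE_BEFORE_PERIOD = re.compile(r"(?<=[^\s.])\s+\.(?=\s|$)")


def strip_space_before_period(text: str) -> str:
    """Remove whitespace that precedes a standalone period."""
    return _SPACE_BEFORE_PERIOD.sub(".", text)


# The detokenizer output reported above, used here as a plain string.
detokenized = "Lorem ipsum dolor sit amet . consectetur adipiscing elit."
print(strip_space_before_period(detokenized))
```

Applied to the reported output, this yields the expected string. It would also rewrite any text in which a space before a period is intentional, so it is a workaround for this symptom rather than a general solution.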
Environment
- OS: macOS 14.1.1
- Python: 3.11.6
- nltk: 3.8.1