`TreebankWordDetokenizer().detokenize()` introduces unexpected spaces before periods.

Open Alnusjaponica opened this issue 2 years ago • 0 comments

Description

`TreebankWordDetokenizer().detokenize()` leaves an extra space before a period when the period is a separate token in the input. The space is added here: https://github.com/nltk/nltk/blob/d7b428daa90b41edc5adaf92755cab7aec7f5df2/nltk/tokenize/treebank.py#L362 and is not removed when other words follow the period.
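To illustrate the suspected mechanism without depending on NLTK, here is a minimal sketch. The pattern `FINAL_PERIOD` and the helper `strip_final_period_pad` are hypothetical names written for this example, not copied from the library; they only assume that the period clean-up rule is anchored to the end of the string, which would explain why a mid-text period keeps its padding.

```python
import re

# Hypothetical end-anchored rule, for illustration only: it removes the
# space before a period only when that period ends the string.
FINAL_PERIOD = re.compile(r"([^.])\s(\.)\s*$")


def strip_final_period_pad(text: str) -> str:
    """Drop the padding before a string-final period, and nothing else."""
    return FINAL_PERIOD.sub(r"\1\2", text)


tokens = ["Lorem", "ipsum", "dolor", "sit", "amet", ".",
          "consectetur", "adipiscing", "elit", "."]
joined = " ".join(tokens)  # every token, including ".", is space-separated
result = strip_final_period_pad(joined)
print(result)
```

Under that assumption the last period is attached to "elit", while the period after "amet" keeps its leading space, which matches the output reported in this issue.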

Reproducible code

import nltk
from nltk import pos_tag, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')


text = "Lorem ipsum dolor sit amet. consectetur adipiscing elit."
d = TreebankWordDetokenizer()
# Tokenize and tag, then keep only the words; each "." becomes its own token.
tagged_words = pos_tag(word_tokenize(text))
words = [word for word, tag in tagged_words]
print(d.detokenize(words))

This code snippet produces the following output:

Lorem ipsum dolor sit amet . consectetur adipiscing elit.

which contains an unexpected space before the first period.

Expected behavior

`TreebankWordDetokenizer().detokenize()` should return:

Lorem ipsum dolor sit amet. consectetur adipiscing elit.
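Until the detokenizer handles this case, one possible workaround is to post-process its output. The sketch below is an assumption-laden illustration, not part of NLTK: `PAD_BEFORE_PERIOD` and `attach_periods` are hypothetical names, and the rule simply attaches any standalone period to the preceding word, so it does not try to handle abbreviations, ellipses, or other special cases.

```python
import re

# Assumption: a standalone "." that follows a word should attach to that word.
# The lookbehind requires a non-space, non-period character before the padding,
# and the lookahead requires whitespace or end of string after the period.
PAD_BEFORE_PERIOD = re.compile(r"(?<=[^\s.])\s+\.(?=\s|$)")


def attach_periods(text: str) -> str:
    """Remove whitespace between a word and a following standalone period."""
    return PAD_BEFORE_PERIOD.sub(".", text)


detokenized = "Lorem ipsum dolor sit amet . consectetur adipiscing elit."
fixed = attach_periods(detokenized)
print(fixed)
```

Applied to the string reported above, this yields the expected sentence; text that already has correctly attached periods passes through unchanged.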

Environment

OS: macOS 14.1.1
Python: 3.11.6
nltk: 3.8.1

Dec 05 '23 05:12 Alnusjaponica