
Tokens split by space in English text

Open PhilipMay opened this issue 3 years ago • 1 comment

Hi, it seems the English sentences are tokenized, with spaces inserted around punctuation. For example:

[...] Preserve , known as Palos Verdes Peninsula of California .

The German texts do not have these spaces:

[...] können, sind die Ergebnisse hoch.

Can you provide the English texts without those spaces?

PhilipMay, Sep 13 '21 17:09

Hi,

Thanks for reporting the issue. Unfortunately, we no longer have the texts from before tokenization. I believe the tokenization was done with nltk.word_tokenize, the same tokenizer used for QQP (https://github.com/google-research-datasets/paws/blob/master/qqp_generate_data.py).

yuanzh, Sep 13 '21 20:09
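
For anyone who needs English text without the extra spaces in the meantime, the spacing can be approximately reversed. Below is a minimal sketch, assuming the tokens follow the Treebank-style conventions that nltk.word_tokenize produces. The helper name detokenize and its regex rules are illustrative assumptions, not part of the PAWS repository, and the original text cannot be recovered exactly (for example, quote characters rewritten by the tokenizer are not handled here).

```python
import re


def detokenize(text: str) -> str:
    """Heuristically undo whitespace tokenization around punctuation.

    Hypothetical helper: an approximation, not an exact inverse of
    nltk.word_tokenize.
    """
    # Drop the space before closing punctuation: "California ." -> "California."
    text = re.sub(r"\s+([.,;:!?%)\]}])", r"\1", text)
    # Drop the space after opening brackets: "( yet" -> "(yet"
    text = re.sub(r"([(\[{])\s+", r"\1", text)
    # Re-attach clitics split off by the tokenizer: "do n't" -> "don't"
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text


if __name__ == "__main__":
    sample = "Preserve , known as Palos Verdes Peninsula of California ."
    print(detokenize(sample))
```

NLTK also ships a TreebankWordDetokenizer class in nltk.tokenize.treebank that takes a list of tokens and may give better results than these regexes, but it is likewise only an approximate inverse.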