paws
Tokens split by space in English text
Hi, it seems the English sentences have been tokenized, with tokens separated by spaces. For example:
[...] Preserve , known as Palos Verdes Peninsula of California .
The German texts, by contrast, do not have these extra spaces:
[...] können, sind die Ergebnisse hoch.
Can you provide the English texts without those spaces?
Hi,
Thanks for reporting the issue. Unfortunately, we no longer have the texts from before tokenization. I believe the tokenization was done with nltk.word_tokenize, the same tokenizer used for QQP (https://github.com/google-research-datasets/paws/blob/master/qqp_generate_data.py).
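As a possible workaround, the spacing introduced by word-level tokenization can be approximately reversed. Below is a minimal sketch using a few hand-written regex rules (the `detokenize` helper is hypothetical, not part of the dataset tooling); it will not recover the original text exactly in all cases, and nltk's `TreebankWordDetokenizer` is a more complete alternative.

```python
import re

def detokenize(text):
    """Approximately undo the spacing added by nltk.word_tokenize."""
    # Re-attach punctuation that was split off as separate tokens.
    text = re.sub(r" ([,.;:!?%)\]])", r"\1", text)
    # Remove the space after opening brackets.
    text = re.sub(r"([(\[]) ", r"\1", text)
    # Re-attach common English contraction suffixes.
    text = re.sub(r" ('s|'re|'ve|n't|'ll|'d|'m)\b", r"\1", text)
    return text

print(detokenize("Preserve , known as Palos Verdes Peninsula of California ."))
# → Preserve, known as Palos Verdes Peninsula of California.
```

Quote characters and some contractions may still come out differently from the original sentences, so this should be treated as a best-effort cleanup rather than a faithful reconstruction.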