
Use spaCy tokenizer for Dutch

Open BramVanroy opened this issue 1 year ago • 3 comments

Dutch is a relatively hard language to tokenize, especially when it comes to the possessive. From my past testing, I prefer spaCy's tokenization, as it is less prone than NLTK to splitting off an apostrophe that is part of a possessive.

from datatrove.utils.word_tokenizers import SpaCyTokenizer, NLTKTokenizer

tokenizer = SpaCyTokenizer("nl")

print(tokenizer.word_tokenize("Ik eet die graag 's morgens. 's Anderdaags zie ik oma's kippen in Belgiës troeven. Dante’s hel en Louis’ honden."))
# ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags', 'zie', 'ik', "oma's", 'kippen', 'in', 'Belgiës', 'troeven', '.', 'Dante', '’s', 'hel', 'en', 'Louis', '’', 'honden', '.']


tokenizer = NLTKTokenizer("dutch")
print(tokenizer.word_tokenize("Ik eet die graag 's morgens. 's Anderdaags zie ik oma's kippen in Belgiës troeven. Dante’s hel en Louis’ honden."))
# ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags', 'zie', 'ik', 'oma', "'s", 'kippen', 'in', 'Belgiës', 'troeven', '.', 'Dante', '’', 's', 'hel', 'en', 'Louis', '’', 'honden', '.']

Neither of these is perfect: both split the curly-apostrophe possessive in "Dante’s", and only spaCy keeps "oma's" intact. I prefer the spaCy one.
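To make the disagreement between the two outputs easier to see, here is a minimal sketch that diffs the two token lists with Python's standard `difflib`. The helper name `token_diff` is hypothetical and not part of datatrove; the token lists are copied from the outputs shown in this comment, so the snippet runs without spaCy or NLTK installed.

```python
from difflib import SequenceMatcher


def token_diff(a, b):
    """Return (tokens_a, tokens_b) pairs for every span where two tokenizations disagree.

    Hypothetical helper for illustration only; not a datatrove API.
    """
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Token lists copied from the SpaCyTokenizer("nl") and NLTKTokenizer("dutch") outputs above.
spacy_tokens = ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags',
                'zie', 'ik', "oma's", 'kippen', 'in', 'Belgiës', 'troeven', '.',
                'Dante', '’s', 'hel', 'en', 'Louis', '’', 'honden', '.']
nltk_tokens = ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags',
               'zie', 'ik', 'oma', "'s", 'kippen', 'in', 'Belgiës', 'troeven', '.',
               'Dante', '’', 's', 'hel', 'en', 'Louis', '’', 'honden', '.']

for spacy_span, nltk_span in token_diff(spacy_tokens, nltk_tokens):
    print(spacy_span, "vs", nltk_span)
```

On this sentence the two tokenizers differ in exactly two places: the straight-apostrophe possessive "oma's" and the curly-apostrophe possessive "Dante’s".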

BramVanroy avatar Sep 04 '24 19:09 BramVanroy

Thank you for the insight! We have been experimenting a bit and had actually already started swapping out NLTK-based tokenizers for spaCy, but it's good to have some confirmation from someone who speaks one of the affected languages.

guipenedo avatar Sep 05 '24 08:09 guipenedo

Great, let me know if you need me to look at other things specific to Dutch!

BramVanroy avatar Sep 05 '24 09:09 BramVanroy

Any chance we can get this merged @guipenedo?

BramVanroy avatar Sep 14 '24 15:09 BramVanroy

Hi, closing this as it was added during the fineweb-2 changes. Thank you again :)

guipenedo avatar Dec 20 '24 12:12 guipenedo