Use spaCy tokenizer for Dutch
Dutch is a relatively hard language to tokenize, especially when it comes to possessives. From my past testing, I prefer spaCy's tokenization: unlike NLTK, it does not tend to split off the apostrophe when it is part of a possessive.
```python
from datatrove.utils.word_tokenizers import SpaCyTokenizer, NLTKTokenizer

text = "Ik eet die graag 's morgens. 's Anderdaags zie ik oma's kippen in Belgiës troeven. Dante’s hel en Louis’ honden."

tokenizer = SpaCyTokenizer("nl")
print(tokenizer.word_tokenize(text))
# ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags', 'zie', 'ik', "oma's", 'kippen', 'in', 'Belgiës', 'troeven', '.', 'Dante', '’s', 'hel', 'en', 'Louis', '’', 'honden', '.']

tokenizer = NLTKTokenizer("dutch")
print(tokenizer.word_tokenize(text))
# ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags', 'zie', 'ik', 'oma', "'s", 'kippen', 'in', 'Belgiës', 'troeven', '.', 'Dante', '’', 's', 'hel', 'en', 'Louis', '’', 'honden', '.']
```
Neither of these is perfect, but I prefer spaCy's output.
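To make the preferred behaviour concrete without installing either library, here is a toy regex tokenizer (my own sketch, not datatrove or spaCy code; `TOKEN_RE` and `toy_word_tokenize` are illustrative names) that keeps Dutch possessive clitics attached, as spaCy does for `oma's`, and treats the standalone clitic `'s` (as in `'s morgens`) as a single token:

```python
import re

# Minimal sketch, NOT the datatrove/spaCy implementation: the pattern tries,
# in order, the standalone clitic "'s", a word with an attached possessive
# ("oma's", "Dante’s"), a plain word, and finally any punctuation character.
TOKEN_RE = re.compile(r"'s|\w+['’]s|\w+|[^\w\s]")

def toy_word_tokenize(text: str) -> list[str]:
    """Return a flat token list; each punctuation mark becomes its own token."""
    return TOKEN_RE.findall(text)

print(toy_word_tokenize("'s Morgens zie ik oma's kippen."))
# ["'s", 'Morgens', 'zie', 'ik', "oma's", 'kippen', '.']
```

A genitive apostrophe with no trailing `s` (`Louis’ honden`) still splits into `Louis` / `’` / `honden` here, matching both libraries above; the sketch only fixes the clitic cases.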
Thank you for the insight! We have been experimenting a bit and had actually already started swapping out NLTK-based tokenizers for spaCy, but it's good to have confirmation from someone who speaks one of the affected languages.
Great, let me know if you need me to look at other things specific to Dutch!
Any chance we can get this merged @guipenedo?
Hi, closing this as this has been added during the fineweb-2 changes. Thank you again :)