Use spaCy tokenizer for Dutch
Dutch is a relatively hard language to tokenize, especially when it comes to possessives. From my past testing, I prefer spaCy's tokenization: unlike NLTK, it does not tend to split off the apostrophe when it is part of a possessive.
```python
from datatrove.utils.word_tokenizers import SpaCyTokenizer, NLTKTokenizer

text = "Ik eet die graag 's morgens. 's Anderdaags zie ik oma's kippen in Belgiës troeven. Dante’s hel en Louis’ honden."

tokenizer = SpaCyTokenizer("nl")
print(tokenizer.word_tokenize(text))
# ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags', 'zie', 'ik', "oma's", 'kippen', 'in', 'Belgiës', 'troeven', '.', 'Dante', '’s', 'hel', 'en', 'Louis', '’', 'honden', '.']

tokenizer = NLTKTokenizer("dutch")
print(tokenizer.word_tokenize(text))
# ['Ik', 'eet', 'die', 'graag', "'s", 'morgens', '.', "'s", 'Anderdaags', 'zie', 'ik', 'oma', "'s", 'kippen', 'in', 'Belgiës', 'troeven', '.', 'Dante', '’', 's', 'hel', 'en', 'Louis', '’', 'honden', '.']
```
Neither of these is perfect, but I prefer spaCy's output.
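To make the preferred behaviour concrete without installing either library, here is a toy regex tokenizer (my own sketch, not datatrove or spaCy code; `TOKEN_RE` and `toy_word_tokenize` are illustrative names) that keeps Dutch possessive clitics attached, as spaCy does for `oma's`, and treats the standalone clitic `'s` (as in `'s morgens`) as a single token:

```python
import re

# Minimal sketch, NOT the datatrove/spaCy implementation: the pattern tries,
# in order, the standalone clitic "'s", a word with an attached possessive
# ("oma's", "Dante’s"), a plain word, and finally any punctuation character.
TOKEN_RE = re.compile(r"'s|\w+['’]s|\w+|[^\w\s]")

def toy_word_tokenize(text: str) -> list[str]:
    """Return a flat token list; each punctuation mark becomes its own token."""
    return TOKEN_RE.findall(text)

print(toy_word_tokenize("'s Morgens zie ik oma's kippen."))
# ["'s", 'Morgens', 'zie', 'ik', "oma's", 'kippen', '.']
```

A genitive apostrophe with no trailing `s` (`Louis’ honden`) still splits into `Louis` / `’` / `honden` here, matching both libraries above; the sketch only fixes the clitic cases.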
Thank you for the insight! We have been experimenting a bit and had actually already started swapping out NLTK-based tokenizers for spaCy, but it's good to have confirmation from someone who speaks one of the affected languages.
Great, let me know if you need me to look at other things specific to Dutch!
Any chance we can get this merged @guipenedo?
Hi, closing this as this has been added during the fineweb-2 changes. Thank you again :)