cdec
tokenize-anything.sh on Italian
Hi, I just wanted to report an error that the tokenize-anything.sh script makes on Italian sentences: it doesn't split "C'è" ("There is").
This also applies to other contractions whose second part is "è".
Examples of other contractions that should be split, but aren't:
l'uomo all'interno nell'obbligo
These involve articles. Before a vowel, the singular definite articles (lo, la) are spelled l'. Combined with prepositions, this yields all', dall', dell', nell', sull'. The feminine indefinite article (una) is written un' before a vowel.
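To make the expected behaviour concrete, here is a rough Python sketch of the kind of splitting I have in mind. It is not how tokenize-anything.sh works internally; the prefix list and the function name split_elisions are made up for illustration, and keeping the apostrophe attached to the first part is my assumption about the desired output.

```python
import re

# Hypothetical prefix list: the elided forms mentioned above, plus "c" for "c'è".
ELIDED_PREFIXES = ["l", "all", "dall", "dell", "nell", "sull", "un", "c"]

# Match a whole-word prefix followed by a straight or curly apostrophe,
# but only when another word character comes right after it.
_ELISION = re.compile(
    r"\b(" + "|".join(ELIDED_PREFIXES) + r")(['\u2019])(?=\w)",
    re.IGNORECASE,
)

def split_elisions(text):
    """Insert a space after an elided article or preposition (assumed convention)."""
    return _ELISION.sub(r"\1\2 ", text)

print(split_elisions("C'è l'uomo all'interno"))
# C' è l' uomo all' interno
```

A real fix would need a fuller prefix list and would have to follow whatever convention the script already uses for apostrophes in other languages.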