parallel-corpora-tools
Remove character-level tokenized words
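Remove sentences in which the number of non-space characters is equal (or possibly very close) to the number of tokens. Such sentences have been tokenized at the character level, so every token is a single character.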
English: ( c o n t i n u a t i o n )
Slovenian: ( n a d a l j e v a n j e )
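
The check described above could be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation. The names `is_char_tokenized` and `filter_pairs` and the parameters `tolerance` and `min_tokens` are hypothetical. The sketch assumes tokens are separated by whitespace and that a sentence pair is dropped when either side is character-tokenized. `tolerance` is one possible reading of "very close", and `min_tokens` is an added guard so that very short sentences such as "I ." are not removed.

```python
def is_char_tokenized(sentence: str, tolerance: float = 0.0, min_tokens: int = 3) -> bool:
    """Return True if the sentence looks character-level tokenized.

    tolerance and min_tokens are illustrative assumptions, not part of the
    original description: tolerance=0.0 requires the token count to equal
    the non-space character count exactly; a larger value allows some
    multi-character tokens.
    """
    tokens = sentence.split()  # whitespace tokenization (assumed)
    if len(tokens) < min_tokens:
        return False  # too short to judge reliably
    n_chars = sum(len(tok) for tok in tokens)  # non-space characters
    # Every token has at least one character, so len(tokens) <= n_chars;
    # the two are equal only when every token is a single character.
    return len(tokens) >= (1.0 - tolerance) * n_chars


def filter_pairs(pairs, tolerance: float = 0.0):
    """Keep only sentence pairs where neither side is character-tokenized."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if not (is_char_tokenized(src, tolerance) or is_char_tokenized(tgt, tolerance))
    ]


if __name__ == "__main__":
    corpus = [
        ("( c o n t i n u a t i o n )", "( n a d a l j e v a n j e )"),
        ("A continuation .", "Nadaljevanje ."),
    ]
    for src, tgt in filter_pairs(corpus):
        print(src, "|||", tgt)
```

With `tolerance=0.0` only exact matches are removed. Raising it (for example to `0.2`) also catches sentences where a few characters were left joined, at the cost of possibly removing legitimate sentences made up mostly of one-character tokens.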