4 comments by Cagri Toraman

@Narsil thanks for the answer. Please try your script with this dataset to reproduce my case: `dataset = load_dataset("oscar", "unshuffled_deduplicated_tr")["train"]`. As you mentioned, my dataset has many Unicode characters since it...
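For reference, a minimal sketch of the kind of training run I mean, assuming the `tokenizers` library's BPE model with a `Whitespace` pre-tokenizer (the vocab size and special tokens here are placeholders, not the exact script):

```python
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Turkish split of OSCAR, as in my setup.
dataset = load_dataset("oscar", "unshuffled_deduplicated_tr")["train"]

def batch_iterator(batch_size=1000):
    # Stream raw text in batches so the whole split never sits in memory.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=50000, special_tokens=["[UNK]"])  # placeholder settings
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
```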

[tokenizer.zip](https://github.com/huggingface/tokenizers/files/7898234/tokenizer.zip) (with the first 10k tokens)

I already filtered the dataset for non-Turkish sentences, but still got thousands of Chinese/Japanese characters when I used `Whitespace`. I got rid of most of them when I used...
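A quick way to see why those characters survive: `Whitespace` only splits on whitespace and punctuation, so a run of CJK characters stays together as one "word" and its characters end up in the vocabulary. A small illustration (the sample string is made up):

```python
from tokenizers.pre_tokenizers import Whitespace

# Mixed Turkish/CJK line, of the kind that slips through the Turkish split.
sample = "Merhaba dünya 你好世界 deneme"

# The CJK run is kept as a single pre-token rather than being dropped or split.
print(Whitespace().pre_tokenize_str(sample))
```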

I do not know why, but I found that OSCAR's Turkish split has many non-Turkish webpages, probably missed by the curators. I found them with a language detector. I have...
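For illustration, a sketch of this kind of per-document filtering with `langdetect` (the specific detector here is just an example; any per-document language ID would do):

```python
from datasets import load_dataset
from langdetect import detect  # example detector, not necessarily the one used

dataset = load_dataset("oscar", "unshuffled_deduplicated_tr")["train"]

def is_turkish(example):
    # Keep only documents whose detected language is Turkish.
    # Detection can fail on very short or noisy texts; treat failures as non-Turkish.
    try:
        return detect(example["text"]) == "tr"
    except Exception:
        return False

filtered = dataset.filter(is_turkish)
print(f"Kept {len(filtered)} of {len(dataset)} documents")
```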