4 comments by Cagri Toraman

@Narsil thanks for the answer. Please try your script with this dataset to reproduce my case: `dataset = load_dataset("oscar", "unshuffled_deduplicated_tr")["train"]`. As you mentioned, my dataset has many Unicode characters since it...
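For reference, a minimal sketch of the kind of training run I mean, assuming the `tokenizers` library's BPE model with a `Whitespace` pre-tokenizer (the vocab size and special tokens here are placeholders, not the exact script):

```python
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Turkish split of OSCAR, as in my setup.
dataset = load_dataset("oscar", "unshuffled_deduplicated_tr")["train"]

def batch_iterator(batch_size=1000):
    # Stream raw text in batches so the whole split never sits in memory.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=50000, special_tokens=["[UNK]"])  # placeholder settings
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
```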

[tokenizer.zip](https://github.com/huggingface/tokenizers/files/7898234/tokenizer.zip) (with the first 10k tokens)

I already filtered the dataset for non-Turkish sentences, but still got thousands of Chinese/Japanese characters when I used `Whitespace`. I got rid of most of them when I used...
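A quick way to see why those characters survive: `Whitespace` only splits on whitespace and punctuation, so a run of CJK characters stays together as one "word" and its characters end up in the vocabulary. A small illustration (the sample string is made up):

```python
from tokenizers.pre_tokenizers import Whitespace

# Mixed Turkish/CJK line, of the kind that slips through the Turkish split.
sample = "Merhaba dünya 你好世界 deneme"

# The CJK run is kept as a single pre-token rather than being dropped or split.
print(Whitespace().pre_tokenize_str(sample))
```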

I do not know why, but I found that OSCAR's Turkish split has many non-Turkish webpages, probably missed by the curators. I found them with a language detector. I have...
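For illustration, a sketch of this kind of per-document filtering with `langdetect` (the specific detector here is just an example; any per-document language ID would do):

```python
from datasets import load_dataset
from langdetect import detect  # example detector, not necessarily the one used

dataset = load_dataset("oscar", "unshuffled_deduplicated_tr")["train"]

def is_turkish(example):
    # Keep only documents whose detected language is Turkish.
    # Detection can fail on very short or noisy texts; treat failures as non-Turkish.
    try:
        return detect(example["text"]) == "tr"
    except Exception:
        return False

filtered = dataset.filter(is_turkish)
print(f"Kept {len(filtered)} of {len(dataset)} documents")
```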