YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
Hi, in the original BPE paper, as well as in the BPE-dropout paper, the authors apply word-based tokenization (namely, the Moses tokenizer, as well as some others) before the...
Hi! I'm trying to install this module with the following command: ``` pip install youtokentome ``` ...but I got an error saying that the specified path does not exist. ```...
Hello, I took a look at the [benchmarks page](https://github.com/VKCOM/YouTokenToMe/blob/master/benchmark.md). I wanted to know how YouTokenToMe's speed compares to [subword-nmt](https://github.com/rsennrich/subword-nmt), and whether there is a reason why it was left out of the...
I want to use this yttm model. However, I want to add a [MASK] token to the vocabulary. In this case, how can I predefine special tokens?
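As far as I can tell, the training API only exposes the four built-in special tokens (pad, unk, bos, eos) and has no parameter for user-defined tokens such as [MASK]. Below is a minimal sketch of a workaround that reserves an id outside the trained vocabulary and handles it in user code; the `MASK_ID` convention and file names are assumptions, not part of the library.

```python
import youtokentome as yttm

# Training only exposes the four built-in special tokens (pad/unk/bos/eos);
# there is no parameter for user-defined tokens such as [MASK].
yttm.BPE.train(
    data="train.txt", model="model.bpe", vocab_size=24000,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)

bpe = yttm.BPE("model.bpe")

# Hypothetical workaround: reserve an id just past the trained vocabulary
# for [MASK] and handle it in your own pre/post-processing.
MASK_ID = bpe.vocab_size()

ids = bpe.encode(["hello world"], output_type=yttm.OutputType.ID)[0]
ids[0] = MASK_ID  # e.g. mask the first position
```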
I want to use YouTokenToMe for fast id encoding, but I need to do it with embeddings taken from here: https://nlp.h-its.org/bpemb/ Obviously, there is a pre-defined vocab there. Right...
In `YouTokenToMe`, BPE-dropout always produces the same segmentation for the same input. That contradicts the idea described in the paper: ``` During segmentation, at each merge step some merges are randomly...
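A minimal way to check the reported behavior, assuming an already-trained model file named `model.bpe`: encode the same sentence repeatedly with a non-zero `dropout_prob` and count the distinct segmentations.

```python
import youtokentome as yttm

bpe = yttm.BPE("model.bpe")  # assumes an already-trained model file

# If dropout were sampled independently on every call, repeated encodings
# of the same input should differ from call to call.
variants = set()
for _ in range(10):
    subwords = bpe.encode(["unbelievable tokenization"],
                          output_type=yttm.OutputType.SUBWORD,
                          dropout_prob=0.1)[0]
    variants.add(tuple(subwords))

print(len(variants))  # 1 reproduces the issue; > 1 means dropout is stochastic
```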
I would like to know if there is any way to control the splitting of a word into tokens, besides setting BPE dropout? For example: "Best" can be tokenized...
How can I train with multiple corpus files? Is it possible without merging the files together?
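The training call appears to accept only a single `data` path, so one workaround, sketched below under the assumption that plain concatenation is acceptable for your corpora, is to stream all files into one temporary file first. The file names here are placeholders.

```python
import tempfile
import youtokentome as yttm

corpus_files = ["corpus_a.txt", "corpus_b.txt", "corpus_c.txt"]  # hypothetical paths

# BPE.train takes one input path, so concatenate the corpora into a
# temporary file and point training at that file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as merged:
    for path in corpus_files:
        with open(path) as f:
            for line in f:
                merged.write(line if line.endswith("\n") else line + "\n")
    merged_path = merged.name

yttm.BPE.train(data=merged_path, model="model.bpe", vocab_size=24000)
```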
Hi, I am very confused about how the vocab function works. It seems vocab only reads a model file, which only contains the token ids without the mapping (the id...
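If it helps, my understanding is that the mapping is implicit: `vocab()` returns the subwords in id order, so the list index is the token id, and `id_to_subword` / `subword_to_id` expose the same mapping explicitly. A small sketch, assuming a trained `model.bpe`:

```python
import youtokentome as yttm

bpe = yttm.BPE("model.bpe")  # assumes a trained model

# vocab() returns the subwords in id order, so the list index is the token id.
subwords = bpe.vocab()
id_to_token = dict(enumerate(subwords))
token_to_id = {tok: i for i, tok in enumerate(subwords)}

# The same mapping via explicit lookups:
print(bpe.id_to_subword(5), bpe.subword_to_id(bpe.id_to_subword(5)))
```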
Any idea why this would happen?
```bash
Training parameters
input: _training_aux
model: vocab.bpe
vocab_size: 24000
n_threads: 8
character_coverage: 1
pad: 0
unk: 1
bos: 2
eos: 3

reading file...
learning...
```
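For reference, a sketch of the equivalent Python training call for the parameters shown in the log, assuming the log's `character_coverage` corresponds to the `coverage` argument; the input and model names are taken from the log itself.

```python
import youtokentome as yttm

# Roughly the Python equivalent of the logged training parameters.
yttm.BPE.train(
    data="_training_aux",
    model="vocab.bpe",
    vocab_size=24000,
    n_threads=8,
    coverage=1.0,   # logged as character_coverage: 1
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)
```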