YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

42 YouTokenToMe issues

Hi! In the original BPE paper, as well as in the BPE-dropout paper, the authors apply word-based tokenization (namely the Moses tokenizer, among others) before the...
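If that pre-tokenization step is wanted, it can be applied to the corpus before training. A minimal sketch, assuming the `sacremoses` package and placeholder file names; it only illustrates the workflow asked about, not something YouTokenToMe is claimed to do internally:

```python
from sacremoses import MosesTokenizer
import youtokentome as yttm

mt = MosesTokenizer(lang="en")

# Pre-tokenize the corpus with Moses, writing one tokenized line per input line.
with open("corpus.txt") as src, open("corpus.tok.txt", "w") as dst:
    for line in src:
        # return_str=True returns the tokenized line as a single string
        dst.write(mt.tokenize(line.strip(), return_str=True) + "\n")

# Train BPE on the pre-tokenized corpus (paths and vocab_size are placeholders).
yttm.BPE.train(data="corpus.tok.txt", model="corpus.bpe", vocab_size=10000)
```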

Hi! I'm trying to install this module using the following command:
```
pip install youtokentome
```
...but I got an error saying that the specified path does not exist.
```...

Hello, I took a look at the [benchmarks page](https://github.com/VKCOM/YouTokenToMe/blob/master/benchmark.md). I wanted to know how YouTokenToMe's speed compares to [subword-nmt](https://github.com/rsennrich/subword-nmt), and whether there is a reason it was left out of the...

I want to use this yttm model. However, I want to add a [MASK] token to the vocabulary. In this case, how can I predefine special tokens?
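Assuming custom special tokens cannot be registered at training time, one possible workaround is to handle `[MASK]` outside the tokenizer: split the text on the marker, encode the pieces, and splice in a reserved id. A sketch with an assumed model path; the choice of reserved id is an assumption, not library behavior:

```python
import youtokentome as yttm

MASK_TOKEN = "[MASK]"

bpe = yttm.BPE(model="model.bpe")  # placeholder path
# Assumed: use the id just past the BPE vocabulary so it cannot collide with a subword.
MASK_ID = bpe.vocab_size()

def encode_with_mask(text):
    ids = []
    for i, piece in enumerate(text.split(MASK_TOKEN)):
        if i > 0:
            ids.append(MASK_ID)
        if piece:
            ids.extend(bpe.encode([piece], output_type=yttm.OutputType.ID)[0])
    return ids

print(encode_with_mask("the cat sat on the [MASK]"))
```

The downstream model then needs an embedding table with one extra row for the reserved id.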

enhancement

I want to use YouTokenToMe for fast id encoding, but I need to do it with embeddings taken from here: https://nlp.h-its.org/bpemb/. Obviously, there is a pre-defined vocab there. Right...

In `YouTokenToMe`, BPE-dropout is always the same for the same input. That contradicts the idea described in the paper:
```
During segmentation, at each merge step some merges are randomly...
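For reference, BPE-dropout is requested through the `dropout_prob` argument of `encode`; per the paper, repeated calls with a non-zero probability would be expected to yield different segmentations. A minimal sketch with a placeholder model path:

```python
import youtokentome as yttm

bpe = yttm.BPE(model="model.bpe")  # placeholder path

# Encode the same sentence several times with dropout enabled and
# compare the resulting segmentations.
for _ in range(3):
    print(bpe.encode(["unsupervised tokenization"],
                     output_type=yttm.OutputType.SUBWORD,
                     dropout_prob=0.1))
```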

enhancement

I would like to know if there is any possibility to control the splitting of a word into tokens, besides setting up BPE-dropout. For example, "Best" can be tokenized...

How can I train with multiple corpus files? Is it possible without merging the files together?
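If merging turns out to be unavoidable, a minimal sketch of the usual workaround, assuming the trainer accepts a single `data` path and using placeholder file names:

```python
import shutil
import youtokentome as yttm

corpus_files = ["corpus_a.txt", "corpus_b.txt", "corpus_c.txt"]  # placeholders

# Concatenate the corpora into one temporary file and train on that.
with open("merged_corpus.txt", "wb") as merged:
    for path in corpus_files:
        with open(path, "rb") as f:
            shutil.copyfileobj(f, merged)

yttm.BPE.train(data="merged_corpus.txt", model="model.bpe", vocab_size=30000)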

Hi, I am very confused about how the vocab function works. It seems the vocab only reads a model file, which only contains the token id without the mapping (the id...
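For reference, a small sketch of how the vocab maps to ids, assuming a trained `model.bpe`: `vocab()` returns the subwords in id order, so the list index is the token id.

```python
import youtokentome as yttm

bpe = yttm.BPE(model="model.bpe")  # placeholder path

vocab = bpe.vocab()              # list of subword strings, index == token id
print(vocab[:10])

print(bpe.subword_to_id(vocab[5]))  # -> 5, the inverse mapping
print(bpe.id_to_subword(5))         # -> vocab[5]
```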

Any idea why this would happen?
```bash
Training parameters
input: _training_aux
model: vocab.bpe
vocab_size: 24000
n_threads: 8
character_coverage: 1
pad: 0
unk: 1
bos: 2
eos: 3
reading file...
learning...
```
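For reference, a hypothetical reconstruction of the Python call that would print those training parameters; paths and values are taken from the log above:

```python
import youtokentome as yttm

# Note: the log line "character_coverage" corresponds to the `coverage` argument.
yttm.BPE.train(
    data="_training_aux",
    model="vocab.bpe",
    vocab_size=24000,
    n_threads=8,
    coverage=1.0,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)
```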