YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

42 YouTokenToMe issues

Hi, I have a Python app that needs youtokentome indirectly through internal dependencies. I have successfully installed and run it on my own Windows 10 machine. When I wanted to...

I've tried to `pip install youtokentome`, but it fails with the error: Microsoft Visual C++ 14.0 or greater is required. So I tried to install C++ 14.0 or greater and...

I am trying to install Dalle-pytorch with `pip install dalle-pytorch`. This error appears: ``` ERROR: Command errored out with exit status 1: command: 'C:\Users\Yfrite\pyProjects\NeuronV3\venv\Scripts\python.exe' -u -c 'import io, os, sys,...

I'm getting the following error when I try to install youtokentome 1.0.6 on AWS SageMaker. Cython is installed, and it worked on an AWS EC2 instance.

I have a project where YouTokenToMe is one of the dependencies, and I get a `No module named 'Cython'` error when trying to install YouTokenToMe as a dependency. I saw that `setup.cfg`...

Hello! Is it possible to make youtokentome.BPE outputs compatible with the official implementation of bpe_dropout? (https://github.com/rsennrich/subword-nmt#how-implementation-differs-from-sennrich-et-al-2016) The issue is that when building the merge operations table we have the...
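For reference, YouTokenToMe exposes BPE-dropout through the `dropout_prob` argument of `encode`; whether its merge table matches subword-nmt's behaviour is exactly the open question in this issue. A minimal sketch (the model path is a placeholder):

```python
import youtokentome as yttm

# Assumes a previously trained model file at this placeholder path.
bpe = yttm.BPE(model="yt.model")

# dropout_prob > 0 randomly skips merges during encoding (BPE-dropout),
# so the same sentence can yield different segmentations across calls.
print(bpe.encode(["the quick brown fox"], output_type=yttm.OutputType.SUBWORD, dropout_prob=0.1))
print(bpe.encode(["the quick brown fox"], output_type=yttm.OutputType.SUBWORD, dropout_prob=0.1))
```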

I'm trying to merge the dropout functionality into the R package at https://github.com/bnosac/tokenizers.bpe. However, I'm getting the following errors when compiling: ``` C:/Rtools/mingw_32/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-35~1.2/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./youtokentome/cpp -I. -I"C:/Users/Jan/Documents/R/win-library/3.5/Rcpp/include"...

help wanted

I want to train a GPT-2 model with a new vocabulary. I am following the instructions given here: https://github.com/mgrankin/ru_transformers. The YTTM tokenizer outputs a yt.model file that contains the new vocab. However, the...
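For context, a typical YTTM training call that produces such a model file looks roughly like this (the corpus path, model path, and vocabulary size are placeholders; converting the result into a GPT-2 tokenizer is the part the issue asks about and is not shown):

```python
import youtokentome as yttm

# Train a BPE model on a plain-text corpus; the vocabulary and merges
# are written to a single model file (placeholder paths and size).
yttm.BPE.train(data="corpus.txt", model="yt.model", vocab_size=30000)

# Load the trained model and encode text into token ids.
bpe = yttm.BPE(model="yt.model")
print(bpe.encode(["example text"], output_type=yttm.OutputType.ID))
```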

Hi, I get the following error when I run my script: `TypeError: decode() got an unexpected keyword argument 'ignore_ids'` However, I think I have used the `ignore_ids` argument correctly. How...
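For what it's worth, `ignore_ids` is only accepted by `BPE.decode` in releases that ship the feature, so this `TypeError` usually points to an older installed version of youtokentome. Usage looks roughly like this (model path and special-token ids are the library defaults, shown as an example):

```python
import youtokentome as yttm

bpe = yttm.BPE(model="yt.model")  # placeholder model path

# Encode with BOS/EOS, then drop those ids (2 and 3 by default) when decoding.
ids = bpe.encode(["example text"], output_type=yttm.OutputType.ID, bos=True, eos=True)[0]
print(bpe.decode([ids], ignore_ids=[2, 3]))
```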

Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files. Is it possible to read the corpus file line by line or split it in any other...
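As a workaround rather than a library feature, one can stream the large corpus and subsample it line by line into a smaller file before training; a rough sketch (paths and sampling rate are placeholders):

```python
import random
import youtokentome as yttm

# Keep a random ~10% of lines in a smaller temporary file so that
# BPE training sees only a sample of the corpus. YTTM itself still
# reads the resulting file in full.
random.seed(0)
with open("big_corpus.txt", encoding="utf-8") as src, \
     open("sample.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if random.random() < 0.1:
            dst.write(line)

yttm.BPE.train(data="sample.txt", model="yt.model", vocab_size=30000)
```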