YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
Hi, I have a Python app that needs youtokentome indirectly through internal dependencies. I have successfully installed and run it on my own Windows 10 machine. When I wanted to...
I've tried to `pip install youtokentome` but it fails with the error: Microsoft Visual C++ 14.0 or greater is required. So I tried to install C++ 14.0 or greater and...
I am trying to install Dalle-pytorch with `pip install dalle-pytorch`. This error appears: ``` ERROR: Command errored out with exit status 1: command: 'C:\Users\Yfrite\pyProjects\NeuronV3\venv\Scripts\python.exe' -u -c 'import io, os, sys,...
I'm getting the following error when I try to install youtokentome 1.0.6 on AWS SageMaker. Cython is installed, and it worked on an AWS EC2 instance.
I have a project where YouTokenToMe is one of the dependencies and get a `No module named 'Cython'` when trying to install YouTokenToMe as a dependency. I saw that `setup.cfg`...
Hello! Is it possible to make youtokentome.BPE outputs compatible with the official implementation of bpe_dropout? (https://github.com/rsennrich/subword-nmt#how-implementation-differs-from-sennrich-et-al-2016) The matter is that when building the merge operations table we have the...
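For context on what the issue is asking for: BPE-dropout (as described in the subword-nmt reference above) segments a word with the usual ranked merge table, but stochastically skips each candidate merge with some probability, producing varied segmentations of the same word. A minimal pure-Python sketch of that idea, with an illustrative merge table (the function name and table are not YouTokenToMe's API):

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Segment a word with a ranked merge table, skipping each candidate
    merge with probability `dropout` (the BPE-dropout idea)."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair that survives the dropout coin flip.
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank = learned earliest = applied first
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Illustrative merge table: pair -> rank (lower rank merges first).
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode("lower", merges))               # -> ['low', 'er']
print(bpe_encode("lower", merges, dropout=1.0))  # -> ['l', 'o', 'w', 'e', 'r']
```

With `dropout=0.0` this reduces to deterministic BPE; with `dropout=1.0` every merge is dropped and the word falls back to characters.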
I'm trying to merge the dropout functionality in the R package at https://github.com/bnosac/tokenizers.bpe Getting the following errors however when compiling. ``` C:/Rtools/mingw_32/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-35~1.2/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./youtokentome/cpp -I. -I"C:/Users/Jan/Documents/R/win-library/3.5/Rcpp/include"...
I want to train a GPT2 model with new vocabulary. I am following instructions given here: https://github.com/mgrankin/ru_transformers. YTTM tokenizer outputs a yt.model file that has the new vocab. However the...
Hi, I have the following error when I run my script: `TypeError: decode() got an unexpected keyword argument 'ignore_ids'` However, I think I have used well the argument `ignore_ids`. How...
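That `TypeError` typically means the installed youtokentome version predates `ignore_ids` support in `decode()`. Until upgrading, a similar effect can be approximated by filtering the ids before decoding; a minimal sketch, where `decode_fn` stands in for the model's decode method and the helper name is illustrative:

```python
def decode_ignoring(decode_fn, ids, ignore_ids):
    """Drop unwanted ids (e.g. PAD/BOS/EOS) from each sequence before
    calling an older decode() that has no `ignore_ids` keyword."""
    ignore = set(ignore_ids)
    filtered = [[i for i in seq if i not in ignore] for seq in ids]
    return decode_fn(filtered)

# Demo with a dummy decoder that just joins ids as strings:
dummy_decode = lambda batch: [" ".join(map(str, seq)) for seq in batch]
print(decode_ignoring(dummy_decode, [[2, 10, 11, 3]], ignore_ids=[2, 3]))
# -> ['10 11']
```

With a real model this would be called as `decode_ignoring(bpe.decode, ids, ignore_ids=[...])`, assuming `bpe` is a loaded `youtokentome.BPE` instance.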
Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files. Is it possible to read the corpus file line-by-line or split it in any other...
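Until streaming training is supported, one workaround is to shard the corpus on disk and train on a subset (or sample) of shards. A minimal sketch that never holds more than one line in memory; the helper name and shard layout are illustrative, not part of YouTokenToMe:

```python
import os

def split_corpus(path, lines_per_shard, out_dir):
    """Split a large text corpus into fixed-size shards, streaming it
    line by line so memory use stays constant."""
    os.makedirs(out_dir, exist_ok=True)
    shard, n, out = 0, 0, None
    with open(path, encoding="utf-8") as src:
        for line in src:
            if out is None:
                name = os.path.join(out_dir, f"shard_{shard:05d}.txt")
                out = open(name, "w", encoding="utf-8")
            out.write(line)
            n += 1
            if n == lines_per_shard:  # shard full: close it, start a new one
                out.close()
                out, n, shard = None, 0, shard + 1
    if out is not None:  # close a final, partially filled shard
        out.close()
        shard += 1
    return shard  # number of shards written
```

Each shard can then be passed to training individually, or a few shards can be concatenated into a corpus sample that fits in memory.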