Nikita Lindmann
For testing, you may use a smaller model (gpt2-large).
Just prepare these files on another computer. It doesn't need a GPU, just lots of RAM and disk space. Here https://huggingface.co/datasets/openwebtext/blob/main/openwebtext.py you can find a link to https://zenodo.org/record/3834942/files/openwebtext.tar.xz, which is OpenWebText itself....
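For reference, a minimal sketch of that preparation step (assuming you only want to fetch and unpack the Zenodo archive; the output directory name is my own choice):

```
import tarfile
import urllib.request

URL = "https://zenodo.org/record/3834942/files/openwebtext.tar.xz"
ARCHIVE = "openwebtext.tar.xz"

# Download the archive (no GPU needed, just lots of disk space).
urllib.request.urlretrieve(URL, ARCHIVE)

# Unpack it so the contents can be fed to prepare.py.
with tarfile.open(ARCHIVE, "r:xz") as tar:
    tar.extractall("openwebtext")  # hypothetical output directory
```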
https://gist.github.com/ramiil/389faa6798df038d349212b19259f124 — here is my prepare.py. It works with multiple processor cores and can load big datasets (over 10 GB), as long as each single file in the dataset is smaller than your RAM. It's poor...
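Roughly, it follows this idea (a sketch only, not the exact gist; the uint16 output and file layout are assumptions in the spirit of nanoGPT's prepare.py):

```
import os
from multiprocessing import Pool

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def encode_file(path):
    # Each worker loads one file (must fit in RAM) and encodes it to token ids.
    with open(path, "r", encoding="utf-8") as f:
        ids = enc.encode_ordinary(f.read())
    ids.append(enc.eot_token)
    return np.array(ids, dtype=np.uint16)

if __name__ == "__main__":
    files = [os.path.join("data", f) for f in sorted(os.listdir("data"))]
    with Pool(os.cpu_count()) as pool:
        chunks = pool.map(encode_file, files)
    np.concatenate(chunks).tofile("train.bin")
```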
I'm working with a Russian dataset, and it seems like sort of a success. You need: 1. Collect a good dataset. Because GPT-2 from OpenAI doesn't understand any language except English, you need...
@nafeesmahbub I found the main problem with training the model on a non-English dataset. It's because tiktoken can tokenize English text well, but Russian text (for example) will be...
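To see the difference, you can compare token counts yourself (the example sentences are just an illustration):

```
import tiktoken

enc = tiktoken.get_encoding("gpt2")

english = "Hello, how are you today?"
russian = "Привет, как дела сегодня?"

# GPT-2's BPE was trained on English text, so Cyrillic falls back to
# many byte-level tokens per word.
print(len(enc.encode(english)))  # a handful of tokens
print(len(enc.encode(russian)))  # noticeably more tokens for a similar sentence
```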
```
import os
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Train a custom BPE tokenizer on the first max_files files from ./data
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["<|endoftext|>"])

max_files = 3700
files = []
for filename in os.listdir("data"):
    if len(files) >= max_files:
        break
    print('[{0}] Encoding'.format(filename))
    files.append(os.path.join("data", filename))

tokenizer.train(files, trainer)
tokenizer.save("tokenizer.json")
```
In my previous message I found an error: that code ignores spaces. This code is valid (I hope).

```
import os
import unicodedata
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.BPE())
# ByteLevel keeps space information inside the tokens, so decoding restores spaces.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(special_tokens=["<|endoftext|>"])

def read_normalized(path):
    # NFC-normalize so Cyrillic characters compose consistently.
    with open(path, "r", encoding="utf-8") as f:
        return unicodedata.normalize("NFC", f.read())

max_files = 3700
files = [os.path.join("data", f) for f in sorted(os.listdir("data"))[:max_files]]
tokenizer.train_from_iterator((read_normalized(p) for p in files), trainer)
tokenizer.save("tokenizer.json")
```
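A quick round-trip check (my own snippet, assuming the tokenizer was saved to tokenizer.json):

```
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

text = "Привет, как дела?"
ids = tokenizer.encode(text).ids
print(tokenizer.decode(ids))  # should print the original text, spaces included
```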