Nikita Lindmann comments

Results 7 comments of


                                            Nikita Lindmann

finetuning - Out of Memory Error

For test, you may use smaller model (gpt2-large).

Source of downloading train.bin and val.bin

just prepare this files on other computer. It's not need GPU, just lots of ram and disk space. Here https://huggingface.co/datasets/openwebtext/blob/main/openwebtext.py you may find link to https://zenodo.org/record/3834942/files/openwebtext.tar.xz which is openwebtext itself....

Source of downloading train.bin and val.bin

https://gist.github.com/ramiil/389faa6798df038d349212b19259f124 here is my prepare.py, working with multiple processor cores and able to load big datasets(over 10 gb), if single file in dataset less than your ram size. It's poor...

Multi language Support

I'm working with russian dataset, and it seems like sort of success. You need: 1. Collect good dataset. Because GPT-2 from openai dont understand any language except english, you need...

Multi language Support

@nafeesmahbub nafeesmahbub I found the main problem of training model with non-english dataset. It's all because tiktoken can tokenize english text a good way, but russian text(for example) will be...

Multi language Support

``` import os from tokenizers import Tokenizer, models, trainers, pre_tokenizers tokenizer = Tokenizer(models.BPE()) max_files = 3700 for filename in os.listdir("data"): if max_files=ws: # encode with tiktoken gpt2 bpe print('[{0}] Encoding...

Multi language Support

In my previous message I found an error - this code ignores spaces. This code is valid(i hope). ``` import os import unicodedata from tokenizers import Tokenizer, models, pre_tokenizers, decoders,...