nanoGPT
Multi-language Support
Is there a way to train with a Bengali dataset? I want to train a model on 10K lines of conversation. Also, what should I change for the Bengali charset?
I'm working with a Russian dataset, and it has been something of a success. You need to:
- Collect a good dataset. Because OpenAI's GPT-2 doesn't understand any language except English, you need to train a new model from scratch. Prepare at least 1 GB of text in your language, UTF-8 encoded.
- Clean the dataset of any unwanted data such as HTML tags.
- Modify https://github.com/karpathy/nanoGPT/blob/master/data/shakespeare/prepare.py so that it doesn't download text but uses your dataset as input.txt (remove lines 8-11); see the sketch after this list. With a really big dataset you may need a lot of RAM and disk space, and you may also run into problems with np.tofile(). Contact me and I'll send you a modified prepare.py that works with really big datasets.
- Make a copy of config/train_shakespeare_char.py as config/train_XXX.py.
- Train your model with python train.py config/train_XXXX.py. My settings are: --compile=False --eval_interval=50 --eval_iters=20 --log_interval=1 --block_size=128 --batch_size=12 --n_layer=48 --n_head=25 --n_embd=275 --max_iters=750 --lr_decay_iters=2000 --dropout=0.0. Remember, a larger n_embd means more VRAM. My 1660 Super with 6 GB of VRAM can train a model with n_embd=275, which gives 57.42M parameters. That's small, but if you have more VRAM, use a larger embedding size. These settings also make each training iteration slow (60 seconds for me), and because you are training the model from scratch, you need a lot of iterations before it can generate any adequate output. I recommend at least 2000 iterations, which means about 33 hours of training.
- Good luck and have fun.
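Here is a minimal sketch of what the modified prepare.py could look like, assuming your cleaned corpus is saved as input.txt next to the script; the 90/10 split and the uint16 .bin export follow the original Shakespeare prepare.py:

import os
import numpy as np
import tiktoken

# read the local corpus instead of downloading tiny shakespeare
input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()

n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with the GPT-2 BPE (works, but is inefficient for non-English text);
# for a very large corpus this single pass may run out of RAM
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to .bin files as uint16, the format train.py expects
np.array(train_ids, dtype=np.uint16).tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
np.array(val_ids, dtype=np.uint16).tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))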
@nafeesmahbub I found the main problem with training a model on a non-English dataset. It's because tiktoken tokenizes English text well, but Russian text (for example) gets tokenized at roughly one character per token. That makes training much harder for the model. I think if I can find a good tokenizer for non-English texts, it can be way better.
Please send me a sample of texts in your language so I can test my own tokenizer.
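To see the problem yourself, you can compare how many tokens tiktoken's GPT-2 encoding spends on an English sentence versus a non-English one; the exact counts depend on the text, but Cyrillic (or Bengali) text typically comes out close to one token per character:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

english = "Hello, how are you today?"
russian = "Привет, как твои дела сегодня?"

for text in (english, russian):
    ids = enc.encode_ordinary(text)
    print(f"{len(text):3d} chars -> {len(ids):3d} tokens: {text}")
# the non-English line uses far more tokens per character,
# so the model sees a much shorter effective context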
import os
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# train a BPE tokenizer on up to max_files .txt files from the data folder
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

max_files = 3700
for filename in os.listdir("data"):
    if max_files < 0:
        break
    if filename.endswith(".txt"):
        max_files -= 1
        with open(os.path.join("data", filename), "r", encoding="utf-8") as f:
            # train_from_iterator consumes the file line by line
            tokenizer.train_from_iterator(f, trainer=trainer)

tokenizer.save("my_tokenizer.json")
This code will train the tokenizer.
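Once my_tokenizer.json exists, a quick sanity check I'd suggest before encoding the whole dataset is to load it back and round-trip a sentence (the sample string is just an example):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
sample = "Привет, как дела?"
enc = tokenizer.encode(sample)
print(enc.tokens)   # the learned subword pieces
print(enc.ids)      # the integer ids that end up in the .bin files
# note: with the Whitespace pre-tokenizer the decoded text loses the original
# spacing, which is the problem fixed further down in this thread
print(tokenizer.decode(enc.ids))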
# -*- coding: utf-8 -*-
import os
from tokenizers import Tokenizer
import time
import multiprocessing

working_dir = os.path.dirname(os.path.realpath(__file__))
dataset = 'data'
ws = 64*1024*1024  # 64 MB of raw text per encoded chunk

def chunks(arr, size):
    # split the file list into roughly equal chunks, one per worker process
    for i in range(0, len(arr), size):
        yield arr[i:i + size]

def tofile(lst, name):
    # append token ids as 2-byte little-endian integers (uint16)
    with open(name, 'ab') as fh:
        for i in lst:
            fh.write(i.to_bytes(2, 'little'))

def process_files(pid, lst):
    tokenizer = Tokenizer.from_file("my_tokenizer.json")
    data = ''
    for i in lst:
        if not i.endswith('.txt'):
            continue
        print('[{0}] {1}'.format(pid, i))
        with open(os.path.join(working_dir, dataset, i), 'r', encoding="utf8") as f:
            data += f.read()
        if len(data) >= ws:
            # encode with the pretrained BPE tokenizer
            print('[{0}] Encoding {1} mb of data'.format(pid, len(data)//(1024*1024)))
            ids = tokenizer.encode(data).ids
            tofile(ids, dataset+'_'+str(pid)+'.bin')
            data = ''
    # save whatever is left after the last file
    if data:
        print('[{0}] Encoding {1} mb of data'.format(pid, len(data)//(1024*1024)))
        ids = tokenizer.encode(data).ids
        tofile(ids, dataset+'_'+str(pid)+'.bin')

MAX_THREADS = multiprocessing.cpu_count()
threads = []

if __name__ == "__main__":
    nowtime = time.time()
    files = os.listdir(os.path.join(working_dir, dataset))
    for pid, ch in enumerate(chunks(files, max(1, len(files)//MAX_THREADS))):
        print('Running process {0} of {1}'.format(pid, MAX_THREADS))
        threads.append(multiprocessing.Process(target=process_files, args=(pid, ch)))
        threads[pid].start()
    for t in threads:
        t.join()
    print(time.time() - nowtime)
This code will prepare the dataset using the pretrained tokenizer.
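The .bin files written by tofile() hold raw 2-byte little-endian token ids, which is the same uint16 layout nanoGPT's train.py reads with np.memmap. A quick way to inspect a prepared chunk (the file name here just follows the script above):

import numpy as np

# the prepare script writes ids as unsigned 16-bit little-endian integers
data = np.memmap('data_0.bin', dtype=np.uint16, mode='r')
print(f"{len(data):,} tokens")
print(data[:32])  # first few token ids of the chunk

Note that nanoGPT's train.py expects train.bin and val.bin under data/<dataset>/, so you still need to concatenate or rename these per-process chunk files into that layout.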
You also need to change the decoding process in sample.py to use tokenizers instead of tiktoken; see the sketch below.
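A hedged sketch of that change, assuming the tokenizer was saved as my_tokenizer.json next to the script: the stock sample.py builds encode/decode from tiktoken when no meta.pkl is found, and those two callables are the only place that needs replacing.

from tokenizers import Tokenizer

# replace the tiktoken-based encode/decode in sample.py with the trained BPE
tok = Tokenizer.from_file("my_tokenizer.json")
encode = lambda s: tok.encode(s).ids
decode = lambda l: tok.decode(l)

# the rest of sample.py should be able to stay as-is: it only calls
# encode(start) and decode(y[0].tolist())

Also make sure the vocab_size you train the model with matches tok.get_vocab_size(), otherwise sampled ids can fall outside the tokenizer's vocabulary.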
In my previous message I found an error: that code ignores spaces. The code below should be valid (I hope).
import os
import unicodedata
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

last_file = ''
if os.path.isfile('test.json'):
    # resume from a previously saved tokenizer checkpoint
    tokenizer = Tokenizer.from_file("test.json")
    if os.path.isfile('.last'):
        with open('.last', 'r', encoding='utf-8') as fh:
            last_file = fh.read()
else:
    tokenizer = Tokenizer(models.BPE())
    # split on spaces but keep them as their own tokens, so spaces are not lost
    tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=' ', behavior='isolated')
    tokenizer.decoder = decoders.BPEDecoder()

data_dir = 'data'
trainer = trainers.BpeTrainer(
    vocab_size=50000,
    min_frequency=3,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
)

names = [f for f in os.listdir(data_dir) if f.endswith('.txt')]
if last_file != '':
    # skip the files that were already processed before the last interruption
    pos = names.index(last_file)
    names = names[pos:]
lx = len(names)

try:
    for i, fname in enumerate(names):
        last_file = fname
        print(f'{i}/{lx}: {fname}')
        with open(os.path.join(data_dir, fname), 'r', encoding='utf-8') as f:
            tokenizer.train_from_iterator(f, trainer=trainer)
    else:
        # the loop finished without interruption: save the final tokenizer
        tokenizer.save('test.json')
except KeyboardInterrupt:
    print('Saving current model checkpoint')
    tokenizer.save('test.json')
    with open('.last', 'w', encoding='utf-8') as fh:
        fh.write(last_file)
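After the trainer finishes (or you interrupt it and it saves test.json), you can check that spaces now survive the round trip; this is just a sanity check under the assumption that test.json was produced by the script above:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("test.json")
sample = "Привет, как дела?"
ids = tokenizer.encode(sample).ids
restored = tokenizer.decode(ids)
print(repr(sample))
print(repr(restored))  # spaces should be preserved, since they are isolated tokens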