nanoGPT
Multi-language Support
Is there a way to train with a Bengali dataset? I want to train a model on 10K lines of conversation. Also, what should I change for the Bengali charset?
I'm working with a Russian dataset, and it has been something of a success. You need to:
- Collect a good dataset. Because OpenAI's GPT-2 doesn't understand any language except English, you need to train a new model from scratch. Prepare at least 1 GB of text in your language, UTF-8 encoded.
- Clean the dataset of any unwanted data such as HTML tags.
- Modify https://github.com/karpathy/nanoGPT/blob/master/data/shakespeare/prepare.py so that it doesn't download text but uses your dataset as input.txt (remove lines 8-11); see the sketch after this list. With a really big dataset you may need a lot of RAM and disk space, and you may also run into problems with np.tofile(). Contact me and I'll send you a modified prepare.py that works with really big datasets.
- Make a copy of config/train_shakespeare_char.py as config/train_XXX.py.
- Train your model with python train.py config/train_XXXX.py. My settings are: --compile=False --eval_interval=50 --eval_iters=20 --log_interval=1 --block_size=128 --batch_size=12 --n_layer=48 --n_head=25 --n_embd=275 --max_iters=750 --lr_decay_iters=2000 --dropout=0.0. Remember, a larger n_embd means more VRAM. My 1660 Super with 6 GB of VRAM can train a model with n_embd=275, which gives 57.42M parameters. That's small, but if you have more VRAM, use a larger embedding size. These settings also make each training iteration slow (60 seconds for me), and because you are training the model from scratch, you need a lot of iterations before it can generate any adequate output. I recommend at least 2000 iterations, which means about 33 hours of training.
- Good luck and have fun.
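Here is a minimal sketch of what the modified prepare.py could look like, assuming your cleaned corpus is saved as input.txt next to the script; the 90/10 split and the uint16 .bin export follow the original Shakespeare prepare.py:

import os
import numpy as np
import tiktoken

# read the local corpus instead of downloading tiny shakespeare
input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()

n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with the GPT-2 BPE (works, but is inefficient for non-English text);
# for a very large corpus this single pass may run out of RAM
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to .bin files as uint16, the format train.py expects
np.array(train_ids, dtype=np.uint16).tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
np.array(val_ids, dtype=np.uint16).tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))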
@nafeesmahbub I found the main problem with training a model on a non-English dataset. It's because tiktoken tokenizes English text well, but Russian text (for example) gets tokenized at roughly one character per token. That makes training much harder for the model. I think if I can find a good tokenizer for non-English texts, it can be way better.
Please send me a sample of texts in your language so I can test my own tokenizer.
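To see the problem yourself, you can compare how many tokens tiktoken's GPT-2 encoding spends on an English sentence versus a non-English one; the exact counts depend on the text, but Cyrillic (or Bengali) text typically comes out close to one token per character:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

english = "Hello, how are you today?"
russian = "Привет, как твои дела сегодня?"

for text in (english, russian):
    ids = enc.encode_ordinary(text)
    print(f"{len(text):3d} chars -> {len(ids):3d} tokens: {text}")
# the non-English line uses far more tokens per character,
# so the model sees a much shorter effective context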
import os
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# train a BPE tokenizer on up to max_files .txt files from the data folder
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

max_files = 3700
for filename in os.listdir("data"):
    if max_files < 0:
        break
    if filename.endswith(".txt"):
        max_files -= 1
        with open(os.path.join("data", filename), "r", encoding="utf-8") as f:
            # train_from_iterator consumes the file line by line
            tokenizer.train_from_iterator(f, trainer=trainer)

tokenizer.save("my_tokenizer.json")
This code will train the tokenizer.
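Once my_tokenizer.json exists, a quick sanity check I'd suggest before encoding the whole dataset is to load it back and round-trip a sentence (the sample string is just an example):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
sample = "Привет, как дела?"
enc = tokenizer.encode(sample)
print(enc.tokens)   # the learned subword pieces
print(enc.ids)      # the integer ids that end up in the .bin files
# note: with the Whitespace pre-tokenizer the decoded text loses the original
# spacing, which is the problem fixed further down in this thread
print(tokenizer.decode(enc.ids))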
# -*- coding: utf-8 -*-
import os
from tokenizers import Tokenizer
import time
import multiprocessing

working_dir = os.path.dirname(os.path.realpath(__file__))
dataset = 'data'
ws = 64*1024*1024  # 64 MB of raw text per encoded chunk

def chunks(arr, size):
    # split the file list into roughly equal chunks, one per worker process
    for i in range(0, len(arr), size):
        yield arr[i:i + size]

def tofile(lst, name):
    # append token ids as 2-byte little-endian integers (uint16)
    with open(name, 'ab') as fh:
        for i in lst:
            fh.write(i.to_bytes(2, 'little'))

def process_files(pid, lst):
    tokenizer = Tokenizer.from_file("my_tokenizer.json")
    data = ''
    for i in lst:
        if not i.endswith('.txt'):
            continue
        print('[{0}] {1}'.format(pid, i))
        with open(os.path.join(working_dir, dataset, i), 'r', encoding="utf8") as f:
            data += f.read()
        if len(data) >= ws:
            # encode with the pretrained BPE tokenizer
            print('[{0}] Encoding {1} mb of data'.format(pid, len(data)//(1024*1024)))
            ids = tokenizer.encode(data).ids
            tofile(ids, dataset+'_'+str(pid)+'.bin')
            data = ''
    # save whatever is left after the last file
    if data:
        print('[{0}] Encoding {1} mb of data'.format(pid, len(data)//(1024*1024)))
        ids = tokenizer.encode(data).ids
        tofile(ids, dataset+'_'+str(pid)+'.bin')

MAX_THREADS = multiprocessing.cpu_count()
threads = []

if __name__ == "__main__":
    nowtime = time.time()
    files = os.listdir(os.path.join(working_dir, dataset))
    for pid, ch in enumerate(chunks(files, max(1, len(files)//MAX_THREADS))):
        print('Running process {0} of {1}'.format(pid, MAX_THREADS))
        threads.append(multiprocessing.Process(target=process_files, args=(pid, ch)))
        threads[pid].start()
    for t in threads:
        t.join()
    print(time.time() - nowtime)
This code will prepare the dataset using the pretrained tokenizer.
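The .bin files written by tofile() hold raw 2-byte little-endian token ids, which is the same uint16 layout nanoGPT's train.py reads with np.memmap. A quick way to inspect a prepared chunk (the file name here just follows the script above):

import numpy as np

# the prepare script writes ids as unsigned 16-bit little-endian integers
data = np.memmap('data_0.bin', dtype=np.uint16, mode='r')
print(f"{len(data):,} tokens")
print(data[:32])  # first few token ids of the chunk

Note that nanoGPT's train.py expects train.bin and val.bin under data/<dataset>/, so you still need to concatenate or rename these per-process chunk files into that layout.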
You also need to change the decoding process in sample.py to use tokenizers instead of tiktoken; see the sketch below.
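A hedged sketch of that change, assuming the tokenizer was saved as my_tokenizer.json next to the script: the stock sample.py builds encode/decode from tiktoken when no meta.pkl is found, and those two callables are the only place that needs replacing.

from tokenizers import Tokenizer

# replace the tiktoken-based encode/decode in sample.py with the trained BPE
tok = Tokenizer.from_file("my_tokenizer.json")
encode = lambda s: tok.encode(s).ids
decode = lambda l: tok.decode(l)

# the rest of sample.py should be able to stay as-is: it only calls
# encode(start) and decode(y[0].tolist())

Also make sure the vocab_size you train the model with matches tok.get_vocab_size(), otherwise sampled ids can fall outside the tokenizer's vocabulary.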
In my previous message I found an error: that code ignores spaces. The code below should be valid (I hope).
import os
import unicodedata
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

last_file = ''
if os.path.isfile('test.json'):
    # resume from a previously saved tokenizer checkpoint
    tokenizer = Tokenizer.from_file("test.json")
    if os.path.isfile('.last'):
        with open('.last', 'r', encoding='utf-8') as fh:
            last_file = fh.read()
else:
    tokenizer = Tokenizer(models.BPE())
    # split on spaces but keep them as their own tokens, so spaces are not lost
    tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=' ', behavior='isolated')
    tokenizer.decoder = decoders.BPEDecoder()

data_dir = 'data'
trainer = trainers.BpeTrainer(
    vocab_size=50000,
    min_frequency=3,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
)

names = [f for f in os.listdir(data_dir) if f.endswith('.txt')]
if last_file != '':
    # skip the files that were already processed before the last interruption
    pos = names.index(last_file)
    names = names[pos:]
lx = len(names)

try:
    for i, fname in enumerate(names):
        last_file = fname
        print(f'{i}/{lx}: {fname}')
        with open(os.path.join(data_dir, fname), 'r', encoding='utf-8') as f:
            tokenizer.train_from_iterator(f, trainer=trainer)
    else:
        # the loop finished without interruption: save the final tokenizer
        tokenizer.save('test.json')
except KeyboardInterrupt:
    print('Saving current model checkpoint')
    tokenizer.save('test.json')
    with open('.last', 'w', encoding='utf-8') as fh:
        fh.write(last_file)
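After the trainer finishes (or you interrupt it and it saves test.json), you can check that spaces now survive the round trip; this is just a sanity check under the assumption that test.json was produced by the script above:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("test.json")
sample = "Привет, как дела?"
ids = tokenizer.encode(sample).ids
restored = tokenizer.decode(ids)
print(repr(sample))
print(repr(restored))  # spaces should be preserved, since they are isolated tokens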