aitextgen
Train on large text file
Hi, I'm trying to train a model from scratch, as I want it to generate text in another language (Swedish).
My training data is a large collection of novels, about 22,000 of them, all in a single .txt file delimited by lines containing only <s>.
The .txt file is about 300MB in size.
However, both when I try to train from scratch using the Colab notebook (with a P100 GPU) and locally on my desktop, it runs out of memory and crashes.
My desktop has 32GB RAM and a GeForce 2080 Ti with 11GB VRAM.
Is there any way to make aitextgen work with 300MB of training data?
Are there any parameters I can tweak to have it use less memory?
Should I arrange the training data in another way?
The Colab VMs only have ~32GB RAM with Colab Pro, so this is a case where it's better to encode on your own system, then upload the compressed file (and reload it in Colab using from_cache=True).
If possible, use line_by_line=True, or pass the input data as a CSV + line_by_line=True. That path uses batch encoding with a multithreaded algorithm that is substantially faster. I'm not sure if it's more memory efficient, although I assume it is.
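For illustration, a rough sketch of that encode-locally-then-reload workflow; the file names are placeholders and the exact TokenDataset arguments may differ between aitextgen versions:

import os
from aitextgen.TokenDataset import TokenDataset

# On your own machine: encode the large corpus once and cache the result.
# The vocab/merges files are the ones produced by train_tokenizer().
data = TokenDataset(
    "novels.txt",                        # placeholder for the 300MB corpus
    vocab_file="aitextgen-vocab.json",
    merges_file="aitextgen-merges.txt",
    block_size=64,
)
data.save()  # writes a compressed dataset_cache.tar.gz

# In Colab: upload dataset_cache.tar.gz and load it without re-encoding.
if os.path.exists("dataset_cache.tar.gz"):
    data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
# `data` can then be passed straight to ai.train(data, ...).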
I have thought about an approach that uses batch encoding on bulk texts, but there are downsides.
Huh, if the script got that far, that means it encoded the data properly. (You may want to use the TokenDataset separately instead of the train() shortcut to verify.)
Fitting it into memory during training is the issue. I wonder if reducing num_workers to 1 or 2 will help save memory (currently it defaults to double the number of CPU threads).
The real question here is why the training script is opening TensorFlow CUDA libraries when the library doesn't use TensorFlow at all.
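On the TokenDataset point above, a sketch of what that separation could look like; the input file name is a placeholder, `ai` is assumed to be the aitextgen instance already being trained, and argument names may vary slightly by version:

from aitextgen.TokenDataset import TokenDataset

# Encode separately so any failure shows up here rather than inside train().
data = TokenDataset(
    "input.csv",                         # placeholder input file
    vocab_file="aitextgen-vocab.json",
    merges_file="aitextgen-merges.txt",
    line_by_line=True,
)
print(f"{len(data):,} samples encoded")

# Assuming `ai` is the existing aitextgen instance (custom config + tokenizer),
# train with fewer DataLoader workers to reduce host-memory pressure.
ai.train(data, num_workers=1, batch_size=16, num_steps=5000)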
(I had to delete my last post as I used the wrong account.) Yeah, I thought it was very strange to see it load TensorFlow stuff :/ It might be that my environment is tainted; I should probably redo this in a clean virtualenv. But it still makes no sense that these scripts would load TensorFlow libs. Any way I can debug this?
@minimaxir I just checked the logs from the Colab session, and even the Jupyter notebook loads the TensorFlow libs for some reason.
Getting this same issue on a 437KB CSV file with line_by_line set to True:
PS C:\Users\Angela\api> python train_bot.py --path .\dril\
2020-05-31 00:08:35.961471: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
[00:00:01] Reading files 100
[00:00:01] Tokenize words 21282 / 21282
[00:00:00] Count pairs 21282 / 21282
[00:00:00] Compute merges 5743 / 5743
INFO:aitextgen.tokenizers:Saving aitextgen-vocab.json and aitextgen-merges.txt to dril. You will need both files to build the GPT2Tokenizer.
INFO:aitextgen.TokenDataset:17,890 texts loaded.
INFO:aitextgen:Constructing GPT-2 model from provided config.
INFO:aitextgen:Using a custom tokenizer.
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
WARNING:lightning:No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
0%| | 0/6000 [00:00<?, ?it/s]
2020-05-31 00:08:46.327705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
Traceback (most recent call last):
File "train_bot.py", line 68, in <module>
num_workers=1)
File "C:\tools\miniconda3\lib\site-packages\aitextgen\aitextgen.py", line 563, in train
trainer.fit(train_model)
File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 859, in fit
self.single_gpu_train(model)
File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\distrib_parts.py", line 503, in single_gpu_train
self.run_pretrain_routine(model)
File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1015, in run_pretrain_routine
self.train()
File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 347, in train
self.run_training_epoch()
File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 419, in run_training_epoch
_outputs = self.run_training_batch(batch, batch_idx)
File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 597, in run_training_batch
loss, batch_output = optimizer_closure()
File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 561, in optimizer_closure
output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
output = self.model.training_step(*args)
File "C:\tools\miniconda3\lib\site-packages\aitextgen\train.py", line
39, in training_step
outputs = self({"input_ids": batch, "labels": batch})
File "C:\tools\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "C:\tools\miniconda3\lib\site-packages\aitextgen\train.py", line
34, in forward
return self.model(**inputs)
File "C:\tools\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "C:\tools\miniconda3\lib\site-packages\transformers\modeling_gpt2.py", line 624, in forward
shift_logits = lm_logits[..., :-1, :].contiguous()
RuntimeError: CUDA out of memory. Tried to allocate 746.00 MiB (GPU 0; 8.00 GiB total capacity; 3.48 GiB already allocated; 0 bytes free; 17.35 MiB cached)
0%| | 0/6000 [00:05<?, ?it/s]
Called:
ai.train(data,
         line_by_line=True,
         output_dir=path,
         batch_size=256,
         num_steps=8000,
         save_every=2000,
         num_workers=1)
Update: Got it working by setting batch_size=128, num_workers=2, and running:
import torch
torch.cuda.empty_cache()
If you had to empty the cache, it may be unrelated. I guess it won't hurt to add an automatic cache clear before training.
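For reference, that automatic clear would presumably amount to something like the following running before the trainer starts (a sketch, not current library behavior):

import torch

# Hypothetical pre-training cleanup: release any cached CUDA blocks so
# training starts from as clean a GPU-memory state as possible.
if torch.cuda.is_available():
    torch.cuda.empty_cache()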
@minimaxir, so this thread discusses pretty much the same thing. I have a large file, 1.5GB of Arabic text specifically, that I want to train on.
Is there a way this library could handle a file of this size? For example, training on one batch at a time, or splitting the file into chunks and feeding them to the trainer gradually?
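The chunking idea can be done manually outside the library; this is purely a sketch (not an aitextgen feature), with placeholder file names and an arbitrary chunk size:

# Split one huge text file into smaller chunk files so each chunk can be
# encoded (and trained on) separately.
def split_corpus(path, lines_per_chunk=500_000, prefix="chunk"):
    chunk_idx, buffer = 0, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buffer.append(line)
            if len(buffer) >= lines_per_chunk:
                with open(f"{prefix}_{chunk_idx:03d}.txt", "w", encoding="utf-8") as out:
                    out.writelines(buffer)
                chunk_idx += 1
                buffer = []
    if buffer:  # flush the final partial chunk
        with open(f"{prefix}_{chunk_idx:03d}.txt", "w", encoding="utf-8") as out:
            out.writelines(buffer)

split_corpus("arabic_corpus.txt")
# Each chunk_*.txt can then be encoded as its own TokenDataset and trained on in turn.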
@minimaxir Thanks for the wonderful blog article and also this library.
I started using it last evening with a 60MB file and independently worked through the same issues as in this thread, though I'm still learning new things here. Basically, I split out the tokenizer step. I would suggest updating the Colab notebook that way so more users who pick it up don't face the same issue.
I came here to see if there is a plan to add support for taking in a folder containing multiple files.
GPT-2's encode.py can take either a file or a folder containing files, and if an .npz file already exists, the new data is simply appended to it. That would be a good feature.
Thanks again.
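Until folder input is supported, a folder can be flattened into a single training file with a small helper like this (a hypothetical workaround, not part of aitextgen; the <s> delimiter mirrors the one used earlier in the thread):

from pathlib import Path

# Concatenate every .txt file in a directory into one training file.
def merge_folder(folder, output="combined.txt", delimiter="<s>\n"):
    with open(output, "w", encoding="utf-8") as out:
        for path in sorted(Path(folder).glob("*.txt")):
            out.write(path.read_text(encoding="utf-8"))
            out.write("\n" + delimiter)  # separate documents with the delimiter

merge_folder("my_corpus_folder")
# combined.txt can then be passed to TokenDataset / ai.train() as a single file.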
Same here, I had an 80MB file and it threw an OOM error.
What are your max_length and n_embd sizes? The only fix I can suggest without knowing them is to just remove batch_size.
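To illustrate, a reduced-memory configuration could look roughly like this; the values are illustrative rather than tuned, and the aitextgen constructor arguments assume the custom-tokenizer setup used earlier in the thread:

from transformers import GPT2Config
from aitextgen import aitextgen

config = GPT2Config(
    vocab_size=10000,   # should match your trained tokenizer's vocab size
    n_positions=256,    # shorter context -> much smaller attention tensors
    n_ctx=256,          # kept in sync with n_positions for older transformers
    n_embd=256,         # smaller hidden size than GPT-2 small's 768
    n_layer=6,
    n_head=8,
)

ai = aitextgen(
    config=config,
    vocab_file="aitextgen-vocab.json",
    merges_file="aitextgen-merges.txt",
)

# Omit batch_size so it falls back to the library default; gradient
# accumulation can recover a larger effective batch size if needed.
ai.train("dataset.txt", line_by_line=True, num_steps=5000)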