
Train on large textfile

ZerxXxes opened this issue 4 years ago • 11 comments

Hi, I'm trying to train a model from scratch, as I want it to generate text in another language (Swedish). My training data is a large collection of novels, about 22,000 of them, all in one single .txt file delimited by lines containing only a <s>. The txt file is about 300MB in size. However, both when I try to train from scratch using the Colab notebook (with a P100 GPU) and locally on my desktop, it runs out of memory and crashes. My desktop has 32GB RAM and a GeForce 2080Ti with 11GB VRAM.

Is there any way to make aitextgen work with 300MB of training data? Are there any parameters I can tweak to make it use less memory? Should I arrange the training data differently?

ZerxXxes avatar May 19 '20 17:05 ZerxXxes

The Colab VMs have ~32GB RAM only with Colab Pro, so this is a case where it's better to encode on your own system, then upload the compressed file (and reload it in Colab using from_cache=True).

If possible, use line_by_line=True, or pass the input data as a CSV + line_by_line=True. This uses batch encoding with a multithreaded algorithm that is substantially faster. I'm not sure if it's more memory-efficient, although I assume it is.
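
A minimal sketch of that encode-locally-then-reload flow, assuming the TokenDataset parameters documented for this era of the library (the vocab/merges file names are the defaults the tokenizer trainer writes, and novels.txt is a hypothetical corpus path):

    from aitextgen.TokenDataset import TokenDataset

    # On your own machine: encode once and write a compressed cache file.
    data = TokenDataset(
        "novels.txt",
        vocab_file="aitextgen-vocab.json",
        merges_file="aitextgen-merges.txt",
        line_by_line=True,   # multithreaded batch encoding, as noted above
        save_cache=True,     # writes dataset_cache.tar.gz
    )

    # In Colab: upload dataset_cache.tar.gz and reload it without re-encoding.
    data = TokenDataset("dataset_cache.tar.gz", from_cache=True)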

I have thought about an approach that uses batch encoding on bulk texts, but there are downsides.

minimaxir avatar May 19 '20 17:05 minimaxir

Huh, if the script got that far, that means it encoded the data properly. (You may want to use the TokenDataset separately instead of the train() shortcut to verify.)

It's fitting it into memory during training that's the issue. I wonder if reducing num_workers to 1 or 2 will help save memory (currently it's double the number of CPU threads).
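
For example, a sketch of that verification step (assuming ai is an existing aitextgen instance and the vocab/merges files are the ones the tokenizer produced):

    from aitextgen.TokenDataset import TokenDataset

    # Build the dataset separately so any encoding failure surfaces here,
    # before the training loop starts allocating GPU memory.
    data = TokenDataset(
        "novels.txt",  # hypothetical corpus path
        vocab_file="aitextgen-vocab.json",
        merges_file="aitextgen-merges.txt",
    )
    print(len(data))  # sanity check: number of encoded samples

    # Then train with fewer DataLoader workers to reduce RAM pressure.
    ai.train(data, num_workers=1)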

The real question here is why the training script is opening TensorFlow CUDA libraries when the library doesn't use TensorFlow at all.

minimaxir avatar May 19 '20 18:05 minimaxir

(Had to delete my last post as I used the wrong account.) Yeah, I thought it was very strange to see it load TensorFlow stuff :/ It might be my environment that is tainted; I should probably redo this in a clean virtualenv. But it still makes no sense that these scripts should load TensorFlow libs. Any way I can debug this?
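
One quick check, under the assumption that transformers loads TensorFlow whenever it finds it installed (which would explain the CUDA library messages):

    import transformers

    # True means TensorFlow is installed and transformers will import it;
    # uninstalling TF from the clean virtualenv should silence those lines.
    print(transformers.is_tf_available())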

ZerxXxes avatar May 19 '20 18:05 ZerxXxes

@minimaxir I just checked the logs from the Colab session; even the Jupyter notebook loads the TensorFlow libs for some reason.

ZerxXxes avatar May 19 '20 19:05 ZerxXxes

Getting this same issue on a 437KB CSV file with line_by_line set to True:

    PS C:\Users\Angela\api> python train_bot.py --path .\dril\
    2020-05-31 00:08:35.961471: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
    [00:00:01] Reading files        100
    [00:00:01] Tokenize words       21282 / 21282
    [00:00:00] Count pairs          21282 / 21282
    [00:00:00] Compute merges        5743 /  5743
    INFO:aitextgen.tokenizers:Saving aitextgen-vocab.json and aitextgen-merges.txt to dril. You will need both files to build the GPT2Tokenizer.
    INFO:aitextgen.TokenDataset:17,890 texts loaded.
    INFO:aitextgen:Constructing GPT-2 model from provided config.
    INFO:aitextgen:Using a custom tokenizer.
    GPU available: True, used: True
    INFO:lightning:GPU available: True, used: True
    No environment variable for node rank defined. Set as 0.
    WARNING:lightning:No environment variable for node rank defined. Set as 0.
    CUDA_VISIBLE_DEVICES: [0]
    INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
    0%|                                         | 0/6000 [00:00<?, ?it/s]2020-05-31 00:08:46.327705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
    Traceback (most recent call last):
      File "train_bot.py", line 68, in <module>
        num_workers=1)
      File "C:\tools\miniconda3\lib\site-packages\aitextgen\aitextgen.py", line 563, in train
        trainer.fit(train_model)
      File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 859, in fit
        self.single_gpu_train(model)
      File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\distrib_parts.py", line 503, in single_gpu_train
        self.run_pretrain_routine(model)
      File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1015, in run_pretrain_routine
        self.train()
      File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 347, in train
        self.run_training_epoch()
      File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 419, in run_training_epoch
        _outputs = self.run_training_batch(batch, batch_idx)
      File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 597, in run_training_batch
        loss, batch_output = optimizer_closure()
      File "C:\tools\miniconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 561, in optimizer_closure
        output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
        output = self.model.training_step(*args)
      File "C:\tools\miniconda3\lib\site-packages\aitextgen\train.py", line 39, in training_step
        outputs = self({"input_ids": batch, "labels": batch})
      File "C:\tools\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "C:\tools\miniconda3\lib\site-packages\aitextgen\train.py", line 34, in forward
        return self.model(**inputs)
      File "C:\tools\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "C:\tools\miniconda3\lib\site-packages\transformers\modeling_gpt2.py", line 624, in forward
        shift_logits = lm_logits[..., :-1, :].contiguous()
    RuntimeError: CUDA out of memory. Tried to allocate 746.00 MiB (GPU 0; 8.00 GiB total capacity; 3.48 GiB already allocated; 0 bytes free; 17.35 MiB cached)
    0%|          | 0/6000 [00:05<?, ?it/s]

log.txt

Called:

    ai.train(data,
        line_by_line=True,  
        output_dir=path,
        batch_size=256,
        num_steps=8000,
        save_every=2000,
        num_workers=1
        )

zephyo avatar May 31 '20 09:05 zephyo

Update: Got it working by setting batch_size=128, num_workers=2, and running:

    import torch
    torch.cuda.empty_cache()
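
For anyone following along, the adjusted call would look roughly like this (same script as above, with only the batch size and worker count changed):

    import torch
    torch.cuda.empty_cache()  # free cached allocations before training

    ai.train(data,
        line_by_line=True,
        output_dir=path,
        batch_size=128,   # halved from 256 to fit in 8GB of VRAM
        num_steps=8000,
        save_every=2000,
        num_workers=2,
        )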

zephyo avatar May 31 '20 10:05 zephyo

If you had to empty the cache, it may be unrelated. I guess it won't hurt to add an automatic cache clear before training.

minimaxir avatar May 31 '20 15:05 minimaxir

@minimaxir, this thread discusses pretty much the same thing. I have a large file, 1.5GB specifically, of Arabic text that I want to train on.

Is there a way this library could handle a file this size? For example, training on one batch at a time, or splitting the file into chunks and feeding the trainer gradually?
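
Chunked training isn't a built-in feature, but a manual loop can approximate it. A rough sketch with hypothetical names throughout, assuming that repeated ai.train() calls keep fine-tuning the same weights:

    from aitextgen.TokenDataset import TokenDataset

    def split_corpus(path, lines_per_chunk=500_000):
        """Split a huge text file into smaller chunk files; return their paths."""
        chunks, buf = [], []

        def flush():
            name = f"chunk_{len(chunks)}.txt"
            with open(name, "w", encoding="utf-8") as out:
                out.writelines(buf)
            chunks.append(name)
            buf.clear()

        with open(path, encoding="utf-8") as f:
            for line in f:
                buf.append(line)
                if len(buf) >= lines_per_chunk:
                    flush()
        if buf:
            flush()
        return chunks

    # Train on one chunk at a time; each call resumes from the prior weights.
    for chunk in split_corpus("arabic_corpus.txt"):
        data = TokenDataset(chunk,
                            vocab_file="aitextgen-vocab.json",
                            merges_file="aitextgen-merges.txt")
        ai.train(data, num_steps=1000)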

mohataher avatar Jun 12 '20 20:06 mohataher

@minimaxir Thanks for the wonderful blog article and also this library.

I started using it last evening with a 60MB file and independently worked through the same issues as in this thread, though I'm learning new things here. Basically, I split out the tokenizer step. I would suggest updating the Colab notebook that way so more users who pick it up don't face the same issue.

I came here to see if there is a plan to add support for taking in a folder containing multiple files?

GPT-2's encode.py can take in either a file OR a folder containing files, and if an .npz file already exists, the new data is simply appended. That would be a good feature.
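
Until something like that lands, a small wrapper can approximate folder input by concatenating first. A sketch (the helper name and merged-file layout are my own, not aitextgen API):

    from pathlib import Path
    from aitextgen.TokenDataset import TokenDataset

    def dataset_from_folder(folder, merged_name="merged.txt", **kwargs):
        """Concatenate every .txt file in a folder, then encode the result."""
        files = [p for p in sorted(Path(folder).glob("*.txt"))
                 if p.name != merged_name]
        merged = Path(folder) / merged_name
        with open(merged, "w", encoding="utf-8") as out:
            for p in files:
                out.write(p.read_text(encoding="utf-8") + "\n")
        return TokenDataset(str(merged), **kwargs)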

Thanks again.

ravi-annaswamy avatar Jul 09 '20 14:07 ravi-annaswamy

Same here, I had an 80MB file and it threw an OOM error.

annasajkh avatar Oct 22 '21 19:10 annasajkh

Replying to the original question above: what are your max_length and n_embd sizes? Without knowing them, the only fix I can suggest is to remove batch_size.
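
If those are large, shrinking the model config is the other lever. A sketch using build_gpt2_config from aitextgen.utils (the specific numbers are illustrative, not recommendations):

    from aitextgen import aitextgen
    from aitextgen.utils import build_gpt2_config

    # A shorter context window and narrower embeddings cut activation memory.
    config = build_gpt2_config(
        vocab_size=5000,
        max_length=64,   # smaller attention tensors per batch
        n_embd=256,
        n_layer=8,
        n_head=8,
        dropout=0.0,
    )
    ai = aitextgen(config=config,
                   vocab_file="aitextgen-vocab.json",
                   merges_file="aitextgen-merges.txt")
    ai.train("input.txt")  # omitting batch_size falls back to the library default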

breadbrowser avatar Jul 16 '22 15:07 breadbrowser