albert
Training from scratch on TPU
Is it possible to train Albert from scratch in another language using a TPU v3 (128Gb)?
Could you give an estimated training time? Days, weeks, months?
What is a reasonable corpus size? 1B words? Should the seq_length be reduced from the default 512?
Yes, you can. I trained on my own language (~1B words). However, you have to reduce the batch size from 4096 to ~512, with the other configs left at their defaults. The running time is ~3 weeks to reach ~1M steps.
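As a rough sanity check on those numbers (back-of-the-envelope arithmetic, not from the thread), ~1M steps in ~3 weeks works out to roughly half a step per second at the quoted batch size of 512:

```python
# Throughput implied by "~1M steps in ~3 weeks" at batch size 512 (rough arithmetic).
steps = 1_000_000
seconds = 3 * 7 * 24 * 3600                     # ~3 weeks of wall-clock time
batch_size = 512

steps_per_sec = steps / seconds                 # ~0.55 steps/sec
examples_per_sec = steps_per_sec * batch_size   # ~280 examples/sec

print(f"{steps_per_sec:.2f} steps/sec, ~{examples_per_sec:.0f} examples/sec")
```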
Thanks @ngoanpv! Just what I needed to know.
For preparing the dataset, did you build your own custom vocabulary using Sentencepiece? Did you use the default dupe_factor of 40? If so, did that lead to a more than 1Tb training data set?
Yes, I used SentencePiece to build the vocab, and dupe_factor = 10 is ok.
Any experience with training on GPU instead of TPU? I am using Nvidia's V100; for the base model, I can only set train_batch_size=32.
BTW, what learning rate did you use?
Thanks
Just ran some tests. I am not able to get as high a batch size as you mentioned, @ngoanpv.
I am able to train on a batch_size of 392 on the base model, and 144 on the large model.
Does anyone have experience with whether this is too low? And/or whether it would be necessary to reduce the sequence length from the default 512 to get reasonable results? Other alternatives?
@ngoanpv Could you please post the spm_train
command that you used for training the sentencepiece model? Does it work with:
spm_train --input corpus.txt --model_prefix=model_cased --vocab_size=32000 --character_coverage=0.99995 --model_type=unigram --control_symbols="[MASK],[UNK],[CLS],[SEP]"
Special symbols like [MASK] and [UNK] are important, but in what order did you use them?
🤔
Hi @stefan-it. Commenting with my experiences here.
I ended up using model_type=bpe, since the BERT paper mentions that WordPiece is similar to BPE. I can't find native support for building a vocabulary based on WordPiece.
I really don't think the order of the control symbols matters. They are just IDs. I ended up using this config: --unk_piece=[UNK] --pad_piece=[PAD] --user_defined_symbols=[CLS],[SEP],[MASK]
I needed to make the following addition to make SentencePiece run on my entire corpus: --input_sentence_size=10000000 --shuffle_input_sentence=true
The vocab files created by SentencePiece are slightly different from what is supported by ALBERT/BERT. Trimming away the frequencies (cat vocab.txt | cut -d$'\t' -f1 > newvocab.txt) is easy but not really required.
It does, however, use "▁" instead of "##" to mark subwords, which will prevent it from working with the BERT-style tokenizer. A simple conversion will fix this; see the sketch below.
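A minimal sketch of that conversion, under the usual conventions: SentencePiece prefixes word-initial pieces with "▁", while BERT-style WordPiece prefixes continuation pieces with "##", so the marker has to be inverted rather than replaced literally. The file names and special-token list below are illustrative, not from the thread:

```python
# Hypothetical sketch: turn a SentencePiece .vocab file into a BERT-style vocab.txt.
# Assumes "▁" marks word-initial pieces (SentencePiece) and "##" marks continuation
# pieces (WordPiece); special tokens are passed through unchanged.
SPECIAL = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

with open("model_cased.vocab", encoding="utf-8") as fin, \
     open("vocab.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        piece = line.split("\t")[0]      # drop the frequency column
        if piece in SPECIAL:
            token = piece
        elif piece.startswith("▁"):
            token = piece[1:] or "▁"     # word-initial piece: strip the marker
        else:
            token = "##" + piece         # continuation piece: add the WordPiece marker
        fout.write(token + "\n")
```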
An alternative solution here is not to build the vocab from scratch but instead to use the multilingual cased vocab file that comes with the pre-trained BERT weights.
@peregilk I really run with batch size 512.
@stefan-it I run it with the Python module:
import sentencepiece as spm
spm.SentencePieceTrainer.Train('--input=<inputname> --vocab_size=30000 --model_prefix=prefix_name --pad_id=0 --unk_id=1 --pad_piece=<pad> --unk_piece=<unk> --bos_id=-1 --eos_id=-1 --control_symbols=[CLS],[SEP],[MASK],<pad> --user_defined_symbols=(,),",-,.,–,£,€')
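Not from the thread, but a quick way to sanity-check the resulting model, and to answer the earlier question about where the special symbols end up (the file name below is just the one implied by the model_prefix above):

```python
import sentencepiece as spm

# Load the trained model and check which IDs the special pieces were assigned.
sp = spm.SentencePieceProcessor()
sp.Load("prefix_name.model")

for piece in ["<pad>", "<unk>", "[CLS]", "[SEP]", "[MASK]"]:
    print(piece, "->", sp.PieceToId(piece))

# Tokenize a sample sentence into pieces and IDs.
print(sp.EncodeAsPieces("Training ALBERT from scratch on TPU"))
print(sp.EncodeAsIds("Training ALBERT from scratch on TPU"))
```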
Thanks, @ngoanpv. Tried again, and was able to get it above 500 by turning off the dropout-layers.
Thanks for your anwers on vocab generation :heart:
I'll try it again and report back results here :)
If you turn off dropout, you may be able to use a larger batch size.
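For reference, a minimal sketch of one way to do that, assuming the config file uses the same dropout field names as the released albert_config.json (treat the field names as an assumption and check your own config):

```python
import json

# Hypothetical sketch: zero out the dropout fields in an ALBERT config file.
# Field names are assumed to match the released albert_config.json; verify locally.
with open("albert_config.json") as f:
    config = json.load(f)

config["hidden_dropout_prob"] = 0.0
config["attention_probs_dropout_prob"] = 0.0

with open("albert_config_nodropout.json", "w") as f:
    json.dump(config, f, indent=2)
```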
@ngoanpv and @wxp16, I ran a training on a v3-8 TPU and on a Tesla V100 from a DGX-1, using batch sizes 512 and 24 (couldn't fit 32). What amazes me is that for both these settings the training time was the same... it took about 30 hours to train on my Wikipedia corpus, running 125k iterations. The V100 I used has 32GB, which is 1/4 of the 128GB available on the v3-8 TPU. With a batch size 16x greater, shouldn't the TPU training also be around 16x faster? From the logs I can see that the TPU processes around 560 examples/sec, while the V100 logged around 25 examples/sec. The examples/sec seem proportional to the batch size used, but the training time remains the same.
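A possible explanation, working only from the numbers quoted above (my arithmetic, not a reply from the thread): both runs were fixed at 125k iterations, and the step rate (examples/sec divided by batch size) happens to be about 1 step/sec in both cases, so the wall-clock time matches even though the TPU sees ~16x more data per step.

```python
# Worked arithmetic from the numbers above: wall-clock time for a fixed number
# of steps depends on steps/sec, not examples/sec.
iterations = 125_000

for name, examples_per_sec, batch_size in [("v3-8 TPU", 560, 512), ("V100", 25, 24)]:
    steps_per_sec = examples_per_sec / batch_size
    hours = iterations / steps_per_sec / 3600
    print(f"{name}: {steps_per_sec:.2f} steps/sec -> ~{hours:.0f} hours for {iterations} steps")
```

In other words, at the same step count the TPU run has effectively seen ~16x more examples (many more passes over the corpus), which is where the advantage of the larger batch actually shows up.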
@ngoanpv did you lowercase your corpus prior to training SentencePiece?
Here to share some of my experience using an RTX 2080 Ti. I used a learning rate of 1e-4 and a batch size of 12 for the first 150k steps, which took around 30 hours. Update: my best-performing model was at around 260k steps, where I reached a full epoch. The 260k-step Turkish ALBERT I trained outperformed the "stock" mBERT on my intent classification task, though the reason for that is that mBERT was not fine-tuned. I am now training from scratch with a batch size of 55 and sequence length 128.
@ahadsuleymanli Thank you for sharing your experience. Can you specify the size of your data?
When you say TPU v3 (128Gb), do you mean the TPU-v3 with 128 cores?
When training with small batch sizes, are you seeing drop in performance compared to higher batch sizes?
@steindor. Sorry for the confusion. I mean a v3-8 with 128GB of memory. 8 cores.
I have done quite a lot of experiments on this since this post was written. Increasing the batch size is in general very positive for training. My advice is to increase it until you start noticing instability in the MLM accuracy/loss. You will see minor instability long before you get the "collapse" I discussed in this thread. Stop increasing as soon as you see this instability.
However, the AdamW optimizer does not do a good job with large batch sizes, and you will notice this even when training on a v3-8 with the smaller sequence lengths. On the large machines (like the 128-core pods you mention) there is simply no point in going for the maximum batch size with the AdamW optimizer. We had great results using the LAMB optimizer with BERT. This is all based on training BERT; I do not have enough experience with this for ALBERT.
Read the "76 minutes" paper: https://arxiv.org/abs/1904.00962
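For anyone curious what LAMB changes relative to Adam, here is a toy single-layer step based on my reading of the paper linked above (not code from ALBERT or this thread); the key addition is the per-layer trust ratio ||w|| / ||update|| that rescales the Adam-style step:

```python
import numpy as np

# Toy per-layer LAMB step, per my reading of https://arxiv.org/abs/1904.00962.
# Not the ALBERT/BERT implementation; just to show where the trust ratio enters.
def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g              # first moment (as in Adam)
    v = beta2 * v + (1 - beta2) * g * g          # second moment (as in Adam)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0  # layer-wise scaling

    w = w - lr * trust_ratio * update
    return w, m, v
```

The trust ratio keeps the effective step size proportional to each layer's weight norm, which is what lets the learning rate be scaled up with very large batches.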
@peregilk Thank you for the response and the article. Interesting regarding the performance of LAMB vs Adam; I will be training ALBERT with a v3-256 TPU in the coming weeks, so it will be useful to try out both optimizers. Any insights into how very large batch sizes (> 4096) do with LAMB?
Looking forward to hearing how your experiment goes! Best regards,