Language model hyperparameters
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
Hi, hope you're well. I have a dataset of journal abstracts separated by empty lines. I want to use each abstract as a training sample (I read that I should use --sample-break-mode complete_doc for this), and I want to train on 1 GPU for 200k updates with 1024 tokens and 64 gradient-accumulation steps. I also want to use Adam as the optimizer with a peak learning rate of 0.0002 and 20,000 warm-up steps, with the learning rate following an inverse square root decay schedule after reaching the peak.
I'm not sure I chose the right values for each hyperparameter in the command below, so could you please correct it? I also read somewhere that batch size = --max-tokens / --tokens-per-sample, but I'm not sure how to set the batch size when using --sample-break-mode complete_doc. Am I right that --tokens-per-sample no longer matters once --sample-break-mode complete_doc is used?
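For reference, this is the rough batching arithmetic as I currently understand it, using the values from the command below. These are only back-of-the-envelope bounds I assumed, not verified behavior; with --sample-break-mode complete_doc, samples can be shorter than --tokens-per-sample, so the real per-batch counts will vary.

# Back-of-the-envelope batching arithmetic (my assumption, not verified):
#   tokens per GPU batch          <= --max-tokens                        = 2048
#   full-length samples per batch ~= --max-tokens / --tokens-per-sample  = 2048 / 512 = 4
#   tokens per optimizer update   <= --max-tokens * --update-freq        = 2048 * 64  = 131072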
Code
What have you tried?
!fairseq-train --task language_modeling data-bin \
    --save-dir checkpoints/transformer \
    --arch transformer_lm_gpt2_medium --share-decoder-input-output-embed \
    --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
    --lr 0.0002 \
    --lr-scheduler inverse_sqrt --warmup-updates 20000 --warmup-init-lr 1e-07 \
    --sample-break-mode complete_doc --tokens-per-sample 512 \
    --max-tokens 2048 --update-freq 64 --fp16 --bpe fastbpe \
    --max-update 50000 --max-epoch 15
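For clarity, here is a sketch of the same command adjusted to the targets I described above (interpreting "1024 tokens" as --tokens-per-sample 1024, 200k steps as --max-update 200000, and 64 accumulated steps as --update-freq 64). I'm not confident about keeping --max-tokens 2048, and I dropped --max-epoch 15 so the 200k-update budget controls how long training runs; please correct anything that looks wrong.

!fairseq-train --task language_modeling data-bin \
    --save-dir checkpoints/transformer \
    --arch transformer_lm_gpt2_medium --share-decoder-input-output-embed \
    --dropout 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
    --lr 0.0002 --lr-scheduler inverse_sqrt --warmup-updates 20000 --warmup-init-lr 1e-07 \
    --sample-break-mode complete_doc --tokens-per-sample 1024 \
    --max-tokens 2048 --update-freq 64 --fp16 --bpe fastbpe \
    --max-update 200000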
What's your environment?
- fairseq Version (e.g., 1.0 or main):
- PyTorch Version (e.g., 1.0)
- OS (e.g., Linux):
- How you installed fairseq (pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: