DialogBERT

Checkpoints available?

Open · pablogranolabar opened this issue May 01 '21 · 10 comments

Hi, thanks for making your work available.

Are there any checkpoints available for DialogBERT to experiment with, or does it have to be trained from scratch using main.py?

Can you share what hardware configuration including GPU or CPU memory that you used to train -Medium and -Large? I am getting OOM on a K80 even attempting to train -Medium.

And how long did it take to train -Medium and -Large?

Thanks in advance!

pablogranolabar avatar May 01 '21 03:05 pablogranolabar

We apologize that there is no checkpoint available. You have to train it from scratch using main.py.

The -Medium and -Large configurations were shown to overfit on the DailyDialog and MultiWOZ datasets and produced suboptimal performance, so we only used the 'tiny' configuration for these two datasets. We used an Nvidia P40 GPU with 24 GB of memory to train all models. It took 3 days to train the base-sized model on the Weibo dataset and around 5 hours to train the tiny model on the DailyDialog dataset.
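
For reference, a from-scratch run is launched with main.py roughly as below; the flag names are taken from the argument dump that appears later in this thread, and the values are only illustrative, not our exact settings:

$ python main.py --dataset dailydial --model_size tiny --per_gpu_train_batch_size 32 --learning_rate 5e-5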

guxd avatar May 03 '21 02:05 guxd

Hi again. I've been training DialogBERT-Large for about 12 days now and had two days left on the full training run when I accidentally shut down the machine. I backed up the entire directory so as not to overwrite any checkpoints, but when I resumed training it started over at epoch 0 of 169. Does the training code support resuming an interrupted run, or do I have to start all over again?

Also, I noticed in main.py that there is a dataset option, and you mentioned overfitting on the larger models. Is there an optimal set of execution parameters you would recommend for training DialogBERT-Large from scratch?

pablogranolabar avatar May 11 '21 17:05 pablogranolabar

You have to start all over again in this situation. If you want to speed up training, you can reduce the frequency of testing/validation, or start validation/testing only after a certain number of epochs.
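
For example, assuming the --validating_steps and --logging_steps flags that show up in main.py's argument dump later in this thread control how often validation and logging run, raising them should cut that overhead (the values here are only illustrative):

$ python main.py --model_size large --validating_steps 2000 --logging_steps 1000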

There is no optimal set of parameters to recommend. If you train on your own dataset, you need to tune your hyperparameters yourself.
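
As for resuming: supporting it would mean saving the optimizer and scheduler state alongside the model weights and reloading them after an interruption, which main.py does not currently do. A generic PyTorch sketch of that pattern (the helper names are hypothetical and not part of this repo) would look like:

import os
import torch

def save_training_state(path, model, optimizer, scheduler, global_step):
    # Hypothetical helper, not part of DialogBERT's main.py: bundle everything
    # needed to resume into a single file.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "global_step": global_step,
    }, path)

def load_training_state(path, model, optimizer, scheduler):
    # Returns the step to resume from, or 0 if there is nothing to resume.
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["global_step"]

Calling save_training_state every few thousand steps and load_training_state at startup would let a run pick up roughly where it left off.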

guxd avatar May 12 '21 00:05 guxd

So what happens after pretraining? What options would be used to load the saved model checkpoints with this configuration?

Thanks in advance :)

pablogranolabar avatar May 12 '21 00:05 pablogranolabar

I updated the code to include the test script. Run python main.py --do_test --reload_from XXXX where XXXX specifies the iteration number of your optimal checkpoint.

guxd avatar May 12 '21 05:05 guxd

Thank you!

pablogranolabar avatar May 12 '21 20:05 pablogranolabar

So the base-sized model is -Medium? Is there a difference between base and medium? If base took 3 days on a P40, do you have any thoughts on how much total training time the large model will need on a single V100? I've nohupped the training process on an AWS instance, and for some reason tqdm only shows the current epoch in nohup.out instead of the overall progress, so I'm trying to figure out how long training the large model will take with the stock configuration in main.py.
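
I suspect part of the tqdm issue is that it writes its progress bar to stderr with carriage returns, so the redirected nohup.out only ever shows a snapshot; as a general workaround (nothing DialogBERT-specific), I can keep the session alive with tmux or screen, or redirect explicitly and tail the log:

$ nohup python main.py --model_size large > train.log 2>&1 &
$ tail -f train.log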

Thanks in advance as always :)

pablogranolabar avatar May 18 '21 08:05 pablogranolabar

The base-sized model is just the same size as bert-base.

We did not try training a large-sized model, so we cannot give you advice. You will have to train the large model on your dataset yourself.

guxd avatar May 19 '21 14:05 guxd

Ok, I stopped training Large and attempted to restart the training process but I'm getting the following error:

$ python3 main.py --model_size=large --per_gpu_train_batch_size=24 --do_test --reload_from=640
number of gpus: 0
05/23/2021 00:19:52 - WARNING - __main__ -  Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
05/23/2021 00:20:05 - INFO - __main__ -  Training/evaluation parameters Namespace(adam_epsilon=1e-08, data_path='./data/dailydial', dataset='dailydial', device=device(type='cpu'), do_test=True, fp16=False, fp16_opt_level='O1', grad_accum_steps=2, language='english', learning_rate=5e-05, local_rank=-1, logging_steps=200, max_grad_norm=1.0, max_steps=200000, model='DialogBERT', model_size='large', n_epochs=1.0, n_gpu=0, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=24, reload_from=640, save_steps=5000, save_total_limit=100, seed=42, server_ip='', server_port='', validating_steps=20, warmup_steps=5000, weight_decay=0.01)
Traceback (most recent call last):
  File "main.py", line 115, in <module>
    main()
  File "main.py", line 110, in main
    results = solver.evaluate(args)
  File "/home/ubuntu/DialogBERT/solvers.py", line 79, in evaluate
    self.load(args)
  File "/home/ubuntu/DialogBERT/solvers.py", line 53, in load
    assert args.reload_from<=0, "please specify the checkpoint iteration in args.reload_from" 
AssertionError: please specify the checkpoint iteration in args.reload_from

Where do I get the checkpoint iteration number from? I've tried the most recent batch numbers as well as the validation numbers (640 from valid_results640.txt), but it doesn't look like those are the checkpoint iteration number.

Thanks in advance!

pablogranolabar avatar May 23 '21 00:05 pablogranolabar

You should specify --reload_from 640 when you run python main.py.

guxd avatar May 23 '21 14:05 guxd