bigbird
Error in PubMed evaluation using run_summarization.py
I am using the script roberta_base.sh
to train and test the model on the PubMed summarization task. I can successfully train the model for multiple steps (5000), but it fails at evaluation time. Below is part of the error output:
I0416 18:16:41.567906 139788890330944 error_handling.py:115] evaluation_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0416 18:16:41.568143 139788890330944 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
File "bigbird/summarization/run_summarization.py", line 534, in <module>
app.run(main)
...
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2268, in create_tpu_hostcall
'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:0", shape=(), dtype=float32, device=/job:worker/task:0/device:CPU:0)
I am not too familiar with the code or this error, and searching online didn't turn up much. I hope you can help. Below is the command I ran to reproduce the error:
python3 bigbird/summarization/run_summarization.py \
--data_dir="tfds://scientific_papers/pubmed" \
--output_dir=gs://bigbird-replication-bucket/summarization/pubmed \
--attention_type=block_sparse \
--couple_encoder_decoder=True \
--max_encoder_length=3072 \
--max_decoder_length=256 \
--num_attention_heads=12 \
--num_hidden_layers=12 \
--hidden_size=768 \
--intermediate_size=3072 \
--block_size=64 \
--train_batch_size=2 \
--eval_batch_size=4 \
--num_train_steps=1000 \
--do_train=True \
--do_eval=True \
--use_tpu=True \
--tpu_name=bigbird \
--tpu_zone=us-central1-b \
--gcp_project=bigbird-replication \
--num_tpu_cores=8 \
--save_checkpoints_steps=1000 \
--init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0
I am also facing a similar issue on my custom dataset. Evaluation works if use_tpu is set to False and the code is run on a GPU or CPU, but it takes much longer. Any thoughts on how to resolve this?
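For context on the RuntimeError above: the TPUEstimator outfeed requires every tensor returned through host_call or eval_metrics to keep a leading batch dimension, so a scalar (such as a fully reduced loss) trips the "should preserve batch size dimension" check. A common workaround, sketched here as an assumption about where the scalar originates rather than a verified patch to run_summarization.py, is to reshape any scalar to rank 1 before it is outfed:

```python
import tensorflow as tf

# A scalar tensor (shape ()) cannot be outfed from the TPU, because the
# outfeed machinery expects a leading batch dimension on every tensor.
loss = tf.constant(0.25)              # shape: ()

# Reshaping to rank 1 gives it a batch-like leading dimension of 1,
# which satisfies the "preserve batch size dimension" check.
loss_batched = tf.reshape(loss, [1])  # shape: (1,)

print(loss.shape, loss_batched.shape)
```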
Hi @prathameshk, can I ask how you fine-tuned the model on your custom dataset? I was thinking of replacing data_dir
with path_contains_tfrecords
, but I got this error:
(0) Invalid argument: Feature: document (data type: string) is required but could not be found.
[[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
[[Mean/_19475]]
Update: I solved this problem by replacing the name_to_features fields with the actual field names in the TFRecord file.
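For anyone hitting the same "Feature: document ... is required but could not be found" error: the keys in name_to_features must exactly match the feature names serialized in your TFRecords. A minimal sketch of the round trip (the field names "article" and "abstract" are illustrative placeholders, not the script's actual keys):

```python
import tensorflow as tf

# Keys here must match the names stored in the TFRecord exactly;
# "article"/"abstract" are illustrative placeholders.
name_to_features = {
    "article": tf.io.FixedLenFeature([], tf.string),
    "abstract": tf.io.FixedLenFeature([], tf.string),
}

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Build one serialized Example the way the ParseSingleExample node would
# consume it, then parse it back with the matching schema.
example = tf.train.Example(features=tf.train.Features(feature={
    "article": bytes_feature(b"full document text"),
    "abstract": bytes_feature(b"short summary"),
}))

parsed = tf.io.parse_single_example(example.SerializeToString(), name_to_features)
print(parsed["article"].numpy())  # b'full document text'
```

If the schema asks for a key the record does not contain (e.g. "document" when the record stores "article"), parsing fails with exactly the error quoted above.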
If you haven't already, check out the HuggingFace implementation of BigBird; it can be easier to use and to integrate into your project.