question_generation
AssertionErrors
I am doing the fine-tuning and keep seeing the following assertion errors:
Epoch: 0%| | 0/8 [00:00<?, ?it/s]
/opt/conda/lib/python3.6/site-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "./do_training.py", line 34, in <module>
When the device is CPU, the assertion error disappears.
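For reference, a minimal sketch of forcing CPU training via TrainingArguments, assuming the script builds its arguments this way (the output_dir value here is illustrative):

from transformers import TrainingArguments

# no_cuda=True keeps the Trainer on CPU, so the model is never
# wrapped in nn.DataParallel and the failing branch is never reached.
training_args = TrainingArguments(output_dir="out", no_cuda=True)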
Hi @judywxy, what is your transformers version?
It runs fine with version 3.0.0
try pip install -U transformers==3.0.0
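When pinning the version, it is worth verifying which transformers the running interpreter actually picks up, e.g.:

import transformers
print(transformers.__version__)  # should print 3.0.0 after the downgrade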
@patil-suraj Oh, thanks a lot. I have the following installed:

tokenizers    0.8.1rc1
torch         1.6.0+cu101
torchfile     0.1.0
torchvision   0.7.0+cu101
tornado       6.0.3
tqdm          4.32.1
traitlets     4.3.2
transformers  3.0.2
typing        3.6.4
urllib3       1.24.2
visdom        0.1.8.8
wandb         0.9.5
So, you mean change transformers from 3.0.2 to 3.0.0?
Yes, I haven't tried it with 3.0.2 yet.
@patil-suraj Thanks for the prompt reply. I will try with 3.0.0. By the way, I trained a t5-small single-task qg model with transformers' trainer:

08/24/2020 06:24:24 - INFO - qgtrain - ***** Eval results *****
08/24/2020 06:24:24 - INFO - qgtrain - epoch = 9.999269539810081
08/24/2020 06:24:24 - INFO - qgtrain - eval_loss = 1.6273178581207517

It looks nice, as the following three metrics show:

BLEU_4   | METEOR   | ROUGE_L
0.189037 | 0.252798 | 0.406141

Slightly better than the published counterpart model.
So, I want to train a multi-task model like t5-multi and want to change the following config. Besides changing train_file_path, valid_file_path, and output_dir, shall I also change model_name_or_path from t5-small to t5-base? What about tokenizer_name_or_path?
args = {
    "model_name_or_path": "t5-small",
    "model_type": "t5",
    "tokenizer_name_or_path": "t5_qg_tokenizer",
    "output_dir": "../QG_models03/t5-small-qg-hl",
    "train_file_path": "../QG_data/train_data_qg_highlight_qg_format_t5.pt",
    "valid_file_path": "../QG_data/valid_data_qg_highlight_qg_format_t5.pt",
    "qg_format": "highlight_qg_format",
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 24,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-4,
    "num_train_epochs": 12,
    "no_cuda": True,
    "seed": 1,  # default is 42
    "do_train": True,
    "do_eval": True,
    "evaluate_during_training": True,
    "logging_steps": 100,
}
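For what it's worth, a hypothetical sketch of the multi-task variant of this dict; the .pt file names below are placeholders for whatever prepare_data.py produced for the multi-task format, and staying on t5-small vs. moving to t5-base is a capacity/compute trade-off rather than a requirement:

# Hypothetical multi-task config; file and directory names are illustrative.
multi_args = {
    **args,
    "output_dir": "../QG_models03/t5-small-multi",        # new output dir
    "train_file_path": "../QG_data/train_data_multi.pt",  # placeholder name
    "valid_file_path": "../QG_data/valid_data_multi.pt",  # placeholder name
    # model_name_or_path and tokenizer_name_or_path can stay as-is
    # unless a larger model (t5-base) is wanted.
}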
@patil-suraj After changing transformers from 3.0.2 to 3.0.0 and setting "no_cuda": False, the assertion errors appear again!
tokenizers    0.8.0rc4
torch         1.6.0+cu101
torchfile     0.1.0
torchvision   0.7.0+cu101
tornado       6.0.3
tqdm          4.32.1
traitlets     4.3.2
transformers  3.0.0
typing        3.6.4
urllib3       1.24.2
visdom        0.1.8.8
wandb         0.9.5
@patil-suraj
The assertion error is related to the following code in the Trainer class in trainer.py:
# Our model outputs do not work with DataParallel, so forcing return tuple.
if isinstance(model, nn.DataParallel):
    inputs["return_tuple"] = True
Hey @judywxy,
could you check again whether your version is correct? That change was added to Trainer after 3.0.0. You can see the Trainer at v3.0.0 here.
@patil-suraj Here is what is installed; transformers is at version 3.0.0:

tokenizers    0.8.0rc4
torch         1.6.0+cu101
torchfile     0.1.0
torchvision   0.7.0+cu101
tornado       6.0.3
tqdm          4.32.1
traitlets     4.3.2
transformers  3.0.0
typing        3.6.4
urllib3       1.24.2
visdom        0.1.8.8
wandb         0.9.5
Is this issue resolved? I am still facing issues while training.
I ran into the same problem recently. My workaround was to use only one GPU:
CUDA_VISIBLE_DEVICES=0 python run_qg.py \
--model_name_or_path t5-small \
--model_type t5 \
--tokenizer_name_or_path t5_qg_tokenizer \
--output_dir t5-small-qg-hl \
--train_file_path data/train_data_qg_hl_t5.pt \
--valid_file_path data/valid_data_qg_hl_t5.pt \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-4 \
--num_train_epochs 10 \
--seed 42 \
--do_train \
--do_eval \
--evaluate_during_training \
--logging_steps 100
I found that if I set CUDA_VISIBLE_DEVICES=0, i.e. use only one GPU, the code works.
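The same restriction can be applied inside the script itself; a sketch, noting that the environment variable must be set before torch initializes CUDA:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # now reports 1, so Trainer skips the DataParallel wrap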
The error arises from trainer.py:
# Our model outputs do not work with DataParallel, so forcing return tuple.
if isinstance(model, nn.DataParallel):
    inputs["return_tuple"] = True
These lines cause an error when the model forwards the data:
result = self.forward(*input, **kwargs)
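The truncated traceback above does not show the exact assertion, but one plausible failure mode under this version mix is the injected kwarg reaching a forward() that does not declare it. A toy illustration (not the repo's model):

import torch
from torch import nn

class Toy(nn.Module):
    def forward(self, x):  # no return_tuple parameter, unlike newer transformers models
        return x * 2

inputs = {"x": torch.ones(2), "return_tuple": True}  # extra kwarg, as injected by Trainer
try:
    Toy()(**inputs)
except TypeError as e:
    print(e)  # forward() got an unexpected keyword argument 'return_tuple'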