question_generation

AssertionErrors

Open judywxy opened this issue 4 years ago • 12 comments

I am doing the fine-tuning and keep seeing the following assertion errors:

    Epoch: 0%| | 0/8 [00:00<?, ?it/s]
    /opt/conda/lib/python3.6/site-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
      return function(data_struct)
    Traceback (most recent call last):
      File "./do_training.py", line 34, in <module>
        run_qg(args)
      File "/workspace/QG/QG/run_qg.py", line 233, in run_qg
        main(args_file = "args.json")
      File "/workspace/QG/QG/run_qg.py", line 198, in main
        model_path = model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
      File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/workspace/QG/QG/trainer.py", line 36, in _training_step
        outputs = model(**inputs)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
        output.reraise()
      File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    AssertionError: Caught AssertionError in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
        output = module(*input, **kwargs)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_t5.py", line 1094, in forward
        assert kwargs == {}, f"Unexpected keyword arguments: {list(kwargs.keys())}."
    AssertionError: Unexpected keyword arguments: ['return_tuple'].

judywxy avatar Aug 22 '20 01:08 judywxy

When the device is CPU, this assertion error disappears.

judywxy avatar Aug 22 '20 03:08 judywxy

Hi @judywxy, what is your transformers version? It runs fine with version 3.0.0.

try pip install -U transformers==3.0.0
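
A quick way to confirm which version the training process actually imports (a minimal check, nothing specific to this repo):

    import transformers

    # After the downgrade this should print 3.0.0; if it still shows 3.0.2,
    # the process is likely picking up a different environment.
    print(transformers.__version__)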

patil-suraj avatar Aug 24 '20 14:08 patil-suraj

@patil-suraj Oh, Thanks a lot. I have the following installed:

    tokenizers     0.8.1rc1
    torch          1.6.0+cu101
    torchfile      0.1.0
    torchvision    0.7.0+cu101
    tornado        6.0.3
    tqdm           4.32.1
    traitlets      4.3.2
    transformers   3.0.2
    typing         3.6.4
    urllib3        1.24.2
    visdom         0.1.8.8
    wandb          0.9.5

So, you mean change transformers from 3.0.2 to 3.0.0?

judywxy avatar Aug 24 '20 17:08 judywxy

Yes, I haven't tried it with 3.0.2 yet.

patil-suraj avatar Aug 24 '20 17:08 patil-suraj

@patil-suraj Thanks for the prompt reply. I will try with 3.0.0. By the way, I trained a t5-small single-task QG model with transformers' Trainer:

    08/24/2020 06:24:24 - INFO - qgtrain - ***** Eval results *****
    08/24/2020 06:24:24 - INFO - qgtrain - epoch = 9.999269539810081
    08/24/2020 06:24:24 - INFO - qgtrain - eval_loss = 1.6273178581207517

It looks nice, as the following three metrics show:

    BLEU_4   | METEOR   | ROUGE_L
    0.189037 | 0.252798 | 0.406141

Slightly better than the published counterpart model.

So, I want to train a multi-task model like t5-multi and want to change the following config. Besides changing the train_file_path, valid_file_path, and output_dir, shall I also change the model_name_or_path from t5-small to t5-base? What about the tokenizer_name_or_path?

    args = {
        "model_name_or_path": "t5-small",
        "model_type": "t5",
        "tokenizer_name_or_path": "t5_qg_tokenizer",
        "output_dir": "../QG_models03/t5-small-qg-hl",
        "train_file_path": "../QG_data/train_data_qg_highlight_qg_format_t5.pt",
        "valid_file_path": "../QG_data/valid_data_qg_highlight_qg_format_t5.pt",
        "qg_format": "highlight_qg_format",
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 24,
        "gradient_accumulation_steps": 8,
        "learning_rate": 1e-4,
        "num_train_epochs": 12,
        "no_cuda": True,
        "seed": 1,  # Default 42
        "do_train": True,
        "do_eval": True,
        "evaluate_during_training": True,
        "logging_steps": 100
    }
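
Concretely, the fields I expect to change are sketched below; the data file names and output_dir are placeholders for whatever the multi-task preprocessing produces, and whether model_name_or_path should also move to t5-base is the part I'm unsure about:

    # Hypothetical multi-task variant of the config above; the paths below are
    # placeholders, not the repo's actual data files.
    multi_task_args = {
        **args,
        "output_dir": "../QG_models03/t5-small-multi",            # new output dir
        "train_file_path": "../QG_data/train_data_multi_t5.pt",   # placeholder path
        "valid_file_path": "../QG_data/valid_data_multi_t5.pt",   # placeholder path
        # model_name_or_path / tokenizer_name_or_path left unchanged here;
        # moving to t5-base would also mean re-checking the batch sizes for memory.
    }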

judywxy avatar Aug 24 '20 18:08 judywxy

@patil-suraj After changing transformers from 3.0.2 to 3.0.0 and setting "no_cuda": False, the assertion errors appear again!

    tokenizers     0.8.0rc4
    torch          1.6.0+cu101
    torchfile      0.1.0
    torchvision    0.7.0+cu101
    tornado        6.0.3
    tqdm           4.32.1
    traitlets      4.3.2
    transformers   3.0.0
    typing         3.6.4
    urllib3        1.24.2
    visdom         0.1.8.8
    wandb          0.9.5

judywxy avatar Aug 24 '20 21:08 judywxy

@patil-suraj

The assertion error is related to the following code in the Trainer class (trainer.py):

    # Our model outputs do not work with DataParallel, so forcing return tuple.
    if isinstance(model, nn.DataParallel):
        inputs["return_tuple"] = True

judywxy avatar Aug 25 '20 06:08 judywxy

Hey @judywxy, could you check again whether your version is correct, because that change was added to Trainer after 3.0.0.

You can see the Trainer at v3.0.0 here

patil-suraj avatar Aug 25 '20 11:08 patil-suraj

@patil-suraj Here is what is installed; transformers is at version 3.0.0:

    tokenizers     0.8.0rc4
    torch          1.6.0+cu101
    torchfile      0.1.0
    torchvision    0.7.0+cu101
    tornado        6.0.3
    tqdm           4.32.1
    traitlets      4.3.2
    transformers   3.0.0
    typing         3.6.4
    urllib3        1.24.2
    visdom         0.1.8.8
    wandb          0.9.5

judywxy avatar Aug 25 '20 17:08 judywxy

Is this issue resolved? I am still facing issues while training.

varshith321 avatar Feb 08 '21 17:02 varshith321

I ran into the same problem recently. My workaround was to use only one GPU.

CUDA_VISIBLE_DEVICES=0 python run_qg.py \
    --model_name_or_path t5-small \
    --model_type t5 \
    --tokenizer_name_or_path t5_qg_tokenizer \
    --output_dir t5-small-qg-hl \
    --train_file_path data/train_data_qg_hl_t5.pt \
    --valid_file_path data/valid_data_qg_hl_t5.pt \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --num_train_epochs 10 \
    --seed 42 \
    --do_train \
    --do_eval \
    --evaluate_during_training \
    --logging_steps 100

daisylab avatar Feb 13 '21 03:02 daisylab

I found that if I set CUDA_VISIBLE_DEVICES='0', which means using only one GPU, the code works.

The error arises from this code in trainer.py:

    # Our model outputs do not work with DataParallel, so forcing return tuple.
    if isinstance(model, nn.DataParallel):
        inputs["return_tuple"] = True

These lines cause the error when the model forwards the data:

    result = self.forward(*input, **kwargs)
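
If prefixing the launch command is inconvenient, the same single-GPU restriction can be applied at the top of the training script; a minimal sketch (it must run before torch touches CUDA):

    import os

    # Expose only one GPU so Trainer never wraps the model in nn.DataParallel,
    # which is what injects the unsupported "return_tuple" keyword argument.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    import torch
    print(torch.cuda.device_count())  # expected: 1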

yanghoonkim avatar Apr 27 '21 07:04 yanghoonkim