
Seq2seq: failure when evaluating while training

Open Futyn-Maker opened this issue 1 year ago • 8 comments

Describe the bug
When training Seq2Seq models with evaluation enabled, evaluation fails in Google Colab (free plan) with the pandas error ValueError: All arrays must be of the same length.

To Reproduce
Steps to reproduce the behavior:

import torch
from simpletransformers.seq2seq import Seq2SeqModel

# train_df, eval_df, count_matches, accuracy_score and f1_score are defined
# elsewhere in the reporter's script (seq2seq.py).
def main(args):
    model_args = {
        "do_lower_case": True,
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "max_seq_length": max([len(token) for token in train_df["target_text"].tolist()]),
        "train_batch_size": 256,
        "num_train_epochs": 5,
        "save_eval_checkpoints": False,
        "save_model_every_epoch": False,
        "evaluate_during_training": True,
        "evaluate_during_training_verbose": True,
        "use_multiprocessing": False,
        "save_best_model": False,
        "max_length": max([len(token) for token in train_df["input_text"].tolist()]),
        "save_steps": -1,
    }
    model = Seq2SeqModel(
        encoder_decoder_type="bart",
        encoder_decoder_name="facebook/bart-base",
        args=model_args,
        use_cuda=torch.cuda.is_available(),
    )
    model.train_model(train_df, eval_data=eval_df, matches=count_matches, accuracy=accuracy_score, f1=f1_score)

Expected behavior

Training and evaluation complete without failures.

Screenshots
Not applicable.

Desktop (please complete the following information):

  • OS: Windows 11 (but actually running in Google Colab)

Additional context

Here are the reduced logs:

2023-05-06 17:56:43.337391: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading (…)lve/main/config.json: 100% 1.72k/1.72k [00:00<00:00, 8.86MB/s]
Downloading pytorch_model.bin: 100% 558M/558M [00:25<00:00, 21.6MB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 1.29MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 875kB/s]
Downloading (…)/main/tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 1.56MB/s]
100% 79032/79032 [00:21<00:00, 3710.01it/s]
Epoch 1 of 5:   0% 0/5 [00:00<?, ?it/s]
Running Epoch 0 of 5:   0% 0/309 [00:00<?, ?it/s]
Epochs 1/5. Running Loss:   10.1010:   0% 0/309 [00:03<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

Epochs 1/5. Running Loss:   10.1010:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   1% 2/309 [00:03<08:38,  1.69s/it]
...
Epochs 1/5. Running Loss:    0.0396: 100% 309/309 [02:17<00:00,  2.26it/s]
  0% 0/10011 [00:00<?, ?it/s]
  0% 1/10011 [00:31<86:57:41, 31.27s/it] (some strange deadlock here)
100% 10011/10011 [01:14<00:00, 135.21it/s]
Epoch 1 of 5:   0% 0/5 [03:59<?, ?it/s]
Traceback (most recent call last):
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 56, in <module>
    main(args)
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 45, in main
    model.train_model(train_df, eval_data=eval_df, matches=count_matches, accuracy=accuracy_score, f1=f1_score)
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 450, in train_model
    global_step, training_details = self.train(
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 1005, in train
    report = pd.DataFrame(training_progress_scores)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 664, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 666, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
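
For reference, pandas raises this error when the dict passed to pd.DataFrame contains lists of unequal length. A minimal, hypothetical illustration follows; the keys and values are made up and are not the actual contents of training_progress_scores:

import pandas as pd

# Made-up example of the failure mode in the traceback: if one list in the
# dict is shorter than the others (e.g. a metric was not recorded for some
# evaluation step), DataFrame construction raises the same ValueError.
scores = {
    "global_step": [309, 618],
    "train_loss": [0.0396, 0.0210],
    "matches": [123.0],  # one entry short
}
pd.DataFrame(scores)  # ValueError: All arrays must be of the same length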

Futyn-Maker avatar May 07 '23 03:05 Futyn-Maker

The same problem here; it reproduces almost everywhere under the same conditions, on both Ubuntu and Windows 11.

Hey @The-One-Who-Speaks-and-Depicts and @Futyn-Maker, any luck solving this issue? I am facing the same error during model training.

Moustafa-Banbouk avatar Jun 11 '23 11:06 Moustafa-Banbouk

@Moustafa-Banbouk I have been experiencing this for a year or so and have no ideas. I just switched the validation off in args and called it a day.

> I just switched the validation off in args, and called it a day.

Same for me for now; it didn't really interfere with the project I was working on at the time, but I consider it an extremely critical bug.

Futyn-Maker avatar Jun 11 '23 22:06 Futyn-Maker

@The-One-Who-Speaks-and-Depicts @Futyn-Maker @Moustafa-Banbouk Can you try disabling multiprocessing using:

use_multiprocessing = False
use_multiprocessing_for_evaluation = False
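
A minimal sketch of where these flags would go, based on the model_args dict from the original report (the remaining values are illustrative):

from simpletransformers.seq2seq import Seq2SeqModel

model_args = {
    "evaluate_during_training": True,
    # Suggested workaround: disable multiprocessing for both the training
    # and evaluation data preparation.
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
}

model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-base",
    args=model_args,
)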

DamithDR avatar Jun 19 '23 22:06 DamithDR

@DamithDR At least in my case that works; I have created a PR.

@Futyn-Maker @Moustafa-Banbouk /fyi

@The-One-Who-Speaks-and-Depicts Glad that it helped :) About the PR, I think this issue only reproduces on servers with multiple GPUs. The real issue is in the Seq2SeqDataset class, where a pool of processes is started to build the sample list. A proper fix will have to look into this area.
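
As a generic sketch of the pattern being described (not the library's actual code), the sample list is built by mapping a preprocessing function over the data with a worker pool, roughly like this:

from multiprocessing import Pool

def preprocess(example):
    # Placeholder for the per-example work (tokenization etc.).
    return example.lower().split()

data = ["Some input text", "Another example"]

if __name__ == "__main__":
    # A worker pool maps the preprocessing function over the dataset in
    # parallel. Disabling multiprocessing via the flags above makes this
    # step run sequentially, which is the workaround that avoids the hang
    # seen in the logs in this thread.
    with Pool(4) as pool:
        samples = pool.map(preprocess, data)
    print(samples)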

DamithDR avatar Jun 20 '23 12:06 DamithDR

@DamithDR I had this issue on my laptop and on a server where I used only one GPU.