dictionary update sequence element #5 has length 1; 2 is required
System Info
- transformers version: 4.20.0.dev0
- Platform: Linux-5.4.0-66-generic-x86_64-with-glibc2.31
- Python version: 3.9.12
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
transformers/examples/pytorch/language-modeling/run_mlm.py @LysandreJik @sgugger
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I want to pre-train RoBERTa from scratch on my own dataset using transformers/examples/pytorch/language-modeling/run_mlm.py.
- I run the command:
python run_mlm.py \
--model_type roberta \
--tokenizer_name /CodeSearchNet/code_txt/tokenizer \
--config_overrides vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1 \
--train_file /data_for_train_tokenizer/CodeSearchNet/train_codes.txt \
--validation_file /data_for_train_tokenizer/CodeSearchNet/valid_codes.txt \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--num_train_epochs 100 \
--overwrite_output_dir \
--line_by_line \
--save_steps 5000 \
--do_train \
--do_eval \
--output_dir /CodeSearchNet/code_txt/model/pretrain_Roberta_from_scratch/CSN/single_file \
--logging_dir /CodeSearchNet/code_txt/log/pretrain_Roberta_from_scratch_CSN_single_file
There is an error:
07/09/2022 02:00:22 - WARNING - __main__ - You are instantiating a new config instance from scratch.
07/09/2022 02:00:22 - INFO - __main__ - Overriding config: vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1,
Traceback (most recent call last):
File "/transformers/examples/pytorch/language-modeling/run_mlm.py", line 612, in <module>
main()
File "/transformers/examples/pytorch/language-modeling/run_mlm.py", line 359, in main
config.update_from_string(model_args.config_overrides)
File "/transformers/src/transformers/configuration_utils.py", line 850, in update_from_string
d = dict(x.split("=") for x in update_str.split(","))
ValueError: dictionary update sequence element #5 has length 1; 2 is required
How should --config_overrides be set in run_mlm.py?
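Note that the override string logged above ends with a trailing comma. Given the parsing line shown in the traceback, that alone appears to be enough to reproduce this exact error; a minimal sketch using the same expression as configuration_utils.py:

update_str = (
    "vocab_size=52_000,max_position_embeddings=514,"
    "num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1,"  # note the trailing comma
)
# Same expression as in update_from_string: the empty element after the final
# comma splits into [''], which has length 1 instead of the required 2.
d = dict(x.split("=") for x in update_str.split(","))
# ValueError: dictionary update sequence element #5 has length 1; 2 is required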
- When I set --per_device_eval_batch_size to 64, there is an error:
RuntimeError: CUDA out of memory. Tried to allocate 21.48 GiB (GPU 0; 39.59 GiB total capacity; 26.26 GiB already allocated; 11.40 GiB free; 26.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 0/175900 [00:27<?, ?it/s]
There is also load imbalance caused by data parallelism. How do I set up distributed data parallelism with the Trainer?
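Regarding the out-of-memory error above: the allocator hint in the message refers to the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch with an illustrative value (this only mitigates fragmentation; it does not replace lowering the batch size or using --gradient_accumulation_steps):

# Illustrative value, following the hint in the error message; adjust or drop as needed.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512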
Expected behavior
Be able to train RoBERTa from scratch in DDP mode with a large batch size.
Are you sure you pasted the exact command you ran? I have no error when trying it on my side and the config is successfully updated. To use distributed training, just use the pytorch launcher instead of python to run your script, see here.
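A minimal sketch of such a launcher invocation, assuming PyTorch >= 1.10 (which provides torchrun) and 4 GPUs on a single node; the GPU count and batch size are illustrative:

torchrun --nproc_per_node=4 run_mlm.py \
    --model_type roberta \
    --tokenizer_name /CodeSearchNet/code_txt/tokenizer \
    --train_file /data_for_train_tokenizer/CodeSearchNet/train_codes.txt \
    --validation_file /data_for_train_tokenizer/CodeSearchNet/valid_codes.txt \
    --per_device_train_batch_size 8 \
    --line_by_line \
    --do_train --do_eval \
    --output_dir /CodeSearchNet/code_txt/model/pretrain_Roberta_from_scratch/CSN/single_file

Each process then drives a single GPU, so the Trainer wraps the model in DistributedDataParallel rather than DataParallel.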
Yes. I'm sure. Maybe I should change --config_overrides vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1 to --config_overrides "vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1"? In other words, should quotes be added to the config_overrides parameter?
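For what it's worth, that value contains no spaces or characters the shell treats specially, so quoting should not change what run_mlm.py receives; the quoted form would simply be:

--config_overrides "vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1"

The trailing comma visible in the logged override string above looks like the more likely culprit.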
Thanks. I successfully ran distributed training when continuing pre-training, but --per_device_train_batch_size can only be set to a maximum of 8; increasing it to 16 raises a CUDA out of memory error. However, when I use LineByLineTextDataset in the following script:
# Assumes data_dir, model_dir, log_dir, and tokenizer_dir are defined earlier in the script.
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
print(model.num_parameters())

# Each line of the text files becomes one example, truncated to block_size tokens.
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=f"{data_dir}/train_codes.txt",
    block_size=128,
)
test_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=f"{data_dir}/valid_codes.txt",
    block_size=128,
)

# Dynamically masks 15% of tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=model_dir,
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_gpu_train_batch_size=64,  # deprecated alias of per_device_train_batch_size
    save_steps=5000,
    do_eval=True,
    logging_dir=log_dir,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
trainer.save_model(model_dir)
tokenizer.save_pretrained(tokenizer_dir)
Using the same training data, my script can handle a batch size of up to 64 per GPU, while run_mlm.py can only handle 8 per GPU. Why?
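One difference worth checking, assuming run_mlm.py's defaults: when --max_seq_length is not passed, the script falls back to the tokenizer's model_max_length (capped inside the script), which is far larger than the block_size=128 used above, so every batch can contain much longer sequences. A comparison run that matches the 128-token limit might look like:

python run_mlm.py \
    --model_type roberta \
    --tokenizer_name /CodeSearchNet/code_txt/tokenizer \
    --train_file /data_for_train_tokenizer/CodeSearchNet/train_codes.txt \
    --validation_file /data_for_train_tokenizer/CodeSearchNet/valid_codes.txt \
    --line_by_line \
    --max_seq_length 128 \
    --per_device_train_batch_size 64 \
    --do_train \
    --output_dir /CodeSearchNet/code_txt/model/pretrain_Roberta_from_scratch/CSN/single_file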
Can the PyTorch launcher be used to run distributed training with LineByLineTextDataset?
"Using deprecated --per_gpu_train_batch_size argument which will be removed in a future version. Using --per_device_train_batch_size is preferred."
per_device_train_batch_size specifies the batch size to be processed by each GPU, right?
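For reference, per_device_train_batch_size is indeed the batch size that each GPU (each process under DDP) handles per step; the effective global batch size also multiplies in the number of devices and any gradient accumulation. A quick sketch with illustrative numbers:

# Illustrative numbers, not taken from this thread.
per_device_train_batch_size = 8
num_devices = 4                   # one process per GPU under DDP
gradient_accumulation_steps = 2   # TrainingArguments(gradient_accumulation_steps=...)

effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
print(effective_batch_size)  # 64 samples contribute to each optimizer update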
@sgugger I used the LineByLineTextDataset script above to continue pre-training RoBERTa on multiple GPUs on a single machine. The load across the GPUs appears unbalanced.
Is the single-machine multi-GPU training with LineByLineTextDataset implemented with DataParallel? Is there a DistributedDataParallel implementation?
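As a rough sketch of the usual Trainer behavior: launched with plain python on a multi-GPU machine, it falls back to torch.nn.DataParallel (one process driving all visible GPUs, which is typically where the imbalance on GPU 0 comes from); launched with the distributed launcher, each GPU gets its own process and the model is wrapped in DistributedDataParallel. Assuming the script above is saved as pretrain_mlm.py (a hypothetical filename):

# DataParallel: a single process drives all visible GPUs
python pretrain_mlm.py

# DistributedDataParallel: one process per GPU (the GPU count is illustrative)
torchrun --nproc_per_node=4 pretrain_mlm.py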
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.