dictionary update sequence element #5 has length 1; 2 is required
System Info
- transformers version: 4.20.0.dev0
- Platform: Linux-5.4.0-66-generic-x86_64-with-glibc2.31
- Python version: 3.9.12
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
transformers/examples/pytorch/language-modeling/run_mlm.py @LysandreJik @sgugger
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I want to pre-train RoBERTa from scratch on my own dataset using transformers/examples/pytorch/language-modeling/run_mlm.py.
- I run the command:
python run_mlm.py \
--model_type roberta \
--tokenizer_name /CodeSearchNet/code_txt/tokenizer \
--config_overrides vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1 \
--train_file /data_for_train_tokenizer/CodeSearchNet/train_codes.txt \
--validation_file /data_for_train_tokenizer/CodeSearchNet/valid_codes.txt \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--num_train_epochs 100 \
--overwrite_output_dir \
--line_by_line \
--save_steps 5000 \
--do_train \
--do_eval \
--output_dir /CodeSearchNet/code_txt/model/pretrain_Roberta_from_scratch/CSN/single_file \
--logging_dir /CodeSearchNet/code_txt/log/pretrain_Roberta_from_scratch_CSN_single_file
There is an error:
07/09/2022 02:00:22 - WARNING - __main__ - You are instantiating a new config instance from scratch.
07/09/2022 02:00:22 - INFO - __main__ - Overriding config: vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1,
Traceback (most recent call last):
File "/transformers/examples/pytorch/language-modeling/run_mlm.py", line 612, in <module>
main()
File "/transformers/examples/pytorch/language-modeling/run_mlm.py", line 359, in main
config.update_from_string(model_args.config_overrides)
File "/transformers/src/transformers/configuration_utils.py", line 850, in update_from_string
d = dict(x.split("=") for x in update_str.split(","))
ValueError: dictionary update sequence element #5 has length 1; 2 is required
How should --config_overrides be set in run_mlm.py?
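Note that the override string logged above ends with a trailing comma. Given the parsing line shown in the traceback, that alone appears to be enough to reproduce this exact error; a minimal sketch using the same expression as configuration_utils.py:

update_str = (
    "vocab_size=52_000,max_position_embeddings=514,"
    "num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1,"  # note the trailing comma
)
# Same expression as in update_from_string: the empty element after the final
# comma splits into [''], which has length 1 instead of the required 2.
d = dict(x.split("=") for x in update_str.split(","))
# ValueError: dictionary update sequence element #5 has length 1; 2 is required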
- When I set --per_device_eval_batch_size to 64, there is an error:
RuntimeError: CUDA out of memory. Tried to allocate 21.48 GiB (GPU 0; 39.59 GiB total capacity; 26.26 GiB already allocated; 11.40 GiB free; 26.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 0/175900 [00:27<?, ?it/s]
There is also load imbalance caused by data parallelism. How do I set up distributed data parallelism with the Trainer?
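Regarding the out-of-memory error above: the allocator hint in the message refers to the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch with an illustrative value (this only mitigates fragmentation; it does not replace lowering the batch size or using --gradient_accumulation_steps):

# Illustrative value, following the hint in the error message; adjust or drop as needed.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512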
Expected behavior
Be able to train RoBERTa from scratch in DDP mode with a large batch size.
Are you sure you pasted the exact command you ran? I have no error when trying it on my side and the config is successfully updated. To use distributed training, just use the pytorch launcher instead of python to run your script, see here.
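A minimal sketch of such a launcher invocation, assuming PyTorch >= 1.10 (which provides torchrun) and 4 GPUs on a single node; the GPU count and batch size are illustrative:

torchrun --nproc_per_node=4 run_mlm.py \
    --model_type roberta \
    --tokenizer_name /CodeSearchNet/code_txt/tokenizer \
    --train_file /data_for_train_tokenizer/CodeSearchNet/train_codes.txt \
    --validation_file /data_for_train_tokenizer/CodeSearchNet/valid_codes.txt \
    --per_device_train_batch_size 8 \
    --line_by_line \
    --do_train --do_eval \
    --output_dir /CodeSearchNet/code_txt/model/pretrain_Roberta_from_scratch/CSN/single_file

Each process then drives a single GPU, so the Trainer wraps the model in DistributedDataParallel rather than DataParallel.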
Yes. I'm sure. Maybe I should change --config_overrides vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1 to --config_overrides "vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1"? In other words, should quotes be added to the config_overrides parameter?
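For what it's worth, that value contains no spaces or characters the shell treats specially, so quoting should not change what run_mlm.py receives; the quoted form would simply be:

--config_overrides "vocab_size=52_000,max_position_embeddings=514,num_attention_heads=12,num_hidden_layers=12,type_vocab_size=1"

The trailing comma visible in the logged override string above looks like the more likely culprit.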
Thanks. I successfully ran distributed training when continuing pre-training, but --per_device_train_batch_size can only be set to a maximum of 8; increasing it to 16 raises a CUDA out of memory error. However, when I use LineByLineTextDataset in the following script:
# Assumes data_dir, model_dir, log_dir, and tokenizer_dir are defined earlier in the script.
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
print(model.num_parameters())

# Each line of the text files becomes one example, truncated to block_size tokens.
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=f"{data_dir}/train_codes.txt",
    block_size=128,
)
test_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=f"{data_dir}/valid_codes.txt",
    block_size=128,
)

# Dynamically masks 15% of tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=model_dir,
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_gpu_train_batch_size=64,  # deprecated alias of per_device_train_batch_size
    save_steps=5000,
    do_eval=True,
    logging_dir=log_dir,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
trainer.save_model(model_dir)
tokenizer.save_pretrained(tokenizer_dir)
Using the same training data, my script can handle a batch size of up to 64 per GPU, while run_mlm.py can only handle 8 per GPU. Why?
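One difference worth checking, assuming run_mlm.py's defaults: when --max_seq_length is not passed, the script falls back to the tokenizer's model_max_length (capped inside the script), which is far larger than the block_size=128 used above, so every batch can contain much longer sequences. A comparison run that matches the 128-token limit might look like:

python run_mlm.py \
    --model_type roberta \
    --tokenizer_name /CodeSearchNet/code_txt/tokenizer \
    --train_file /data_for_train_tokenizer/CodeSearchNet/train_codes.txt \
    --validation_file /data_for_train_tokenizer/CodeSearchNet/valid_codes.txt \
    --line_by_line \
    --max_seq_length 128 \
    --per_device_train_batch_size 64 \
    --do_train \
    --output_dir /CodeSearchNet/code_txt/model/pretrain_Roberta_from_scratch/CSN/single_file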
Can the PyTorch launcher be used to run distributed training with LineByLineTextDataset?
"Using deprecated --per_gpu_train_batch_size argument which will be removed in a future version. Using --per_device_train_batch_size is preferred."
per_device_train_batch_size specifies the batch size to be processed by each GPU, right?
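For reference, per_device_train_batch_size is indeed the batch size that each GPU (each process under DDP) handles per step; the effective global batch size also multiplies in the number of devices and any gradient accumulation. A quick sketch with illustrative numbers:

# Illustrative numbers, not taken from this thread.
per_device_train_batch_size = 8
num_devices = 4                   # one process per GPU under DDP
gradient_accumulation_steps = 2   # TrainingArguments(gradient_accumulation_steps=...)

effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
print(effective_batch_size)  # 64 samples contribute to each optimizer update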
@sgugger I used the LineByLineTextDataset script above to continue pre-training RoBERTa on multiple GPUs on a single machine. The load across the GPUs appears unbalanced.
Is the single-machine multi-GPU training with LineByLineTextDataset implemented with DataParallel? Is there a DistributedDataParallel implementation?
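As a rough sketch of the usual Trainer behavior: launched with plain python on a multi-GPU machine, it falls back to torch.nn.DataParallel (one process driving all visible GPUs, which is typically where the imbalance on GPU 0 comes from); launched with the distributed launcher, each GPU gets its own process and the model is wrapped in DistributedDataParallel. Assuming the script above is saved as pretrain_mlm.py (a hypothetical filename):

# DataParallel: a single process drives all visible GPUs
python pretrain_mlm.py

# DistributedDataParallel: one process per GPU (the GPU count is illustrative)
torchrun --nproc_per_node=4 pretrain_mlm.py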
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.