[training] Fails to run the HuggingFace example when batch size is 1.
Hello, I tried to run the example in lightseq/examples/training/huggingface. Because I use a gaming PC, I slightly modified the run_ner.sh script (two lines, as follows).
python3 -m torch.distributed.launch \
--nproc_per_node=1 \
$THIS_DIR/run_ner.py \
- --model_name_or_path bert-large-uncased \
- --per_device_train_batch_size 16 \
+ --model_name_or_path bert-base-uncased \
+ --per_device_train_batch_size 1 \
--dataset_name conll2003 \
--output_dir /tmp/test-ner \
--do_train \
The program crashes at this line:
File "/home/user/anaconda3/envs/deepspd/lib/python3.7/site-packages/lightseq/training/ops/pytorch/transformer_encoder_layer.py", line 288, in forward assert bs == encoder_padding_mask.size(0) and sl == encoder_padding_mask.size(1) AssertionError
The software versions I used:
transformers 4.11.0
lightseq 2.1.4
torch 1.7.1+cu110
torchaudio 0.7.2
torchvision 0.8.2+cu110
Cuda compilation tools, release 11.1, V11.1.105
I guess the error comes from setting the batch size to 1. If I set per_device_train_batch_size to 2, it works.
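For reference, here is a small stand-alone sketch of how such a shape mismatch can arise with batch size 1. The squeeze() call and the shapes are my assumption about the failure mode, not LightSeq's actual code; only the assertion mirrors the one in the traceback.

import torch

def forward_shape_check(hidden_states, encoder_padding_mask):
    # Mirrors the failing assertion in transformer_encoder_layer.py:
    # the mask must be (batch_size, seq_len), matching the input.
    bs, sl = hidden_states.size(0), hidden_states.size(1)
    assert bs == encoder_padding_mask.size(0) and sl == encoder_padding_mask.size(1)

# Hypothetical failure mode (an assumption, not LightSeq's code): with
# batch size 1, a squeeze() somewhere upstream can drop the size-1 batch
# dimension from the mask, so its shape no longer matches the input.
hidden = torch.zeros(1, 8, 768)     # (bs=1, sl=8, hidden_size)
mask = torch.zeros(1, 8).squeeze()  # becomes shape (8,) instead of (1, 8)
forward_shape_check(hidden, mask)   # raises AssertionError, as in the report

With batch size 2 the batch dimension survives any such squeeze, which would be consistent with the workaround above.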
Thanks, Jiarui. It looks like an assertion bug for batch size 1; we'll fix it. BTW, is Turbo working on training? I'm looking forward to it.
Haha, thanks for your attention. Turbo will not (perhaps never) support training :). LightSeq did an amazing job on this point. I appreciate your efforts on training acceleration. I tested it on BERT training cases and noticed quite a significant speedup.
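For anyone who wants to reproduce such a comparison, below is a minimal timing sketch. It is my own assumed setup, not the benchmark mentioned above: time forward+backward steps of a stock HF BERT, then run the same loop again with LightSeq's encoder layers swapped in and compare.

import time
import torch
from transformers import BertConfig, BertForTokenClassification

# Stock HF BERT for token classification; random weights are fine for
# timing. Batch/sequence shapes below are illustrative assumptions.
config = BertConfig(num_labels=9)  # 9 = number of CoNLL-2003 NER tags
model = BertForTokenClassification(config).cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

input_ids = torch.randint(0, config.vocab_size, (8, 128), device="cuda")
labels = torch.randint(0, config.num_labels, (8, 128), device="cuda")

# Warm up, then time a few optimizer steps; repeating this loop with
# LightSeq-injected encoder layers gives an estimate of the speedup.
for _ in range(3):
    model(input_ids=input_ids, labels=labels).loss.backward()
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 10 * 1000:.1f} ms/step")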