LTP
question about the max seq length
🖥 Benchmarking transformers
Hi there,
When I run one of the examples in the text-classification folder and pass max_seq_length=1024 to the model, I get the following warning:
WARNING - main - The max_seq_length passed (1024) is larger than the maximum length for the model (512). Using max_seq_length=512.
Set-up
I'm running on a GPU node with the following command.
python ./examples/text-classification/run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --do_train \
  --do_eval \
  --max_seq_length 1024 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --overwrite_output_dir \
  --output_dir /tmp/mrpc/
It still gives me an output, but instead of using max_seq_length=1024 it uses max_seq_length=512.
I'm wondering whether this is because the model is still limited to a maximum of 512 tokens (e.g., for memory reasons), as with most Transformer and BERT-based models, or whether it comes from the default configuration used during pre-training; the quick check below shows where the 512 comes from. Also, in the paper the authors mention two settings, one of which is 1024, so how can I get a pre-trained model with max_seq_length=1024? Thanks!
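For reference, here is a quick check using the standard transformers API; the last line mirrors the kind of clamping the warning suggests run_glue.py is doing, and the values in the comments are what I would expect for bert-base-cased:

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# The tokenizer reports the longest sequence the checkpoint supports.
print(tokenizer.model_max_length)      # 512
# The config shows the size of the learned positional embedding table.
print(config.max_position_embeddings)  # 512

# run_glue.py apparently clamps the requested value in the same way:
max_seq_length = min(1024, tokenizer.model_max_length)  # -> 512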
Hi,
Yes, pre-trained models like BERT and RoBERTa cannot be fine-tuned with sequence lengths longer than the maximum length they were pre-trained on, since that would go beyond the learned positional embeddings. This is why the Hugging Face implementation prevents it and falls back to 512 when you attempt to set max_seq_length above that. The current version of LTP is implemented on top of RoBERTa as the baseline model, so it has the same limitation.
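To make this concrete, here is a minimal sketch with the standard transformers API (roberta-base numbers): the positional embedding table has a fixed size, so a longer input would index positions that do not exist.

import torch
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("roberta-base")
print(config.max_position_embeddings)  # 514 (512 usable positions plus RoBERTa's offset)

model = AutoModel.from_pretrained("roberta-base")
# A 1024-token input would index past the end of the learned embedding table,
# so uncommenting the forward pass should raise an IndexError.
too_long = torch.ones(1, 1024, dtype=torch.long)
# model(too_long)  # IndexError: index out of range in self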
The sequence length of 1024 that you found in the paper (in Section A.2, probably?) was used to demonstrate the effect of long sequence lengths on processing latency, and it did not require a pre-trained checkpoint.
The workaround I would suggest is to find a checkpoint that has been trained with longer sequence lengths (there might be some models specialized in processing long documents) and to extend/migrate the LTP implementation to that model class; see the sketch below.
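As a rough sketch, allenai/longformer-base-4096 is one example of such a long-document checkpoint on the Hub; the values in the comments are what I would expect it to report:

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("allenai/longformer-base-4096")
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

print(config.max_position_embeddings)  # 4098
print(tokenizer.model_max_length)      # 4096

# With such a checkpoint, --max_seq_length 1024 would no longer be clamped,
# but the LTP pruning modules would still need to be ported to this model class.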
Hope this helps answer your question.
Hi Sehoon, thanks for your reply, it's very helpful! I recently came across several works that address the 512-token limitation, e.g., Longformer and BigBird, which reduce the memory requirement by modifying the attention mechanism. Those models are pre-trained with a max_sequence_length of more than 512. I'm wondering whether the LTP implementation can be extended/migrated to those models without additional pre-training.
And another quick question about your token pruning implementation: is it possible to reconstruct the pruned tokens, as demonstrated in Figure 2 of the paper, and make the final pruning result interpretable for the downstream task? A rough sketch of what I mean is below. Thanks!
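For example, something along these lines is what I have in mind; the keep-mask here is a made-up placeholder, since how to extract the actual pruning decisions from LTP is exactly what I'm asking:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
enc = tokenizer("The movie was long but the acting was great.", return_tensors="pt")
input_ids = enc["input_ids"][0]

# Placeholder mask: pretend the pruning layers kept only these positions.
keep_mask = torch.zeros_like(input_ids, dtype=torch.bool)
keep_mask[[0, 2, 6, 9, 11]] = True

# Map the surviving positions back to readable tokens, in their original order.
kept_tokens = tokenizer.convert_ids_to_tokens(input_ids[keep_mask].tolist())
print(kept_tokens)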