optimum
Missing `token_type_ids` when different tokenizer and model
System Info
optimum==1.17.1
Who can help?
@philschmid @michaelbenayoun @JingyaHuang @echarlaix
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Hi, I ran into the missing `token_type_ids` problem described here. The root cause is using a tokenizer that differs from the model architecture: the tokenizer returns only `input_ids` and `attention_mask`, while a BERT model also expects `token_type_ids`. For example, `intfloat/multilingual-e5-small` uses the BertModel architecture but the XLMRobertaTokenizer, so after converting it to ONNX or OpenVINO and running it with optimum we hit the missing `token_type_ids` error.
For example: I fine-tuned `intfloat/multilingual-e5-small` for a token-classification task, and it works fine with the transformers library. But when I try to use `pipeline` with `ORTModelForTokenClassification`, `token_type_ids` is not added to the model input, because the tokenizer returns `attention_mask` but no `token_type_ids`.
The problematic code is here.
But when I add `token_type_ids` manually, everything works. This fix works:

```python
from transformers import XLMRobertaTokenizerFast
from optimum.onnxruntime import ORTModelForTokenClassification

tokenizer = XLMRobertaTokenizerFast.from_pretrained('models/small')
model = ORTModelForTokenClassification.from_pretrained('models/small-onnx/')

inputs = tokenizer('some text', return_tensors='pt')
inputs['token_type_ids'] = inputs['attention_mask']  # manual workaround
model(**inputs)
```
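As a side note, copying the `attention_mask` produces all-ones segment ids, while for a single-segment BERT input the conventional `token_type_ids` value is all zeros. A minimal dependency-free sketch of the zero-valued variant (the helper name is my own, not an optimum API):

```python
def single_segment_token_type_ids(attention_mask):
    """Return all-zero segment ids ("segment A") with the same shape
    as the given batched attention mask."""
    return [[0] * len(row) for row in attention_mask]

# One unpadded sequence of length 4:
mask = [[1, 1, 1, 1]]
print(single_segment_token_type_ids(mask))  # [[0, 0, 0, 0]]
```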
Maybe we should check the inputs of the model and decide whether to add `token_type_ids` based on the model's input signature, not the tokenizer output. Or we could do what the transformers library does (link here): it fills `token_type_ids` with zeros when the tokenizer does not provide them.
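The proposed logic could be sketched like this, using plain Python lists so it runs without optimum installed; the function and variable names are assumptions, not optimum's actual internals:

```python
def fill_missing_model_inputs(encoded, model_input_names):
    """If the model expects token_type_ids but the tokenizer did not
    produce them (e.g. an XLMRobertaTokenizer feeding a BertModel
    export), fill them with zeros, mirroring what transformers does."""
    if "token_type_ids" in model_input_names and "token_type_ids" not in encoded:
        encoded = dict(encoded)  # avoid mutating the caller's dict
        encoded["token_type_ids"] = [[0] * len(ids) for ids in encoded["input_ids"]]
    return encoded

# XLMRoberta-style tokenizer output: no token_type_ids in the batch
batch = {"input_ids": [[0, 1362, 2]], "attention_mask": [[1, 1, 1]]}
filled = fill_missing_model_inputs(
    batch, ["input_ids", "attention_mask", "token_type_ids"]
)
print(filled["token_type_ids"])  # [[0, 0, 0]]
```

The key point is that the decision is keyed on the model's declared input names rather than on whatever the tokenizer happened to return.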
This problem also affects the OpenVINO `OVModelForTokenClassification`.
Expected behavior
The forward pass should behave the same as in the transformers library.