optimum
Missing `token_type_ids` when different tokenizer and model
System Info
optimum==1.17.1
Who can help?
@philschmid @michaelbenayoun @JingyaHuang @echarlaix
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Hi, I ran into the missing `token_type_ids` problem described here. The root cause is using a tokenizer that differs from the model architecture: the tokenizer returns only `input_ids` and `attention_mask`, while a BERT model also expects `token_type_ids`. For example, `intfloat/multilingual-e5-small` uses the BertModel architecture but the XLMRobertaTokenizer, so after converting it to ONNX or OpenVINO and running it with optimum we hit the missing `token_type_ids` error.
For example: I fine-tuned `intfloat/multilingual-e5-small` for a token-classification task, and it works fine with the transformers library. But when I try to use `pipeline` with `ORTModelForTokenClassification`, `token_type_ids` is not added to the model input, because the tokenizer returns `attention_mask` but no `token_type_ids`.
The problematic code is here.
But when I add `token_type_ids` manually, everything works. This fix works:

```python
from transformers import XLMRobertaTokenizerFast
from optimum.onnxruntime import ORTModelForTokenClassification

tokenizer = XLMRobertaTokenizerFast.from_pretrained('models/small')
model = ORTModelForTokenClassification.from_pretrained('models/small-onnx/')

inputs = tokenizer('some text', return_tensors='pt')
inputs['token_type_ids'] = inputs['attention_mask']  # manual workaround
model(**inputs)
```
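As a side note, copying the `attention_mask` produces all-ones segment ids, while for a single-segment BERT input the conventional `token_type_ids` value is all zeros. A minimal dependency-free sketch of the zero-valued variant (the helper name is my own, not an optimum API):

```python
def single_segment_token_type_ids(attention_mask):
    """Return all-zero segment ids ("segment A") with the same shape
    as the given batched attention mask."""
    return [[0] * len(row) for row in attention_mask]

# One unpadded sequence of length 4:
mask = [[1, 1, 1, 1]]
print(single_segment_token_type_ids(mask))  # [[0, 0, 0, 0]]
```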
Maybe we should check the inputs of the model and decide whether to add `token_type_ids` based on the model's input signature, not the tokenizer output. Or we could do what the transformers library does (link here): it fills `token_type_ids` with zeros when the tokenizer does not provide them.
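The proposed logic could be sketched like this, using plain Python lists so it runs without optimum installed; the function and variable names are assumptions, not optimum's actual internals:

```python
def fill_missing_model_inputs(encoded, model_input_names):
    """If the model expects token_type_ids but the tokenizer did not
    produce them (e.g. an XLMRobertaTokenizer feeding a BertModel
    export), fill them with zeros, mirroring what transformers does."""
    if "token_type_ids" in model_input_names and "token_type_ids" not in encoded:
        encoded = dict(encoded)  # avoid mutating the caller's dict
        encoded["token_type_ids"] = [[0] * len(ids) for ids in encoded["input_ids"]]
    return encoded

# XLMRoberta-style tokenizer output: no token_type_ids in the batch
batch = {"input_ids": [[0, 1362, 2]], "attention_mask": [[1, 1, 1]]}
filled = fill_missing_model_inputs(
    batch, ["input_ids", "attention_mask", "token_type_ids"]
)
print(filled["token_type_ids"])  # [[0, 0, 0]]
```

The key point is that the decision is keyed on the model's declared input names rather than on whatever the tokenizer happened to return.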
This problem also affects the OpenVINO `OVModelForTokenClassification`.
Expected behavior
The forward pass should behave the same as in the transformers library.