
{"mask"} automatically assigns <extra_id_0>, which conflicts with the task of masked filling

Open • ChristLBUPT opened this issue 2 years ago • 1 comment

I want to do prompt tuning for a T5 model on a masked-filling task, where the inputs look like this:

test_dataset = [
    InputExample(text_a="The quick <extra_id_0> fox <extra_id_1> over the lazy dog", tgt_text="<extra_id_0> brown <extra_id_1> jumps <extra_id_2>"),
    InputExample(text_a="The Capital city of China is <extra_id_0>, which has a <extra_id_1> of 20 million", tgt_text="<extra_id_0> Beijing <extra_id_1> population <extra_id_2>")
]

If I use a template similar to the one in 2.1_conditional_generation.py, that is:

template = ManualTemplate(t5tokenizer, '{"placeholder": "text_a"} {"special": "<eos>"} {"mask"}')

it automatically inserts an <extra_id_0> at the position of {"mask"} and separates the original sentence from the target sentence with the special token </s>. This results in a duplicate <extra_id_0> in the input sentence, as follows:

The Capital city of China is<extra_id_0>, which has a<extra_id_1> of 20 million <extra_id_0>

I know I could manually add 1 to each extra_id in my dataset, but is it possible to use ONLY the source sentence as input and avoid the automatically added extra ids?
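For reference, the manual workaround could look roughly like the sketch below. It only renumbers the sentinels in a string; `shift_sentinels` is a hypothetical helper name, not part of OpenPrompt, and whether T5 trains well with `<extra_id_0>` appearing after the higher-numbered sentinels is not something this sketch addresses.

```python
import re

# Matches T5 sentinel tokens such as <extra_id_0>, <extra_id_17>, ...
SENTINEL = re.compile(r"<extra_id_(\d+)>")


def shift_sentinels(text: str, offset: int = 1) -> str:
    """Return `text` with every <extra_id_N> renumbered to <extra_id_{N + offset}>.

    Hypothetical helper for illustration; not an OpenPrompt API.
    """
    return SENTINEL.sub(
        lambda m: f"<extra_id_{int(m.group(1)) + offset}>", text
    )


src = "The quick <extra_id_0> fox <extra_id_1> over the lazy dog"
tgt = "<extra_id_0> brown <extra_id_1> jumps <extra_id_2>"

# Free up <extra_id_0> so the one inserted by {"mask"} no longer collides.
shifted_src = shift_sentinels(src)
shifted_tgt = shift_sentinels(tgt)
```

The shifted strings would then be passed as text_a and tgt_text in place of the originals.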

ChristLBUPT, Jan 16 '23 07:01

I am not sure I understand your question correctly. It seems that you are not using {"mask"}, and therefore not using a verbalizer at all. In that case, you should probably just use the transformers library directly, without wrapping the model with PromptModel in openprompt.
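A minimal sketch of what that could look like, assuming the `t5-small` checkpoint and a PyTorch install. `fill_masks` and `parse_sentinel_output` are hypothetical helper names for illustration; only the `transformers` calls are library APIs.

```python
import re


def parse_sentinel_output(text: str) -> dict[int, str]:
    """Map each sentinel index to the text that follows it, skipping empty spans."""
    # re.split with one capture group yields [prefix, id, span, id, span, ...]
    parts = re.split(r"<extra_id_(\d+)>", text)
    return {
        int(idx): span.strip()
        for idx, span in zip(parts[1::2], parts[2::2])
        if span.strip()
    }


def fill_masks(source: str, model_name: str = "t5-small") -> dict[int, str]:
    """Generate fillers for the sentinels already present in `source`.

    The checkpoint name is an assumption; substitute your own model.
    """
    # Imported lazily so the parsing helper works without transformers installed.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # The source sentence is used as-is: no template, no extra sentinel appended.
    input_ids = tokenizer(source, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=20)

    # Keep special tokens so the sentinels survive decoding, then drop pad/eos.
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=False)
    decoded = decoded.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")
    return parse_sentinel_output(decoded)
```

For training rather than inference, the same tokenizer can encode the target string, and passing the resulting ids as `labels` to the model's forward call returns a loss to optimize.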

yulinchen99, Mar 30 '23 07:03