{"mask"} automatically assigns <extra_id_0>, which conflicts with the task of masked filling

Open · ChristLBUPT opened this issue 2 years ago · 1 comment

I want to do prompt tuning for a masked-fill-based T5 model, whose inputs look like this:

from openprompt.data_utils import InputExample

test_dataset = [
    InputExample(text_a="The quick <extra_id_0> fox <extra_id_1> over the lazy dog", tgt_text="<extra_id_0> brown <extra_id_1> jumps <extra_id_2>"),
    InputExample(text_a="The Capital city of China is <extra_id_0>, which has a <extra_id_1> of 20 million", tgt_text="<extra_id_0> Beijing <extra_id_1> population <extra_id_2>")
]

If I use a template similar to the one in 2.1_conditional_generation.py, that is:

from openprompt.prompts import ManualTemplate
template = ManualTemplate(t5tokenizer, '{"placeholder": "text_a"} {"special": "<eos>"} {"mask"}')

it automatically inserts an <extra_id_0> at the position of {"mask"} and separates the source sentence from the target sentence with the special token </s>. This results in duplicate <extra_id_0>s in the input sentence, as follows:

The Capital city of China is<extra_id_0>, which has a<extra_id_1> of 20 million <extra_id_0>

I know it is possible to manually add 1 to each extra_id in my dataset, but is it possible to ONLY use the source sentence as input and avoid automatically adding extra ids?
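For reference, the manual workaround mentioned above (adding 1 to each extra_id) could look roughly like the sketch below. The helper name shift_sentinels is hypothetical and not part of openprompt or transformers. It only renumbers the sentinels so that the <extra_id_0> added by {"mask"} no longer collides with one already in the data, and it assumes the shifted indices stay within the tokenizer's sentinel range.

```python
import re

# Matches T5 sentinel tokens such as <extra_id_0>, <extra_id_12>, ...
SENTINEL = re.compile(r"<extra_id_(\d+)>")


def shift_sentinels(text, offset=1):
    """Renumber every <extra_id_N> in `text` to <extra_id_{N + offset}>."""
    return SENTINEL.sub(
        lambda m: f"<extra_id_{int(m.group(1)) + offset}>", text
    )


source = "The quick <extra_id_0> fox <extra_id_1> over the lazy dog"
target = "<extra_id_0> brown <extra_id_1> jumps <extra_id_2>"

shifted_source = shift_sentinels(source)
shifted_target = shift_sentinels(target)
```

Both text_a and tgt_text would need the same offset so that the source and target sentinels still correspond to each other.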

Jan 16 '23 07:01 ChristLBUPT

I am not sure I understand your question correctly. It seems that you are not using {"mask"}, and therefore not using a verbalizer at all. In that case you should probably just use the transformers library directly, without wrapping the model in PromptModel from openprompt.
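A minimal sketch of what using transformers directly might look like is given below. It assumes the "t5-small" checkpoint, with PyTorch and transformers installed. The helper names sentinel_ids, sentinels_line_up and span_filling_loss are hypothetical, chosen for illustration. The source sentence is passed as the encoder input and the target as labels, so no template and no {"mask"} are involved and no extra sentinel is appended.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def sentinel_ids(text):
    """Return the sentinel indices in the order they appear in `text`."""
    return [int(m.group(1)) for m in SENTINEL.finditer(text)]


def sentinels_line_up(source, target):
    """True if the target opens its spans with the source's sentinels, in order."""
    src = sentinel_ids(source)
    return sentinel_ids(target)[: len(src)] == src


def span_filling_loss(source, target, model_name="t5-small"):
    """Compute the LM loss for one (source, target) pair with plain transformers."""
    # Imported lazily so the pure-Python helpers above work without transformers.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    # With `labels` given, the model derives the decoder inputs itself
    # and returns the cross-entropy loss over the target tokens.
    return model(**inputs, labels=labels).loss.item()
```

Calling span_filling_loss(text_a, tgt_text) on one of the examples from the question would then return the loss for that pair. The sentinels_line_up check is just a cheap sanity test to run on the dataset beforehand.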

Mar 30 '23 07:03 yulinchen99