{"mask"} automatically assigns <extra_id_0>, which conflicts with the task of masked filling
I want to do prompt tuning for a masked-fill-based T5 model, whose input has the following format:
test_dataset = [
InputExample(text_a="The quick <extra_id_0> fox <extra_id_1> over the lazy dog", tgt_text="<extra_id_0> brown <extra_id_1> jumps <extra_id_2>"),
InputExample(text_a="The Capital city of China is <extra_id_0>, which has a <extra_id_1> of 20 million", tgt_text="<extra_id_0> Beijing <extra_id_1> population <extra_id_2>")
]
If I use a template similar to the one given in 2.1_conditional_generation.py, that is:
template = ManualTemplate(t5tokenizer, '{"placeholder": "text_a"} {"special": "<eos>"} {"mask"}')
it will automatically assign an <extra_id_0> at the position of {"mask"} and split the source sentence from the target sentence with the special token </s>. This results in duplicate <extra_id_0>s in the input sentence, as follows:
The Capital city of China is<extra_id_0>, which has a<extra_id_1> of 20 million <extra_id_0>
I know it is possible to manually increment each extra_id in my dataset by 1, but is it possible to use ONLY the source sentence as input and avoid the automatically added extra ids?
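For reference, the renumbering workaround mentioned above (shifting every sentinel id by 1 so the template's automatically assigned <extra_id_0> no longer collides) can be sketched with a small helper; `shift_extra_ids` is a hypothetical name, not part of OpenPrompt:

```python
import re

def shift_extra_ids(text, offset=1):
    """Increment every <extra_id_N> sentinel in `text` by `offset`."""
    return re.sub(
        r"<extra_id_(\d+)>",
        lambda m: f"<extra_id_{int(m.group(1)) + offset}>",
        text,
    )

src = "The Capital city of China is <extra_id_0>, which has a <extra_id_1> of 20 million"
print(shift_extra_ids(src))
# The Capital city of China is <extra_id_1>, which has a <extra_id_2> of 20 million
```

The same function would be applied to tgt_text so that source and target sentinels stay aligned.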
I am not sure I understand your question correctly. It seems that you are not using {"mask"}, and therefore not using a verbalizer at all. In that case you should probably just use the transformers library directly, without wrapping the model in OpenPrompt's PromptModel.