
About PlaceHolder Token <PH>

Open xyltt opened this issue 2 years ago • 2 comments

Hello,

I found that you add a placeholder token <PH> to the tokenizer using the following code:

  tokenizer._add_tokens(["<PH>"], special_tokens=True)
  tokenizer.placeholder_token = "<PH>"

And the placeholder token is used in the following code:

    encoded_ph = tokenizer.convert_tokens_to_ids(tokenizer.placeholder_token)
    
    if len(truncated_rewrite) > len(truncated_query):
        truncated_query   += [encoded_ph] * (len(truncated_rewrite) - len(truncated_query))
    else:
        truncated_rewrite += [encoded_ph] * (len(truncated_query) - len(truncated_rewrite))

However, the index of this placeholder token exceeds the size of the pre-trained vocabulary, so there is no embedding representation for this token in the embedding table. How can this problem be solved? Do I need to replace the placeholder token with an existing token from the vocab? If so, what should I replace it with?
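
For reference, here is a minimal sketch of what I mean (my own code, not from the repo, assuming a generic Hugging Face checkpoint such as bert-base-uncased): the newly added token gets an id that points one past the last row of the pre-trained embedding matrix.

    from transformers import AutoModel, AutoTokenizer

    # Hypothetical checkpoint for illustration; the repo may use a different model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    tokenizer.add_tokens(["<PH>"], special_tokens=True)
    ph_id = tokenizer.convert_tokens_to_ids("<PH>")

    # The new token is appended at the end of the vocabulary, so its id equals the
    # original vocab size and falls outside the pre-trained embedding table.
    print(ph_id)                                         # e.g. 30522 for bert-base-uncased
    print(model.get_input_embeddings().weight.shape[0])  # 30522 -> ph_id is out of range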

xyltt avatar Dec 25 '22 14:12 xyltt

Hello xyltt, thanks for the detailed question.

As you said, I added the special token to the tokenizer's vocabulary, and the model learns an embedding representation for it during the training phase via

model.resize_token_embeddings(len(tokenizer))

You can refer to this line. Note that I didn't use this call with an older version of the Transformers library (I'm not sure why exactly, but it worked without issues), but it is required in the current versions.
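
To make that concrete, here is a minimal sketch (again assuming a generic checkpoint such as bert-base-uncased, not the exact setup of this repo): after resizing, the <PH> token has its own embedding row, randomly initialized and then learned during fine-tuning.

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    tokenizer.add_tokens(["<PH>"], special_tokens=True)

    # Grow the input embedding matrix to the new vocabulary size; the row for <PH>
    # is randomly initialized here and trained along with the rest of the model.
    model.resize_token_embeddings(len(tokenizer))

    ph_id = tokenizer.convert_tokens_to_ids("<PH>")
    assert ph_id < model.get_input_embeddings().weight.shape[0]  # <PH> now has a row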

Thanks for your attention to our work.

gankim avatar Jan 02 '23 03:01 gankim

Thanks for your reply!

I also have some questions about the coqa dataset. I want to make sure that the released code is also applicable to coqa. I found that the "class_num" for coqa isn't equal to the "class_num" for quac, so what should "class_num" be for coqa? Also, the fourth label is ignored when calculating the "class_loss", as in the following code:

    else:  # coqa
        class_loss_fct = CrossEntropyLoss(ignore_index=3)
        class_loss = class_loss_fct(class_logits, is_impossible)

I want to know why, and what the ignored label is.
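
For reference, here is a minimal PyTorch illustration of what ignore_index=3 does (dummy tensors of my own, reusing the variable names from the snippet above): any example whose target label is 3 is simply dropped from the loss and contributes no gradient.

    import torch
    from torch.nn import CrossEntropyLoss

    class_logits = torch.randn(4, 4)            # 4 dummy examples, 4 classes
    is_impossible = torch.tensor([0, 1, 3, 2])  # the example labeled 3 is ignored

    class_loss_fct = CrossEntropyLoss(ignore_index=3)
    class_loss = class_loss_fct(class_logits, is_impossible)
    print(class_loss)  # averaged over the 3 examples whose label is not 3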

I also noticed that "class_logits" is not used during inference for the quac dataset. Is "class_logits" used during inference for the coqa dataset?

xyltt avatar Jan 02 '23 08:01 xyltt