Implement CopyNext model
https://api.semanticscholar.org/CorpusID:225103260
This is similar to CopyNet from Gu et al., 2016, but adds an inductive bias for copying contiguous tokens.

~~The implementation is actually just a simple extension of CopyNet: CopyNext introduces a special symbol in the target vocabulary - the "CopyNext Symbol" (CN) - which corresponds to the operation of copying another token. So the CN token always follows either a copied token that is the first in its span or another CN token.~~ -- As @JohnGiorgi pointed out below, CopyNext uses a separate linear layer to calculate the "CN" score. And there are some other differences with this model as well. Most notably, they make a big simplification by treating all tokens in the source sequence as unique, and not worrying about the case where a source token may be part of the target vocabulary. This makes sense for the task that CopyNext was applied to (nested NER), but not for general seq2seq tasks.
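For example (a toy illustration of my own, not taken from the paper), copying a contiguous span from the source corresponds to one ordinary copy followed by CN operations:

```python
# Toy illustration only (my own example, not from the CopyNext paper).
source = ["New", "York", "City", "is", "big"]
# Copying the contiguous span "New York City" would be expressed as one
# ordinary copy of "New" followed by two CopyNext (CN) operations:
predicted_ops = [("copy", 0), ("CN",), ("CN",)]  # -> "New York City"
```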
I'd be very interested in using this model for my own project. I think I could probably take this on. Do you have a rough idea on where the modifications would go and what they would look like?
In the meantime I'll dig into the existing code and CopyNet/CopyNext.
That would be great @JohnGiorgi!
So, I think you want to start by adding a parameter to the `CopyNetDatasetReader` and `CopyNet` model classes for the "CN" token: `copy_next_token: Optional[str] = None` (or something like that). By default this is `None`, meaning the model is just regular `CopyNet`.
Now I could be wrong, but I think the only place in the model implementation that would need to be updated is the `_get_predicted_tokens` method: https://github.com/allenai/allennlp-models/blob/main/allennlp_models/generation/models/copynet_seq2seq.py#L853, so maybe this isn't as hard as I thought. I don't think the loss calculation needs to change in any way. But you may want to double check the "CopyNext" paper just to make sure. Let me know if you have any other questions!
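On the data side, I imagine the reader would need to rewrite target sequences so that tokens which continue a copied span become the CN token (that way the loss really can stay the same). Something along the lines of this rough sketch, where the function name and the simplistic string matching are just placeholders, not the actual `CopyNetDatasetReader` logic:

```python
from typing import List


def insert_copy_next_tokens(
    source_tokens: List[str],
    target_tokens: List[str],
    copy_next_token: str = "@CN@",
) -> List[str]:
    """Rewrite target tokens that continue a contiguous copied span as the CN token.

    Simplistic sketch: matches tokens by string equality, ignores repeated
    source tokens, and ignores overlap with the target vocabulary.
    """
    rewritten: List[str] = []
    prev_source_index = None
    for token in target_tokens:
        if (
            prev_source_index is not None
            and prev_source_index + 1 < len(source_tokens)
            and source_tokens[prev_source_index + 1] == token
        ):
            # This token continues the span copied at the previous step.
            rewritten.append(copy_next_token)
            prev_source_index += 1
        else:
            # Treat an exact match in the source as a copy; otherwise it's a
            # generated token.
            prev_source_index = source_tokens.index(token) if token in source_tokens else None
            rewritten.append(token)
    return rewritten
```

For example, `insert_copy_next_tokens(["New", "York", "City", "is", "big"], ["I", "love", "New", "York", "City"])` would give `["I", "love", "New", "@CN@", "@CN@"]`.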
Okay, I think I get the basic idea:

1. Introduce `copy_next_token: Optional[str] = None` as an additional argument to `CopyNet`. I believe this token would need to be added to `self._target_namespace`, because it's something the model generates (right?). So in `__init__`, call `self._copy_next_index = self.vocab.add_token_to_namespace(copy_next_token, self._target_namespace)`.
2. Modify `_get_predicted_tokens` so that when `self._copy_next_index` is predicted, we take the predicted token to be the token in the source immediately after the one that was previously copied. The relevant part of `_get_predicted_tokens` would look something like:
```python
for index in indices:
    # The model predicted the CopyNext token (CN).
    # Take the token to be the next adjacent token in source_tokens.
    if index == self._copy_next_index:
        # TODO: Get the position of the previously copied token in source_tokens.
        prev_copy_source_index = ...  # ?
        # TODO: Take the token to be the one in the source immediately after the
        # previously copied one; call it "copy_next_source_index" for now.
        copy_next_source_index = prev_copy_source_index + 1
        token = metadata["source_tokens"][copy_next_source_index]
    elif index >= self._target_vocab_size:
        adjusted_index = index - self._target_vocab_size
        token = metadata["source_tokens"][adjusted_index]
    else:
        token = self.vocab.get_token_from_index(index, self._target_namespace)
    tokens.append(token)
```
A couple of questions...

- `_get_predicted_tokens` currently only accumulates the `tokens` (on this line). I think for this to work, we need to also accumulate the index in `source_tokens` that each copy operation chose. I am stuck on how to do this.
- Three edge cases come to mind where the model predicts `copy_next_token`, but the copy next operation doesn't make sense:
  - when the model's very first prediction is `copy_next_token`.
  - when the model predicts `copy_next_token` right after generating something from the target vocabulary.
  - when the model copies the very last token in `source_tokens`, and then generates `copy_next_token`.

One solution I can think of is masking out `self._copy_next_index` in the decoder's softmax in each of these situations, so it cannot erroneously choose the copy next token. Does this sound reasonable? AFAICT that is what CopyNext is doing (see Figure 3 of their paper).
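Roughly, I am picturing something like this for the masking (just a sketch; `copy_next_allowed` is a hypothetical per-batch mask that the decoding loop would have to maintain, it does not exist in the current code):

```python
import torch


def mask_copy_next_scores(
    all_scores: torch.Tensor,
    copy_next_index: int,
    copy_next_allowed: torch.BoolTensor,
) -> torch.Tensor:
    """Disallow the CN prediction wherever the copy next operation is invalid.

    all_scores: (batch_size, num_choices), the scores fed to the softmax.
    copy_next_allowed: (batch_size,), False at the edge cases listed above.
    """
    all_scores = all_scores.clone()
    all_scores[~copy_next_allowed, copy_next_index] = torch.finfo(all_scores.dtype).min
    return all_scores
```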
> I believe this token would need to be added to `self._target_namespace` (right?)

Yeup!

> Modify `_get_predicted_tokens` so that when `self._copy_next_index` is predicted, we take the predicted token to be the token in the source immediately after the one that was previously copied

Exactly.

For your questions:

- For a copied token, `adjusted_index` is the index of the token with respect to the source sequence: https://github.com/allenai/allennlp-models/blob/3624fdc4b249d3b515269ee99e1b1e15bf876829/allennlp_models/generation/models/copynet_seq2seq.py#L877. So then you could just increment that when the next token is a `copy_next_token` (you would have to keep track of whether or not you are in one of your edge cases); a rough sketch is below.
- In practice I don't think any of these would be much of an issue. During training the model will quickly learn that the `copy_next_token` only follows the `copy_token` or another `copy_next_token`. Masking it out certainly makes sense, but that could be tricky to implement efficiently.
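For the first question, here is a rough sketch of how the loop from your comment might keep track of the last copied source position (`prev_copy_source_index` is just a placeholder name, and this glosses over your edge cases):

```python
prev_copy_source_index = None
for index in indices:
    if index == self._copy_next_index and prev_copy_source_index is not None:
        # CN: copy the token immediately after the previously copied one.
        prev_copy_source_index += 1
        token = metadata["source_tokens"][prev_copy_source_index]
    elif index >= self._target_vocab_size:
        # An ordinary copy: remember where in the source it came from.
        prev_copy_source_index = index - self._target_vocab_size
        token = metadata["source_tokens"][prev_copy_source_index]
    else:
        # A token generated from the target vocabulary.
        prev_copy_source_index = None
        token = self.vocab.get_token_from_index(index, self._target_namespace)
    tokens.append(token)
```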
Got it, thanks for the guidance. I think the next steps are clear; I will work on this and open a PR to `allennlp-models` soon!
Actually, I am thinking this might be a good bit more complicated than just updating `_get_predicted_tokens`.

In CopyNext, the model assigns a score to the copy next operation by passing the decoder's current hidden state through a linear layer. If we wanted to match this, we would need to:

Add a new linear layer at model initialization:

```python
self._output_copy_next_layer = Linear(self.decoder_output_dim, 1)
```

Add a method to compute the copy next score:

```python
def _get_copy_next_scores(self, state: Dict[str, torch.Tensor]) -> torch.Tensor:
    return self._output_copy_next_layer(state["decoder_hidden"])
```
AFAICT the copy next score is based solely on a linear transformation of the decoder hidden state. This is what I gather from the figure:

[screenshot of the relevant figure from the CopyNext paper]

and the text:

[screenshot of the relevant passage from the CopyNext paper]
I think this significantly complicates things downstream, particularly anywhere `all_scores` is being calculated, and also in detecting when the copy next operation was the most likely (over generating and copying) and filling in the copy next token for those timesteps.
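Concretely, I imagine the combined scores would end up looking something like this (just a sketch; the names `generation_scores` and `copy_scores` and the ordering of the concatenation are assumptions on my part, not the actual code):

```python
import torch

batch_size, target_vocab_size, source_length = 2, 100, 15
# Dummy tensors standing in for the per-timestep scores (shapes illustrative):
generation_scores = torch.randn(batch_size, target_vocab_size)
copy_next_scores = torch.randn(batch_size, 1)          # from _get_copy_next_scores
copy_scores = torch.randn(batch_size, source_length)

all_scores = torch.cat((generation_scores, copy_next_scores, copy_scores), dim=-1)
log_probs = torch.nn.functional.log_softmax(all_scores, dim=-1)
```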
Does this sound right?
Ah, yes you're right. Their implementation actually seems a lot simpler than CopyNet. And I think this is because they don't worry about the case where tokens in the source sequence may be in the target vocabulary ($\mathcal{L}$), which makes sense for the task they applied it to.
So I think we have a couple of options:

1. Continue with the original plan, doing a simple modification to CopyNet. This doesn't give us the CopyNext model, but it probably does improve CopyNet, and it may also be a better model than CopyNext for tasks where some source tokens may be in the target vocab.
2. Implement the true CopyNext model from scratch.
3. Do both 1 and 2 🤷‍♂️
What do you think?
What if I go for option 1 right now? I think I could do this inside a week. I could test it on my own task (joint NER and RE), which I currently use CopyNet for, and see if there's a performance boost.
I think I would still be interested in trying out CopyNext, so I could attempt to implement it from scratch as well. This might take me a little longer though. The authors list a GitHub in the paper (https://github.com/abhinonymous/copynext), but it gives me a 404. I will reach out and see if there's a public implementation somewhere.
Sounds good @JohnGiorgi!
@JohnGiorgi / @epwalsh I'm interested in CopyNext as well, I may be able to assist with 2 once I do a bit more research into what's involved.