Failing assert len(input_ids) == max_seq_length in data_process.get_context_representation()
It looks like this function will happily build a sequence longer than the specified max_seq_length, which is 32 in the provided models.
In your own testing did you use a heuristic to truncate some combination of the left context, mention, and right context?
Thank you!
On a related note, are there trained models available that have a higher max_seq_length than 32?
I have more information on this issue. It occurs when you have a mention string that has exactly max_seq_length - 2 tokens.
This means that left_quota gets set to 0 (which makes sense since you have no room for the context). However, things break down in the following code block:
context_tokens = (
context_left[-left_quota:] + mention_tokens + context_right[:right_quota]
)
In this case, context_left[-left_quota:] evaluates to context_left[0:], which grabs the entire left context instead of none of it, which is what left_quota == 0 should imply.
To be fair, a legitimate mention string is highly unlikely to have 30 tokens, so I will filter what I am passing to BLINK to drop overly long mentions. That said, if you want to avoid crashing on the assert, you may want to handle the case where left_quota == 0 explicitly.
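For anyone else hitting this, the gotcha is easy to reproduce in isolation: since -0 == 0 in Python, a slice like tokens[-quota:] with quota == 0 returns the whole list rather than an empty one. A sketch of the behavior and a guarded workaround (take_last is a hypothetical helper, not part of the BLINK codebase):

```python
context_left = ["tok_a", "tok_b", "tok_c"]

# With a positive quota, the negative slice behaves as intended.
left_quota = 2
assert context_left[-left_quota:] == ["tok_b", "tok_c"]

# With quota == 0, -0 == 0, so [-0:] is [0:] and returns everything.
left_quota = 0
assert context_left[-left_quota:] == ["tok_a", "tok_b", "tok_c"]

def take_last(tokens, quota):
    """Return the last `quota` tokens, or nothing when quota is 0."""
    return tokens[-quota:] if quota > 0 else []

assert take_last(context_left, 0) == []
assert take_last(context_left, 2) == ["tok_b", "tok_c"]
```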
@rogerbock Thanks for raising the issue! Yes, please filter out long mention strings for now. We'll fix the corner case where left_quota == 0.
@ledw, @ledw-2: Can you please comment on availability of pre-trained models for max_seq_length greater than 32?
I think 32 captures very little of the context surrounding the entity, especially since BERT's tokenizer typically splits a word into 2-3 tokens. So, in effect, you are only getting at most 5-10 words on either side of the mention, which doesn't capture much context.
@rogerbock: Were you able to freshly train a new model with longer context?
Thanks!
@rogerbock: There is an open PR for the bug you reported: https://github.com/facebookresearch/BLINK/pull/98
With regards to training a new model with a higher context length, yes, it is indeed possible to do so. I would recommend training a zero-shot learning (zeshel) model first just to get the hang of the training. The scripts to download and pre-process the zeshel data are in the repository. You can then replicate the same steps, modify any hyperparameters (such as context length), and train your own model.