Dan Kondratyuk
Aside from using FLAIR's specific implementation, there could be a lot of use in creating a generic sentence-level character encoder. I've seen a slightly different formulation here: https://arxiv.org/abs/1805.08237. The authors...
I looked at this a bit more and noticed a potential issue with implementing an indexer. The `tokens_to_indices` method in an indexer accepts a list of `Token` objects, but this...
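To make that concrete, here's a rough sketch of the constraint as I understand it (the exact `tokens_to_indices` signature may differ between AllenNLP versions, and the indexer class itself is hypothetical):

```python
from typing import Dict, List

from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary


class SentenceCharacterIndexer:
    """Hypothetical character-level indexer: by the time we get here,
    only the pre-tokenized `tokens` are available."""

    def tokens_to_indices(
        self, tokens: List[Token], vocabulary: Vocabulary, index_name: str
    ) -> Dict[str, List[int]]:
        # We can index the characters of each individual token, but the raw
        # sentence string (and therefore the original whitespace) is gone.
        characters = [char for token in tokens for char in token.text]
        return {
            index_name: [
                vocabulary.get_token_index(char, "characters") for char in characters
            ]
        }
```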
Yes, but the BERT wordpieces ignore tokenized whitespace, while FLAIR uses it. Currently, indexers all assume the input is pre-tokenized, but we need the raw text with the whitespace. But...
To compute the offsets, we also need to know the word boundaries from the tokenized text. That's two pieces of information, but `List[Token]` only carries one.
No, but you would either (1) add boundary separator tokens beforehand, or (2) make assumptions about how the text was originally tokenized. For instance, if you have the tokens ["go",...
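To spell out the assumption in (2): a naive reconstruction would just re-join the tokens with single spaces, which collapses differently-spaced inputs into the same string (the helper below is only a sketch, not part of any existing indexer):

```python
from typing import List

from allennlp.data.tokenizers import Token


def reconstruct_text(tokens: List[Token]) -> str:
    # Option (2): assume exactly one space between every pair of tokens.
    # This is wrong whenever the tokenizer split off punctuation,
    # e.g. "go." -> ["go", "."] comes back as "go .".
    return " ".join(token.text for token in tokens)


print(reconstruct_text([Token("go"), Token(".")]))  # "go ." (original may have been "go.")
```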
Yes, that's exactly right, but to compute word-level embeddings, you also need to return indices representing the span of each word. In the case of `["g", "o", "."]`, it would...
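For instance, under my reading (with no separator between the two words), the inclusive character spans would look like:

```python
# Inclusive character spans of each word for the characters ["g", "o", "."]:
characters = ["g", "o", "."]
word_spans = [(0, 1), (2, 2)]  # "go" covers chars 0-1, "." covers char 2
```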
That's assuming you tokenized with Spacy. But what if I tokenized with my own tokenizer, or my text is pre-tokenized? Hence, the options I listed above.
I guess we can leave it at that, then. But I was hoping to create a generic sentence-level character encoder that I could use with any dataset. E.g., I...
And again, if we go with the Spacy tokenizer, we may still need to modify the `tokens_to_indices` method to pass in either an extra `offsets` parameter or a `Sentence` object...
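Roughly, I mean one of these two variants (both signatures are hypothetical, not anything that exists today):

```python
from typing import Dict, List, Optional, Tuple

from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary


class SentenceCharacterIndexer:
    # Variant (a): thread the character offsets through as an extra argument.
    def tokens_to_indices(
        self,
        tokens: List[Token],
        vocabulary: Vocabulary,
        index_name: str,
        offsets: Optional[List[Tuple[int, int]]] = None,
    ) -> Dict[str, List[int]]:
        ...

    # Variant (b): accept the raw text alongside the tokens, similar in
    # spirit to passing FLAIR's `Sentence` object.
    def sentence_to_indices(
        self, text: str, tokens: List[Token], vocabulary: Vocabulary, index_name: str
    ) -> Dict[str, List[int]]:
        ...
```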
It's entirely possible to do this automatically without needing to modify the current dataset readers. Maybe it would be more useful as a utility function. In any case, it's no...
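Something like a standalone helper that an indexer could call on its own, so dataset readers keep passing a plain `List[Token]`; this is only a sketch built on the single-space assumption from above:

```python
from typing import List, Tuple

from allennlp.data.tokenizers import Token


def tokens_to_sentence(tokens: List[Token]) -> Tuple[str, List[Tuple[int, int]]]:
    """Utility sketch: rebuild an (assumed) raw sentence string and the inclusive
    character span of each token, without touching any dataset reader."""
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    position = 0
    for i, token in enumerate(tokens):
        if i > 0:
            pieces.append(" ")  # assumed separator between tokens
            position += 1
        spans.append((position, position + len(token.text) - 1))
        pieces.append(token.text)
        position += len(token.text)
    return "".join(pieces), spans


text, spans = tokens_to_sentence([Token("Time"), Token("flies"), Token(".")])
# text == "Time flies .", spans == [(0, 3), (5, 9), (11, 11)]
```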