Dan Kondratyuk
Aside from using FLAIR's specific implementation, there could be a lot of use in creating a generic sentence-level character encoder. I've seen a slightly different formulation here: https://arxiv.org/abs/1805.08237. The authors...
I looked at this a bit more and noticed a potential issue with implementing an indexer. The `tokens_to_indices` method in an indexer accepts a list of `Token` objects, but this...
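To make that concrete, here's a rough sketch of the constraint as I understand it (the exact `tokens_to_indices` signature may differ between AllenNLP versions, and the indexer class itself is hypothetical):

```python
from typing import Dict, List

from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary


class SentenceCharacterIndexer:
    """Hypothetical character-level indexer: by the time we get here,
    only the pre-tokenized `tokens` are available."""

    def tokens_to_indices(
        self, tokens: List[Token], vocabulary: Vocabulary, index_name: str
    ) -> Dict[str, List[int]]:
        # We can index the characters of each individual token, but the raw
        # sentence string (and therefore the original whitespace) is gone.
        characters = [char for token in tokens for char in token.text]
        return {
            index_name: [
                vocabulary.get_token_index(char, "characters") for char in characters
            ]
        }
```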
Yes, but the BERT wordpieces ignore tokenized whitespace, while FLAIR uses it. Currently, indexers all assume the input is pre-tokenized, but we need the raw text with the whitespace. But...
To compute the offsets, we also need to know the word boundaries from the tokenized text. That's two pieces of information, but `List[Token]` only carries one.
No, but you would either (1) add boundary separator tokens beforehand, or (2) make assumptions about how the text was originally tokenized. For instance, if you have the tokens ["go",...
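To spell out the assumption in (2): a naive reconstruction would just re-join the tokens with single spaces, which collapses differently-spaced inputs into the same string (the helper below is only a sketch, not part of any existing indexer):

```python
from typing import List

from allennlp.data.tokenizers import Token


def reconstruct_text(tokens: List[Token]) -> str:
    # Option (2): assume exactly one space between every pair of tokens.
    # This is wrong whenever the tokenizer split off punctuation,
    # e.g. "go." -> ["go", "."] comes back as "go .".
    return " ".join(token.text for token in tokens)


print(reconstruct_text([Token("go"), Token(".")]))  # "go ." (original may have been "go.")
```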
Yes, that's exactly right, but to compute word-level embeddings, you also need to return indices representing the span of each word. In the case of `["g", "o", "."]`, it would...
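For instance, under my reading (with no separator between the two words), the inclusive character spans would look like:

```python
# Inclusive character spans of each word for the characters ["g", "o", "."]:
characters = ["g", "o", "."]
word_spans = [(0, 1), (2, 2)]  # "go" covers chars 0-1, "." covers char 2
```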
That's assuming you tokenized with Spacy. But what if I tokenized with my own tokenizer, or my text is pre-tokenized? Hence, the options I listed above.
I guess we can leave it at that, then. But I was hoping to create a generic sentence-level character encoder that I could use with any dataset. E.g., I...
And again, if we go with the Spacy tokenizer, we may still need to modify the `tokens_to_indices` method to pass in either an extra `offsets` parameter or a `Sentence` object...
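Roughly, I mean one of these two variants (both signatures are hypothetical, not anything that exists today):

```python
from typing import Dict, List, Optional, Tuple

from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary


class SentenceCharacterIndexer:
    # Variant (a): thread the character offsets through as an extra argument.
    def tokens_to_indices(
        self,
        tokens: List[Token],
        vocabulary: Vocabulary,
        index_name: str,
        offsets: Optional[List[Tuple[int, int]]] = None,
    ) -> Dict[str, List[int]]:
        ...

    # Variant (b): accept the raw text alongside the tokens, similar in
    # spirit to passing FLAIR's `Sentence` object.
    def sentence_to_indices(
        self, text: str, tokens: List[Token], vocabulary: Vocabulary, index_name: str
    ) -> Dict[str, List[int]]:
        ...
```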
It's entirely possible to do this automatically without needing to modify the current dataset readers. Maybe it would be more useful as a utility function. In any case, it's no...
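Something like a standalone helper that an indexer could call on its own, so dataset readers keep passing a plain `List[Token]`; this is only a sketch built on the single-space assumption from above:

```python
from typing import List, Tuple

from allennlp.data.tokenizers import Token


def tokens_to_sentence(tokens: List[Token]) -> Tuple[str, List[Tuple[int, int]]]:
    """Utility sketch: rebuild an (assumed) raw sentence string and the inclusive
    character span of each token, without touching any dataset reader."""
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    position = 0
    for i, token in enumerate(tokens):
        if i > 0:
            pieces.append(" ")  # assumed separator between tokens
            position += 1
        spans.append((position, position + len(token.text) - 1))
        pieces.append(token.text)
        position += len(token.text)
    return "".join(pieces), spans


text, spans = tokens_to_sentence([Token("Time"), Token("flies"), Token(".")])
# text == "Time flies .", spans == [(0, 3), (5, 9), (11, 11)]
```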