Include Flair Embeddings

Open mayhewsw opened this issue 5 years ago • 31 comments

It would be great to see a token embedder for Flair embeddings. They have released an extensive toolkit, including pretrained models, so in theory it could be straightforward to incorporate them.

A complication is that they operate at the character level over the entire sentence, so in order to get word embeddings, one needs to include spans indicating the character offsets of each word. The actual values are quite different, but the idea is similar in principle to the BERT offsets. Presumably there would need to be a Flair token indexer as well.

mayhewsw avatar May 06 '19 20:05 mayhewsw
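
As background for the offset question above, here is a minimal usage sketch of the flair package on its own, following flair's documented API at the time ("news-forward" is one of its published pretrained character LMs). It shows that the character LM runs over the whole sentence and attaches one vector per token.

```python
# Minimal sketch of flair's own usage (not AllenNLP code), following flair's
# documented API; "news-forward" is one of its pretrained character LMs.
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedder = FlairEmbeddings("news-forward")
sentence = Sentence("The grass is green .")

# The character LM runs over the entire sentence; each token then carries a
# vector read off at its character offsets.
embedder.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)
```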

We'd certainly be open to taking a PR for this, assuming the issue you highlight can be resolved in a clean fashion. The other issue that we're unsure about is whether this would entail adding Flair itself as a dependency, which we would like to avoid. Thank you.

brendan-ai2 avatar May 10 '19 22:05 brendan-ai2

I've made a quick and dirty implementation, but it does indeed add Flair as a dependency, which I totally agree is not so clean. That said, the code is relatively simple, perhaps it could be implemented directly in allennlp.

mayhewsw avatar May 11 '19 00:05 mayhewsw

So, yeah, given that, I'd say there are two options: (1) keep this as a separate add-on to allennlp that adds a few Registrable components if you want them, so we don't add the dependency directly to allennlp (bonus if it's also pip-installable), or (2) do whatever needs to be done so we can load and use flair embeddings without having to import flair. I have nothing against flair; we just already get a bunch of complaints about too many dependencies in the core library, and requests to split things out.

matt-gardner avatar May 11 '19 04:05 matt-gardner

Not sure I understand option 1: "separate add-on" means, for example, my code stays in my repo, but can be easily added to allennlp (maybe with pip)? I like this idea.

mayhewsw avatar May 11 '19 16:05 mayhewsw

Yeah, it's basically like a separate allennlp-contrib repo. We've talked about maintaining one of these ourselves, but I don't think we're ready to do that at this point - maybe someday we'll split things out a bit more, and then something like this would make sense for us to do. But if you want to maintain a repo with additional pip-installable components, I'd say go for it. I think all you would have to do would be to use --include-package with whatever package you pip installed.

matt-gardner avatar May 11 '19 16:05 matt-gardner

Aside from using FLAIR's specific implementation, there could be a lot of use in creating a generic sentence-level character encoder. I've seen a slightly different formulation here: https://arxiv.org/abs/1805.08237. The authors concatenate all four edge states for each word, while FLAIR only concatenates two of the four states.

It seems like character-level word embeddings computed over the entire sentence can offer a boost in evaluation performance over embeddings computed on each word in isolation, even without pretraining with an LM.

Hyperparticle avatar May 13 '19 16:05 Hyperparticle
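
To make the edge-state idea above concrete, here is a rough sketch (not FLAIR's or the linked paper's actual code) of a sentence-level character BiLSTM that builds word embeddings from edge states. The class name and arguments are illustrative, and offsets are assumed to be inclusive (start, end) character indices per word.

```python
import torch
import torch.nn as nn

class CharSentenceEncoder(nn.Module):
    """Rough sketch of a sentence-level character encoder (illustrative only)."""

    def __init__(self, num_chars: int, char_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        self.char_embedding = nn.Embedding(num_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, char_ids: torch.Tensor, offsets):
        # char_ids: (1, num_chars_in_sentence); offsets: [(start, end), ...] per word
        states, _ = self.lstm(self.char_embedding(char_ids))  # (1, n_chars, 2 * hidden)
        hidden = states.size(-1) // 2
        fwd, bwd = states[..., :hidden], states[..., hidden:]
        words = []
        for start, end in offsets:
            # FLAIR-style: concatenate the forward state at the word's last character
            # and the backward state at its first character (two of the four edge
            # states); the formulation in the linked paper concatenates all four.
            words.append(torch.cat([fwd[0, end], bwd[0, start]], dim=-1))
        return torch.stack(words)  # (num_words, 2 * hidden)
```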

I looked at this a bit more and noticed a potential issue with implementing an indexer. The tokens_to_indices method in an indexer accepts a list of Token objects, but this is insufficient to represent the information we need. I.e., the embedder needs to know (1) the word tokens (or alternatively the character offsets) that segment the raw text and (2) the raw text itself. If we just have the word tokens, then we are missing information about separator tokens like whitespace (or no separator). If we just have the raw text, we can't compute offsets for each word.

Unless I'm missing something obvious, there would be required changes along the lines of:

  1. Replace List[Token] with a Sentence object which can optionally store the raw text. Then the indexer could use a simple algorithm that scans the raw text for each token's substring to find the offsets. Alternatively, the Sentence could compute this internally.

  2. Require List[Token] to be formatted a certain way. For instance, each Token represents a character in the raw text, with special [WORD_START]/[WORD_END] tokens that denote word boundaries. This would need a custom tokenizer which may not work with current DatasetReaders without code changes.

  3. Precompute the offsets when tokenizing. This is likely not easily interoperable with existing tokenizers.

  4. Ignore any intermediate tokens entirely and just use one space between each word. This would be the simplest and require no interface changes. But it might also cause issues when integrating pretrained FLAIR models, as they were trained with the raw text in mind.

I'm inclined to choose 1., as it would work well with existing tokenizers and dataset readers and would be easier to change in the future. Some datasets already supply tokenized words and so do not have the raw text available. In that case, approximating the raw text by adding a default space separator between word tokens as in 4. could be a compromise (see the sketch after this comment).

What do you all think? @joelgrus @matt-gardner

Hyperparticle avatar May 22 '19 20:05 Hyperparticle
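
A minimal sketch of the compromise mentioned above: when only pre-tokenized words are available, approximate the raw text by joining with single spaces and record each word's inclusive (start, end) character span. The function name is illustrative, not part of any existing API.

```python
from typing import List, Tuple

def spans_from_tokens(tokens: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    # Assumes exactly one space between word tokens, as in option 4 above.
    text, spans, cursor = " ".join(tokens), [], 0
    for token in tokens:
        start = cursor
        end = start + len(token)
        spans.append((start, end - 1))  # inclusive end index into `text`
        cursor = end + 1  # skip the assumed single space
    return text, spans

# spans_from_tokens(["go", "."]) -> ("go .", [(0, 1), (3, 3)])
```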

I don't know much about Flair embeddings, but I took a quick look at their paper and it looks like they're just doing character-level embeddings and then taking the hidden state after the last character of each word? This doesn't seem conceptually different from what we're doing for e.g. BERT, where we get one embedding per wordpiece and then (potentially) take the first or last embedding for each word?

joelgrus avatar May 22 '19 20:05 joelgrus

Yes, but the BERT wordpieces ignore tokenized whitespace, while FLAIR uses it. Currently, indexers all assume the input is pre-tokenized, but we need the raw text with the whitespace. But we also need to know where the word boundaries are.

Hyperparticle avatar May 22 '19 20:05 Hyperparticle

wouldn't you just use the character tokenizer (which would keep spaces) and then compute the offsets in the token indexer?

joelgrus avatar May 22 '19 20:05 joelgrus

To compute the offsets, we also need to know the word boundaries from the tokenized text as well. We need two pieces of information, but List[Token] only allows for one.

Hyperparticle avatar May 22 '19 20:05 Hyperparticle

are the rules for word boundaries that complicated that you couldn't just include them in the token indexer?

joelgrus avatar May 22 '19 20:05 joelgrus

No, but you would either (1) add boundary separator tokens beforehand, or (2) make assumptions about how the text was originally tokenized. For instance, if you have the tokens ["go", "."], was the raw text "go." or "go ."? Might not make a huge difference, but it's something to consider.

Hyperparticle avatar May 22 '19 21:05 Hyperparticle

what does "originally tokenized" mean here?

say I have a sentence "go."

I feed that to the character tokenizer and get ["g", "o", "."]

if the sentence were "go .", I would get ["g", "o", " ", "."]

joelgrus avatar May 22 '19 21:05 joelgrus

Yes, that's exactly right, but to compute word-level embeddings, you need to also return indices representing the span of each word.

In the case of ["g", "o", "."], it would be something like [(0, 1), (2, 2)].

In the case of ["g", "o", " ", "."], it would be [(0, 1), (3, 3)].

Where can we compute these boundaries? From the list of tokenized words. But with the tokenized words alone, e.g., ["go", "."], we won't know whether we have the first case or the second. So what I'm saying is that to faithfully represent the original sentence, we need both the token-level information that captures word boundaries and the raw text that represents the characters of the words and between the words (which tokenization erases). Right now, the tokens_to_indices method doesn't give us an easy way to pass both pieces of information. Unless, of course, we just make a simple assumption such as that adjacent word tokens are always separated by a single space (e.g., we always compute character-level info on ["g", "o", " ", "."]).

Hyperparticle avatar May 22 '19 21:05 Hyperparticle

ok, I think I get it now. but the spacy tokenizer is already returning the offsets as token.idx:

In [11]: t = WordTokenizer()                                                                            

In [12]: tokens = t.tokenize("This isn't it, chief.")                                                   

In [13]: for token in tokens: 
    ...:     print(token.idx, token) 
    ...:                                                                                                
0 This
5 is
7 n't
11 it
13 ,
15 chief
20 .

is that not sufficient for the token indexer?

joelgrus avatar May 22 '19 22:05 joelgrus
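
For illustration, a small hypothetical helper showing how those token.idx values translate into the inclusive character spans a FLAIR-style embedder would need:

```python
from typing import List, Tuple

def offsets_from_idx(tokens) -> List[Tuple[int, int]]:
    # Assumes each token's idx field was populated by the tokenizer.
    spans = []
    for token in tokens:
        start = token.idx
        spans.append((start, start + len(token.text) - 1))  # inclusive end
    return spans

# For the tokens of "This isn't it, chief." above, this yields
# [(0, 3), (5, 6), (7, 9), (11, 12), (13, 13), (15, 19), (20, 20)].
```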

That's assuming you tokenized with Spacy. But what if I tokenized with my own tokenizer, or my text is pre-tokenized? Hence, the options I listed above.

Hyperparticle avatar May 22 '19 22:05 Hyperparticle

if your text is pre-tokenized you're out of luck in any case.

I am extremely comfortable enforcing "if you want to use flair embeddings, you must use a tokenizer that generates offsets (e.g. the default WordTokenizer)"; that's much simpler than just about any other solution.

joelgrus avatar May 22 '19 22:05 joelgrus

I guess we can leave it at that, then.

But I was hoping to create a generic sentence-level character encoder that I could use with any dataset. E.g., I primarily use Universal Dependencies, whose data already comes tokenized out of the box. Should I be forced to modify my dataset reader and tokenizers to be able to work with FLAIR? Or can we add a simple function that reconstructs the offsets from the given tokenization, if possible? In the case of no raw text available, assuming one space between each word token could be sufficient.

Hyperparticle avatar May 22 '19 22:05 Hyperparticle

And again, if we go with the Spacy tokenizer, we may still need to modify the tokens_to_indices method to either pass in an extra offsets parameter or a Sentence object containing those offsets.

Hyperparticle avatar May 22 '19 22:05 Hyperparticle

in this case your DatasetReader must be (I assume) somehow creating Token objects to populate a TextField? in which case I'd say that yes it's the dataset reader's job to populate the idx fields of those tokens. if you're primarily using the same dataset, then that's just a small one-time hit to write that code?

joelgrus avatar May 22 '19 22:05 joelgrus

It's entirely possible to do this automatically without needing to modify the current dataset readers. Maybe it would be more useful as a utility function. In any case, it's no big deal.

Then my only remaining concern is how to pass both the character tokens and the offsets to the indexer. It will require a change to the indexer interface.

Hyperparticle avatar May 22 '19 22:05 Hyperparticle

look at how TokenCharactersIndexer.tokens_to_indices works:

https://github.com/allenai/allennlp/blob/master/allennlp/data/token_indexers/token_characters_indexer.py#L74

you'd basically just do that, except that you'd have to grab each token.idx and generate a second vector of offsets to return.

in fact, you could probably just add a new parameter to that token indexer

compute_offsets: bool = False

that if it's true it does that, and then you don't even need to write a new token indexer

joelgrus avatar May 22 '19 22:05 joelgrus
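
A hedged sketch of what that compute_offsets idea could produce; the names and signature here are illustrative rather than AllenNLP's actual indexer API, and it assumes the tokenizer populated token.idx:

```python
from typing import Dict

def flair_style_indices(tokens, raw_text: str, char_vocab: Dict[str, int],
                        compute_offsets: bool = True) -> Dict[str, list]:
    # Index the raw sentence characters, since the character LM sees the whole string.
    output = {"characters": [char_vocab.get(c, 0) for c in raw_text]}
    if compute_offsets:
        # token.idx is the start offset into raw_text; the end index is inclusive.
        output["offsets"] = [(t.idx, t.idx + len(t.text) - 1) for t in tokens]
    return output
```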

Ah, makes sense now. Thanks!

Hyperparticle avatar May 22 '19 22:05 Hyperparticle

@mayhewsw Would you be willing to share your implementation of including flair embeddings or some pointers on how you did it?

zeeshansayyed avatar Jul 09 '19 22:07 zeeshansayyed

@zeeshansayyed At risk of embarrassing myself, here's a gist with my quick and dirty implementation: https://gist.github.com/mayhewsw/26939faf0a7190a6d174893a31ba0ac8

mayhewsw avatar Jul 15 '19 19:07 mayhewsw

@dirkgr @matt-gardner if I open a PR based on @mayhewsw work is something that would be accepted?

bratao avatar May 23 '20 23:05 bratao

(Fine with me, fwiw)

mayhewsw avatar May 23 '20 23:05 mayhewsw

Browsing over the code in the gist, I assume the scope of this is just to create the embeddings, but not to make it trainable, right?

dirkgr avatar May 25 '20 04:05 dirkgr

@dirkgr yes, this would be an embedding generator only.

It is possible, but very tricky, to implement it using only AllenNLP, at least for me, since it is a character LM that uses the embedding of the first whitespace character after a word.

But the performance for NER is way better than anything else I tested.

bratao avatar May 25 '20 04:05 bratao