
ELMoForManyLangs as Keras Layer

Open juckeltour opened this issue 5 years ago • 13 comments

Hi!

Does anybody know how to use Embedder.sents2elmo() as a Layer for Keras?

I mean something like using the one from TensorFlow Hub, see https://www.depends-on-the-definition.com/named-entity-recognition-with-residual-lstm-and-elmo/.

Thanks in advance!

juckeltour avatar Mar 10 '19 11:03 juckeltour

You can create a Keras Sequence that applies the embedder to the input sentences, and then use model.fit_generator/predict_generator.
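
Roughly something like this (an untested sketch; the model path, max_len, and the label handling are placeholders you would adapt):

import numpy as np
from keras.utils import Sequence
from elmoformanylangs import Embedder

e = Embedder('path/to/your/model')  # placeholder model directory
max_len = 50                        # assumed fixed sequence length

class ElmoSequence(Sequence):
    def __init__(self, sentences, labels, batch_size=32):
        self.sentences = sentences  # list of token lists
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.sentences) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.sentences[idx * self.batch_size:(idx + 1) * self.batch_size]
        y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        embs = e.sents2elmo(batch)  # list of (sent_len, 1024) arrays
        x = np.zeros((len(batch), max_len, 1024), dtype='float32')
        for i, emb in enumerate(embs):
            n = min(len(emb), max_len)
            x[i, :n] = emb[:n]  # pad/truncate to max_len
        return x, np.array(y)

# model.fit_generator(ElmoSequence(train_sentences, train_labels), epochs=3)

The model's first layer would then take float inputs of shape (max_len, 1024) instead of going through an Embedding layer.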

nhatsmrt avatar Mar 12 '19 09:03 nhatsmrt

Thanks for answering! I think we can use the Lambda layer or create a custom one. I don't know how to handle the Tensors that Keras gives and return a data structure that Keras accepts.

from elmoformanylangs import Embedder
import tensorflow as tf
from keras.layers import Input, Lambda

e = Embedder('...')
sess = tf.Session()
def ElmoEmbedding(x):
    with sess.as_default():
        return tf.convert_to_tensor(e.sents2elmo(x.eval())[0]) # this does not work
...
embedding = Lambda(ElmoEmbedding, output_shape=(None, max_len, 1024))(input_text)

juckeltour avatar Mar 13 '19 13:03 juckeltour

How about hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)?
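
For reference, that TF-Hub module can be wrapped in a Lambda layer along the lines of the tutorial linked above (a rough sketch, English ELMo only; fixed batch_size and max_len are assumptions):

import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K
from keras.layers import Input, Lambda

sess = tf.Session()
K.set_session(sess)
elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

max_len, batch_size = 50, 32  # placeholders

def ElmoEmbedding(x):
    # the "tokens" signature takes padded token strings plus the true sequence lengths
    return elmo_model(inputs={"tokens": tf.squeeze(tf.cast(x, tf.string)),
                              "sequence_len": tf.constant(batch_size * [max_len])},
                      signature="tokens", as_dict=True)["elmo"]

input_text = Input(shape=(max_len,), dtype=tf.string)
embedding = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(input_text)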

muximus3 avatar Mar 14 '19 09:03 muximus3

Yeah, but I would like to use the non-English embeddings from ELMoForManyLangs, and I need to be able to retrain them with custom training data. tfhub is not an option, I think.

juckeltour avatar Mar 14 '19 13:03 juckeltour

Hey juckeltour, did you solve the problem?

I found a tutorial that comes close to the task I am trying to solve. Let me know if you find a solution.

Maybe this will help you out:

http://hunterheidenreich.com/blog/elmo-word-vectors-in-keras/

Rusiecki avatar Mar 30 '19 21:03 Rusiecki

Hi @juckeltour, have you found a way to use ELMoForManyLangs embeddings with Keras?

marinkreso95 avatar Apr 15 '19 18:04 marinkreso95

Has anyone found a solution?

ghost avatar Jun 23 '19 08:06 ghost


Hi guys, has anyone found a solution yet?

ai-nlp avatar Jul 08 '19 10:07 ai-nlp

No, I didn't solve this problem.

We switched to BERT (and pytorch)...

juckeltour avatar Jul 10 '19 10:07 juckeltour

So, I have a workaround, but it is somewhat impractical. Obtain the vocabulary of your dataset, then create an embedding file similar to a word2vec or GloVe file (i.e. one word and its 1024-dim weights per line), and then implement a custom-weight embedding layer. It worked OK for Spanish.

bazzmx avatar Jan 28 '20 14:01 bazzmx

@bazzmx do you have a code snippet?

erk4n avatar Jan 30 '20 11:01 erk4n

This is based on this blog post

First, generate a list of the unique words in your vocabulary, obtain their corresponding ELMo embeddings with elmoformanylangs' sents2elmo, and save them to a file that contains one word and its 1024-dim weights per line (this step is sketched below, after the indexing code). In this example the file is emb_table.txt, and the words that make up my vocabulary are the set of lemmas and words that will be used during training and testing.

vocab = list(set(list(words_train)+list(lemmas_train))) # unique words in your vocabulary
enumerated_vocab = enumerate(sorted(vocab), 1) # indexed words
index_labels = {} # index_no:word/label dict
for i in enumerated_vocab:
    index_labels[i[0]]=i[1] # appends words and indexes to dict
labels_index = {v:k for k,v in index_labels.items()} # reversed dict = word/labels:index
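
The step that actually writes emb_table.txt is roughly the following (the model path is a placeholder; it reuses the vocab built above):

from elmoformanylangs import Embedder

e = Embedder('path/to/your/language/model')  # placeholder, e.g. the Spanish model directory
# sents2elmo takes a list of tokenized sentences; one-word "sentences"
# yield one (1, 1024) array per vocabulary entry
vectors = e.sents2elmo([[w] for w in sorted(vocab)])
with open("./emb_table.txt", "w", encoding="utf8") as f:
    for word, vec in zip(sorted(vocab), vectors):
        f.write(word + " " + " ".join(str(x) for x in vec[0]) + "\n")

Note that embedding each word in isolation gives a single static vector per word, so the contextual part of ELMo is lost.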

With these dictionaries you then create the index-to-embedding matrix. Since you are now working with indexes, remember to convert all your words (strings) to these indexes (int32).

import numpy as np

embeddings_index = {}
with open("./emb_table.txt", encoding="utf8") as f:  # opens the generated elmo embeddings file
    for line in f:
        values = line.split()  # splits each line
        word = values[0]  # first value is the word entry
        coefs = np.asarray(values[1:], dtype='float32')  # the rest becomes the embedding array
        embeddings_index[word] = coefs  # adds the entry to the dict

Now we generate an embedding matrix putting all the pieces together:

EMBEDDING_DIM = 1024 # elmo's default size
embedding_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM)) # creates a nx1024 matrix
for word, i in labels_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

With all these elements you then add an input layer and an embedding layer to your model. The input will be int32 because each word has to be converted to its index; the embedding layer then assigns the corresponding weights to each index:

from keras.layers import Input, Embedding

input_words = Input(shape=(max_len,), dtype="int32", name="input_words")
EMB = Embedding(len(vocab) + 1,
                EMBEDDING_DIM,
                weights=[embedding_matrix],
                input_length=max_len,
                trainable=False, name="embedding")

emb_text = EMB(input_words)

Your input words should be an array of indexes, which you can convert back to words using the dictionaries created above.
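
For example (illustrative only; train_sentences stands for whatever tokenized data you have):

from keras.preprocessing.sequence import pad_sequences

def to_indices(sentences, max_len):
    # 0 is reserved for padding and for words missing from labels_index
    seqs = [[labels_index.get(w, 0) for w in sent] for sent in sentences]
    return pad_sequences(seqs, maxlen=max_len, padding="post", value=0)

X_train = to_indices(train_sentences, max_len)  # feed this to input_words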

This is an impractical workaround, but it does the job for now. The key limitation is that you don't have access to the embeddings in the same way that you would using tfhub.

I tried the Lambda layer approach, but I ended up getting errors related to tensors, map_fn, etc.

bazzmx avatar Jan 30 '20 11:01 bazzmx

Was anyone able to do text classification in Keras successfully with sentence vectors?

DuyguA avatar Feb 15 '20 22:02 DuyguA