ELMoForManyLangs as Keras Layer
Hi!
Does anybody know how to use Embedder.sents2elmo() as a Layer for Keras?
I mean like using the one from tensorflow hub, see https://www.depends-on-the-definition.com/named-entity-recognition-with-residual-lstm-and-elmo/.
Thanks in advance!
You can create a Keras Sequence in which you apply the embedder to the input sequences, and then use model.fit_generator/predict_generator.
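Not from the thread, but a minimal sketch of that idea: a keras.utils.Sequence that calls sents2elmo per batch and pads the result. The names sentences, labels, max_len and the model path are placeholders.

import numpy as np
from keras.utils import Sequence
from elmoformanylangs import Embedder

e = Embedder('/path/to/your/elmo/model')  # placeholder path

class ElmoSequence(Sequence):
    def __init__(self, sentences, labels, batch_size=32, max_len=50):
        self.sentences = sentences      # list of tokenized sentences (lists of str)
        self.labels = labels            # numpy array of targets
        self.batch_size = batch_size
        self.max_len = max_len

    def __len__(self):
        return int(np.ceil(len(self.sentences) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.sentences[idx * self.batch_size:(idx + 1) * self.batch_size]
        y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        # sents2elmo returns one (len(sent), 1024) array per sentence
        embs = e.sents2elmo(batch)
        # pad/trim each sentence to max_len so the batch stacks into one tensor
        x = np.zeros((len(batch), self.max_len, 1024), dtype='float32')
        for i, emb in enumerate(embs):
            n = min(len(emb), self.max_len)
            x[i, :n] = emb[:n]
        return x, y

# model.fit_generator(ElmoSequence(train_sents, train_labels), epochs=3)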
Thanks for answering! I think we can use the Lambda layer or create a custom one. I don't know how to handle the tensors that Keras passes in and return a data structure that Keras accepts.
e = Embedder('...')
sess = tf.Session()

def ElmoEmbedding(x):
    with sess.as_default():
        return tf.convert_to_tensor(e.sents2elmo(x.eval())[0])  # this does not work

...

embedding = Lambda(ElmoEmbedding, output_shape=(None, max_len, 1024))(input_text)
How about hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)?
Yeah, but I would like to use non-English embeddings from ELMoForManyLangs and need to be able to retrain them with custom training data. tfhub is not an option, I think.
Hey @juckeltour, did you solve the problem?
I found a tutorial that comes close to a task I am trying to solve. Let me know if you found a solution.
maybe this will help you out :
http://hunterheidenreich.com/blog/elmo-word-vectors-in-keras/
Hi @juckeltour, have you found a way to use ELMoForManyLangs embeddings with Keras?
can anyone find any solution?
Hi guys, anyone found a solution yet?
No, I didn't solve this problem.
We switched to BERT (and pytorch)...
So, I have a workaround, but it is somewhat impractical. Obtain the vocabulary of your dataset, then create an embedding file similar to a word2vec or GloVe file (i.e. one word and its 1024-dim weights per line), and then implement a custom-weight embedding layer. It worked OK for Spanish.
@bazzmx do you have a code snippet?
This is based on this blog post
First, generate a list of the unique words in your vocabulary, obtain their corresponding ELMo embeddings with elmoformanylangs' sents2elmo, and save them to a file that contains one word and its 1024-dim weights per line (a rough sketch of writing this file follows the vocabulary code below). In this example that file is emb_table.txt, and the vocabulary is the set of lemmas and words that will be used during training and testing.
vocab = list(set(list(words_train) + list(lemmas_train)))  # unique words in your vocabulary
enumerated_vocab = enumerate(sorted(vocab), 1)             # indexed words, starting at 1
index_labels = {}  # index_no: word/label dict
for i in enumerated_vocab:
    index_labels[i[0]] = i[1]  # adds indexes and words to the dict
labels_index = {v: k for k, v in index_labels.items()}  # reversed dict = word/label: index
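The original post does not show how emb_table.txt is produced, but one way (an assumption on my part) is to treat each vocabulary word as a one-token sentence for sents2elmo, which by default returns a 1024-dim vector per token:

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/elmo/model')  # placeholder path

# write one "word w1 ... w1024" line per vocabulary entry
with open("./emb_table.txt", "w", encoding="utf8") as out:
    # sents2elmo takes a list of tokenized sentences; each single-word
    # "sentence" yields an array of shape (1, 1024)
    vectors = e.sents2elmo([[w] for w in sorted(vocab)])
    for word, vec in zip(sorted(vocab), vectors):
        out.write(word + " " + " ".join(str(x) for x in vec[0]) + "\n")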
With these dictionaries you can then create the index-to-embedding matrix. Since you are now working with indexes, remember to convert all your words (strings) to these indexes (int32).
import numpy as np

embeddings_index = {}
f = open("./emb_table.txt", encoding="utf8")  # opens the generated elmo embeddings file
for line in f:
    values = line.split()                            # splits each line
    word = values[0]                                 # the first value is the word entry
    coefs = np.asarray(values[1:], dtype='float32')  # the rest becomes the embedding array
    embeddings_index[word] = coefs                   # adds the entry to the dict
f.close()
Now we generate an embedding matrix putting all the pieces together:
EMBEDDING_DIM = 1024  # elmo's default size
embedding_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM))  # creates an n x 1024 matrix
for word, i in labels_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
With all these elements in place, you then add an input layer and an embedding layer to your model. The input is int32 because each word has to be converted to its index; the embedding layer then assigns the corresponding weights to each index:
from keras.layers import Input, Embedding

input_words = Input(shape=(max_len,), dtype="int32", name="input_words")
EMB = Embedding(len(vocab) + 1,
                EMBEDDING_DIM,
                weights=[embedding_matrix],
                input_length=max_len,
                trainable=False, name="embedding")
emb_text = EMB(input_words)
Your input words should be an array of indexes, which you can convert back to words using the dictionaries created above.
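For instance (my own illustration, not from the post), converting a tokenized sentence into a padded index array with labels_index might look like this; unknown words and padding map to index 0, which is the zero row of the embedding matrix:

import numpy as np

def words_to_indexes(sentence, labels_index, max_len):
    # map tokens to their vocabulary indexes; unknown words and padding get 0
    idxs = [labels_index.get(w, 0) for w in sentence[:max_len]]
    idxs += [0] * (max_len - len(idxs))
    return np.asarray(idxs, dtype='int32')

# X_train = np.stack([words_to_indexes(s, labels_index, max_len) for s in train_sentences])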
This is an impractical workaround, but it does the job for now. The key limitation is that you don't have access to the embeddings in the same way you would with tfhub.
I tried the Lambda layer approach, but I ended up getting errors related to tensors, map_fn, etc.
Has anyone been able to do text classification in Keras successfully with sentence vectors?