Error training a custom model
Hello, I am writing because I am trying to train a custom model for DeepMicrobes and keep getting the same error whenever I try to train the model on the TFRecord I have created.
The stack trace I get is very long, but I believe the key issue is this:
Traceback (most recent call last):
  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1378 in _do_call
    return fn(*args)
  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1361 in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1454 in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
InvalidArgumentError: indices[12,53] = 526337 is not in [0, 526337)
  [[{{node token_embedding/embedding_lookup}}]]
526337 happens to be exactly the size of the vocabulary file I am using, so the lookup is somehow receiving an out-of-bounds index. How could the embedding of a DNA read use a token ID that is not in the vocabulary?
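For reference, tf.nn.embedding_lookup only accepts ids in the half-open range [0, vocab_size); here is a toy sketch (made-up sizes, not the DeepMicrobes code) that reproduces the same kind of error on CPU (GPU behaviour may differ):

import tensorflow as tf

vocab_size, embedding_dim = 8, 4  # made-up sizes, just for illustration
table = tf.random.uniform([vocab_size, embedding_dim])

# Valid ids are 0 .. vocab_size - 1; an id equal to vocab_size is one past the end.
tf.nn.embedding_lookup(table, tf.constant([0, vocab_size - 1], dtype=tf.int64))  # fine
tf.nn.embedding_lookup(table, tf.constant([vocab_size], dtype=tf.int64))
# -> InvalidArgumentError: indices[0] = 8 is not in [0, 8)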
I have tried this both with the properly installed version of DeepMicrobes on TensorFlow 1.9 and with my own port of the code to TensorFlow 2. Both versions produce the same error; the only thing that changes between runs is the indices[xx, yy] location at which the lookup goes out of bounds.
Are there any reasons why this might be happening?
Hi, how did you set the dimensions of the embedding layer, and how many words are in your vocabulary file (including an entry for out-of-vocabulary words)?
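If it helps, a quick sanity check (the file name below is a placeholder for your own vocabulary file) is to count the entries and compare the result with the vocab size you pass to the embedding layer; the table needs one row for every possible id, including the out-of-vocabulary/<unk> id:

# Placeholder path: replace with your own vocabulary file.
with open("vocab.txt") as f:
    n_words = sum(1 for line in f if line.strip())

# If <unk> is not listed in the file, the embedding table needs n_words + 1 rows;
# every id produced by the tokenizer must be strictly smaller than that row count.
print("entries in vocabulary file:", n_words)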
Hello,
I fixed it by manually clipping the input tensor so that any value at or above the vocabulary size is remapped to 0 (the ID of the <unk> symbol).
All I did was change the embedding_layer function in custom_layers.py as follows:
def embedding_layer(inputs, vocab_size, embedding_dim, initializer):
    """Looks up embedding vectors for each k-mer."""
    # Clip the input ids: any id >= vocab_size is remapped to 0 (the <unk> id),
    # since embedding_lookup only accepts ids in [0, vocab_size).
    inputs = tf.where(
        tf.math.greater_equal(inputs, tf.constant([vocab_size], dtype=tf.int64)),
        tf.zeros_like(inputs),
        inputs)
    embedding_weights = tf.compat.v1.get_variable(
        name="token_embedding_weights",
        shape=[vocab_size, embedding_dim],
        initializer=initializer,
        trainable=True)
    return tf.compat.v1.nn.embedding_lookup(embedding_weights, inputs)
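An alternative sketch, in case the stray id comes from an uncounted <unk>/out-of-vocabulary entry, would be to grow the table by one row instead of folding those ids into 0. This is only a guess at the cause and slightly changes the model, so treat it as an experiment rather than a fix:

def embedding_layer(inputs, vocab_size, embedding_dim, initializer):
    """Looks up embedding vectors for each k-mer, reserving one extra row for OOV ids."""
    # vocab_size + 1 rows so that an id equal to vocab_size (one past the last
    # in-vocabulary token) still maps to a valid, trainable embedding row.
    embedding_weights = tf.compat.v1.get_variable(
        name="token_embedding_weights",
        shape=[vocab_size + 1, embedding_dim],
        initializer=initializer,
        trainable=True)
    return tf.compat.v1.nn.embedding_lookup(embedding_weights, inputs)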
Nice work! I have no idea why the ID could exceed the vocabulary size. Sorry about that.