code-intelligence
FailedPreconditionError op not initialized
From #70; I'm observing the following errors when running the inference model in pubsub workers.
The first couple of predictions succeed but then it starts failing.
This looks like a threading issue. The first successful predictions happen in one thread and the failed predictions happen in another thread. I logged the thread number to confirm this.
I'm not sure why we didn't observe this in the original code, or what's different about my code: https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/utils.py
Traceback (most recent call last):
  File "/py/label_microservice/worker.py", line 145, in callback
    predictions = self._predictor.predict(data)
  File "/py/label_microservice/issue_label_predictor.py", line 152, in predict
    model_name=data.get("model_name"))
  File "/py/label_microservice/issue_label_predictor.py", line 114, in predict_labels_for_issue
    model_name, data.get("title"), data.get("body"))
  File "/py/label_microservice/issue_label_predictor.py", line 74, in predict_labels_for_data
    predictions = model.predict_issue_labels(title, body)
  File "/py/label_microservice/combined_model.py", line 34, in predict_issue_labels
    latest = m.predict_issue_labels(title, text)
  File "/py/label_microservice/universal_kind_label_model.py", line 84, in predict_issue_labels
    probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 908, in predict
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 723, in predict
    callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable dense_5/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/dense_5/bias/N10tensorflow3VarE does not exist.
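For reference, confirming which thread handled each prediction (as mentioned above) only needs a log line along these lines; this is an illustrative sketch, not the actual worker code, and `predictor` is a stand-in:

```python
import logging
import threading

def predict_with_thread_log(predictor, data):
    # Log which thread is handling this prediction so that successes and
    # failures can be correlated with thread ids in the worker logs.
    logging.info("Handling prediction in thread %s (%s)",
                 threading.get_ident(), threading.current_thread().name)
    return predictor.predict(data)
```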
Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.89. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!
Links: app homepage, dashboard and code for this bot.
Ref: keras-team/keras#5640
It looks like doing the following might fix it:

with self._graph.as_default() as graph:
    with tf.Session(graph=graph) as sess:
        # Re-initialize all variables in this fresh session before predicting
        init = tf.global_variables_initializer()
        sess.run(init)
        probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.
I'm not convinced that actually worked; my suspicion is that the model is no longer loaded and we are using random weights.
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/bug | 0.89 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
Yeah looks like that wasn't loading the actual weights. As soon as I changed it to load the model on each predict call I started getting much better results.
As a hack just reload the model.
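Roughly, the reload-on-every-predict hack looks like this. This is a sketch only; the class name, `model_path`, and the vectorized inputs are placeholders, not the actual microservice code:

```python
from tensorflow.keras import models as keras_models

class ReloadingPredictor:
    """Hack: reload the Keras model on every predict call so the weights
    are always bound to the graph/session of the thread doing the predict."""

    def __init__(self, model_path):
        self._model_path = model_path

    def predict_issue_labels(self, vec_title, vec_body):
        # Expensive, but sidesteps the cross-thread FailedPreconditionError.
        model = keras_models.load_model(self._model_path)
        return model.predict(x=[vec_body, vec_title]).tolist()[0]
```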
I encountered this genre of issue when building Issue Label Bot for the first time; feel free to take a look at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py in case there is a recipe there that might help.
Thanks @hamelsmu. I had looked at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py and couldn't figure out what it was doing differently such that multi-threading doesn't seem to be an issue.
@jlewi I think I'm lost with some of the code changes. Can you point me to the flask app code that is serving the Label Microservice? I can't seem to find it anywhere in master.
Here is an Architecture Diagram
There are basically two pieces:

- The front end (the flask app)
  - This publishes items to pubsub for certain repositories (i.e. repositories with their own model)
  - It performs inference for the remaining repositories
- The backend worker microservice
  - This reads items from pubsub and does inference; a sketch of this flow is below
  - The code is in kubeflow/code-intelligence py/label_microservice
  - This is the code where we are doing inference and hitting the threading issue
  - Here's a link to the specific line https://github.com/kubeflow/code-intelligence/blob/d9c1633a4c098a747a85be00ed9fee1a5cffa605/py/label_microservice/universal_kind_label_model.py#L88
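For context on the threading behaviour: the Cloud Pub/Sub subscriber client dispatches message callbacks on a thread pool, so inference can run on a different thread from the one that loaded the model. A stripped-down sketch of that shape (the project, subscription name, and callback body are placeholders, not the actual worker code):

```python
from google.cloud import pubsub_v1

def callback(message):
    # Each callback runs on a worker thread from the subscriber's thread
    # pool, not on the thread that originally loaded the Keras model.
    data = message.data  # the issue payload published by the front end
    # ... run inference on `data` here ...
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()  # block the main thread while callbacks run in the pool
```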
@jlewi I have an idea how to fix this (I would test it myself, but not sure how to test the microservice):
# import set_session
from tensorflow.compat.v1.keras.backend import set_session

# When you initialize the model
self.session = tf.Session(graph=tf.Graph())
with self.session.graph.as_default():
    set_session(self.session)
    self.model = keras_models.load_model(model_path)

# When you make the prediction
with self.session.graph.as_default():
    set_session(self.session)
    self.model.predict(...)
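A quick way to sanity-check a pattern like this outside the microservice would be to hammer it from several threads; this is an illustrative sketch where `predictor` and `predict_issue_labels` are stand-ins for the real model class:

```python
from concurrent.futures import ThreadPoolExecutor

def hammer_predictor(predictor, n_calls=20, n_threads=4):
    # Fire many predictions concurrently; without the session/graph handling,
    # calls landing on a thread other than the loader thread raise
    # FailedPreconditionError.
    def one_call(i):
        return predictor.predict_issue_labels(f"test title {i}", f"test body {i}")

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(one_call, range(n_calls)))
```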
Oh, and sorry for making you repeat the documentation; I should have just looked there instead 🤦‍♂️ My apologies.
Thanks @hamelsmu. If you wanted to try this out, my suggestion would be to follow the developer guide: https://github.com/kubeflow/code-intelligence/blob/master/Label_Microservice/developer_guide.md
That should explain how to:
- Use the dev instance of the deployment
- Use skaffold to quickly sync your locally modified code to code running on the cluster
- Publish an issue to pubsub to trigger predictions (a sketch of this is included after this list)
  - I found that you usually want to submit a bunch of issues in rapid succession to trigger the issue
  - The logs should print out the id of the thread inference occurred in, so you can confirm that predictions were handled in different threads
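Hypothetically, publishing a test issue might look roughly like this with the Pub/Sub publisher client; the project, topic, and payload fields here are placeholders, so check the developer guide for the real names and schema:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

# Payload fields are illustrative; use whatever schema the developer guide documents.
payload = {"repo_owner": "kubeflow", "repo_name": "code-intelligence", "issue_num": 70}
future = publisher.publish(topic_path, json.dumps(payload).encode("utf-8"))
print("Published message id:", future.result())
```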
ok I will put this on my backlog