
FailedPreconditionError op not initialized

Open jlewi opened this issue 5 years ago • 15 comments

From #70; I'm observing the following errors when running the inference model in pubsub workers.

The first couple of predictions succeed but then it starts failing.

This looks like a threading issue. The first successful predictions happen in one thread and the failed predictions happen in another thread. I logged the thread number to confirm this.
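That observation can be confirmed with a one-line logging shim (a hypothetical sketch; the real callback lives in py/label_microservice/worker.py):

```python
import threading

def log_thread(prefix):
    # Hypothetical shim: log which thread is running a prediction so you
    # can confirm that the first (successful) and later (failing)
    # predictions are handled on different threads.
    tid = threading.get_ident()
    print(f"{prefix}: running in thread {tid}")
    return tid
```

Calling log_thread(...) at the top of the pubsub callback and again inside predict_issue_labels shows whether the model is being used from a thread other than the one it was loaded in.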

I'm not sure why we didn't observe this in the original code, or what's different about my code: https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/utils.py

   Traceback (most recent call last):
    File "/py/label_microservice/worker.py", line 145, in callback
      predictions = self._predictor.predict(data)
    File "/py/label_microservice/issue_label_predictor.py", line 152, in predict
      model_name=data.get("model_name"))
    File "/py/label_microservice/issue_label_predictor.py", line 114, in predict_labels_for_issue
      model_name, data.get("title"), data.get("body"))
    File "/py/label_microservice/issue_label_predictor.py", line 74, in predict_labels_for_data
      predictions = model.predict_issue_labels(title, body)
    File "/py/label_microservice/combined_model.py", line 34, in predict_issue_labels
      latest = m.predict_issue_labels(title, text)
    File "/py/label_microservice/universal_kind_label_model.py", line 84, in predict_issue_labels
      probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 908, in predict
      use_multiprocessing=use_multiprocessing)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 723, in predict
      callbacks=callbacks)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
      batch_outs = f(ins_batch)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
      run_metadata=self.run_metadata)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
      run_metadata_ptr)
    tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable dense_5/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/dense_5/bias/N10tensorflow3VarE does not exist.

jlewi avatar Jan 03 '20 19:01 jlewi

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.89. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] avatar Jan 03 '20 19:01 issue-label-bot[bot]

Ref: keras-team/keras#5640

jlewi avatar Jan 03 '20 19:01 jlewi

It looks like doing the following might fix it:

  with self._graph.as_default() as graph:
      with tf.Session(graph=graph) as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]

jlewi avatar Jan 03 '20 20:01 jlewi

Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.

kf-label-bot-dev[bot] avatar Jan 03 '20 20:01 kf-label-bot-dev[bot]

I'm not convinced that actually worked; my suspicion is that tf.global_variables_initializer() is wiping the loaded weights and we are predicting with randomly initialized ones.

jlewi avatar Jan 03 '20 20:01 jlewi

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.89

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

kf-label-bot-dev[bot] avatar Jan 03 '20 20:01 kf-label-bot-dev[bot]

Yeah, it looks like that wasn't loading the actual weights. As soon as I changed it to load the model on each predict call, I started getting much better results.

As a hack, just reload the model on every prediction.
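A minimal sketch of that hack, with the loader injected so the idea is visible without TensorFlow installed (load_model_fn stands in for keras_models.load_model; all names here are hypothetical):

```python
import threading

class ReloadingPredictor:
    """Workaround sketch: reload the model on every predict call so the
    weights live in the calling thread's graph/session. Slow, but it
    sidesteps the FailedPreconditionError from cross-thread access."""

    def __init__(self, load_model_fn, model_path):
        self._load = load_model_fn   # stand-in for keras_models.load_model
        self._path = model_path
        self._lock = threading.Lock()

    def predict(self, x):
        with self._lock:             # serialize the expensive reload
            model = self._load(self._path)
            return model.predict(x)
```

The lock keeps concurrent callbacks from reloading the model simultaneously; the cost is one full model load per prediction.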

jlewi avatar Jan 03 '20 20:01 jlewi

I encountered this genre of issues when building Issue Label Bot for the first time; feel free to take a look at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py in case there is a recipe there that might help.

hamelsmu avatar Jan 04 '20 00:01 hamelsmu

Thanks @hamelsmu I had looked at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py and couldn't figure out what it was doing differently that multi-threading doesn't seem to be an issue.

jlewi avatar Jan 05 '20 19:01 jlewi

@jlewi I think I'm lost with some of the code changes. Can you point me to the flask app code that is serving the Label Microservice? I can't seem to find it anywhere on master.

hamelsmu avatar Jan 06 '20 21:01 hamelsmu

Here is an Architecture Diagram

There are basically two pieces:

  • The front-end flask app

    • This publishes items to pubsub for certain repositories (i.e. repositories with their own model)
    • It performs inference for the remaining repositories
  • The backend worker microservice

    • This reads items from pubsub and does inference
    • The code is in kubeflow/code-intelligence py/label_microservice
    • This is the code where we are doing inference and hitting the threading issue
    • Here's a link to the specific line https://github.com/kubeflow/code-intelligence/blob/d9c1633a4c098a747a85be00ed9fee1a5cffa605/py/label_microservice/universal_kind_label_model.py#L88
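The backend worker described above can be sketched as a queue-draining loop (a stdlib queue standing in for Cloud Pub/Sub; names are hypothetical stand-ins, not the service's actual API):

```python
import queue

def run_worker(predict_fn, work_q):
    """Minimal sketch of the backend worker loop: pull items (issues)
    off a queue and run inference on each. In the real service the
    queue is Cloud Pub/Sub and predict_fn is the predictor's predict
    method; both names here are stand-ins."""
    results = []
    while True:
        item = work_q.get()
        if item is None:  # sentinel: shut down the worker
            break
        results.append(predict_fn(item))
    return results

# Usage: enqueue a couple of "issues", then drain the queue.
q = queue.Queue()
for issue in ({"title": "crash on start"}, {"title": "docs typo"}):
    q.put(issue)
q.put(None)
out = run_worker(lambda d: ("kind/bug", d["title"]), q)
print(out)
```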

jlewi avatar Jan 06 '20 21:01 jlewi

@jlewi I have an idea of how to fix this (I would test it myself, but I'm not sure how to test the microservice):

# Imports: set_session is a function, so it must be imported with
# "from ... import", not "import ... as"; tf is the v1 compat API
from tensorflow.compat.v1.keras.backend import set_session
import tensorflow.compat.v1 as tf
from tensorflow.compat.v1.keras import models as keras_models

# When you initialize the model: give it a dedicated graph and session
self.session = tf.Session(graph=tf.Graph())
with self.session.graph.as_default():
    set_session(self.session)
    self.model = keras_models.load_model(model_path)

# When you make the prediction: re-enter the same graph and session,
# so the variables are found no matter which thread is calling
with self.session.graph.as_default():
    set_session(self.session)
    self.model.predict(...)

hamelsmu avatar Jan 06 '20 22:01 hamelsmu

Oh, and sorry for making you repeat the documentation; I should have just looked there instead 🤦‍♂ my apologies.

hamelsmu avatar Jan 06 '20 22:01 hamelsmu

Thanks @hamelsmu. If you want to try this out, my suggestion would be to follow the developer guide: https://github.com/kubeflow/code-intelligence/blob/master/Label_Microservice/developer_guide.md

That should explain how to

  • Use the dev instance of the deployment
  • Use skaffold to quickly sync your locally modified code to code running on the cluster
  • Publish an issue to pubsub to trigger predictions
    • I found that you usually need to submit a bunch of issues in rapid succession to trigger the bug
    • The logs should print out the id of the thread inference occurred in so you can confirm that predictions were handled in different threads.
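The "submit a bunch of issues in rapid succession" step can be approximated locally with a thread pool, which also records the thread id each prediction ran on (a sketch with a stand-in predict function, not the pubsub publisher):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def predict(issue_number):
    # Stand-in for the worker's predict path; records the handling thread.
    return issue_number, threading.get_ident()

def fire_predictions(n, workers=4):
    # Fire n "issues" in rapid succession so predictions land on
    # different threads, reproducing the cross-thread access pattern.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(predict, range(n)))

results = fire_predictions(8)
print(f"predictions ran on {len({tid for _, tid in results})} thread(s)")
```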

jlewi avatar Jan 07 '20 00:01 jlewi

OK, I will put this on my backlog.

hamelsmu avatar Jan 07 '20 00:01 hamelsmu