code-intelligence
FailedPreconditionError op not initialized
From #70; I'm observing the following errors when running the inference model in pubsub workers.
The first couple of predictions succeed but then it starts failing.
This looks like a threading issue. The first successful predictions happen in one thread and the failed predictions happen in another thread. I logged the thread number to confirm this.
I'm not sure why we didn't observe this in the original code, or what's different about my code: https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/utils.py
Traceback (most recent call last):
  File "/py/label_microservice/worker.py", line 145, in callback
    predictions = self._predictor.predict(data)
  File "/py/label_microservice/issue_label_predictor.py", line 152, in predict
    model_name=data.get("model_name"))
  File "/py/label_microservice/issue_label_predictor.py", line 114, in predict_labels_for_issue
    model_name, data.get("title"), data.get("body"))
  File "/py/label_microservice/issue_label_predictor.py", line 74, in predict_labels_for_data
    predictions = model.predict_issue_labels(title, body)
  File "/py/label_microservice/combined_model.py", line 34, in predict_issue_labels
    latest = m.predict_issue_labels(title, text)
  File "/py/label_microservice/universal_kind_label_model.py", line 84, in predict_issue_labels
    probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 908, in predict
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 723, in predict
    callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable dense_5/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/dense_5/bias/N10tensorflow3VarE does not exist.
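For reference, confirming which thread handled each prediction (as mentioned above) only needs a log line along these lines; this is an illustrative sketch, not the actual worker code, and `predictor` is a stand-in:

```python
import logging
import threading

def predict_with_thread_log(predictor, data):
    # Log which thread is handling this prediction so that successes and
    # failures can be correlated with thread ids in the worker logs.
    logging.info("Handling prediction in thread %s (%s)",
                 threading.get_ident(), threading.current_thread().name)
    return predictor.predict(data)
```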
Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.89. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!
Links: app homepage, dashboard and code for this bot.
Ref: keras-team/keras#5640
It looks like doing the following might fix it:

with self._graph.as_default() as graph:
    with tf.Session(graph=graph) as sess:
        # Re-initialize all variables in this fresh session before predicting
        init = tf.global_variables_initializer()
        sess.run(init)
        probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.
I'm not convinced that actually worked; my suspicion is that the model is no longer loaded and we are using random weights.
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/bug | 0.89 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
Yeah looks like that wasn't loading the actual weights. As soon as I changed it to load the model on each predict call I started getting much better results.
As a hack just reload the model.
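Roughly, the reload-on-every-predict hack looks like this. This is a sketch only; the class name, `model_path`, and the vectorized inputs are placeholders, not the actual microservice code:

```python
from tensorflow.keras import models as keras_models

class ReloadingPredictor:
    """Hack: reload the Keras model on every predict call so the weights
    are always bound to the graph/session of the thread doing the predict."""

    def __init__(self, model_path):
        self._model_path = model_path

    def predict_issue_labels(self, vec_title, vec_body):
        # Expensive, but sidesteps the cross-thread FailedPreconditionError.
        model = keras_models.load_model(self._model_path)
        return model.predict(x=[vec_body, vec_title]).tolist()[0]
```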
I encountered this genre of issue when building Issue Label Bot for the first time; feel free to take a look at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py in case there is a recipe there that might help.
Thanks @hamelsmu. I had looked at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py and couldn't figure out what it was doing differently such that multi-threading doesn't seem to be an issue.
@jlewi I think I'm lost with some of the code changes. Can you point me to the flask app code that is serving the Label Microservice? I can't seem to find it anywhere in master.
Here is an Architecture Diagram
There are basically two pieces:

- The front end (the flask app)
  - This publishes items to pubsub for certain repositories (i.e. repositories with their own model)
  - It performs inference for the remaining repositories
- The backend worker microservice
  - This reads items from pubsub and does inference; a sketch of this flow is below
  - The code is in kubeflow/code-intelligence py/label_microservice
  - This is the code where we are doing inference and hitting the threading issue
  - Here's a link to the specific line https://github.com/kubeflow/code-intelligence/blob/d9c1633a4c098a747a85be00ed9fee1a5cffa605/py/label_microservice/universal_kind_label_model.py#L88
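For context on the threading behaviour: the Cloud Pub/Sub subscriber client dispatches message callbacks on a thread pool, so inference can run on a different thread from the one that loaded the model. A stripped-down sketch of that shape (the project, subscription name, and callback body are placeholders, not the actual worker code):

```python
from google.cloud import pubsub_v1

def callback(message):
    # Each callback runs on a worker thread from the subscriber's thread
    # pool, not on the thread that originally loaded the Keras model.
    data = message.data  # the issue payload published by the front end
    # ... run inference on `data` here ...
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()  # block the main thread while callbacks run in the pool
```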
@jlewi I have an idea how to fix this (I would test it myself, but not sure how to test the microservice):
# import set_session
from tensorflow.compat.v1.keras.backend import set_session

# When you initialize the model
self.session = tf.Session(graph=tf.Graph())
with self.session.graph.as_default():
    set_session(self.session)
    self.model = keras_models.load_model(model_path)

# When you make the prediction
with self.session.graph.as_default():
    set_session(self.session)
    self.model.predict(...)
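A quick way to sanity-check a pattern like this outside the microservice would be to hammer it from several threads; this is an illustrative sketch where `predictor` and `predict_issue_labels` are stand-ins for the real model class:

```python
from concurrent.futures import ThreadPoolExecutor

def hammer_predictor(predictor, n_calls=20, n_threads=4):
    # Fire many predictions concurrently; without the session/graph handling,
    # calls landing on a thread other than the loader thread raise
    # FailedPreconditionError.
    def one_call(i):
        return predictor.predict_issue_labels(f"test title {i}", f"test body {i}")

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(one_call, range(n_calls)))
```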
Oh, and sorry for making you repeat the documentation; I should have just looked there instead 🤦‍♂️ My apologies.
Thanks @hamelsmu. If you wanted to try this out, my suggestion would be to follow the developer guide: https://github.com/kubeflow/code-intelligence/blob/master/Label_Microservice/developer_guide.md
That should explain how to:
- Use the dev instance of the deployment
- Use skaffold to quickly sync your locally modified code to code running on the cluster
- Publish an issue to pubsub to trigger predictions (a sketch of this is included after this list)
  - I found that you usually want to submit a bunch of issues in rapid succession to trigger the issue
  - The logs should print out the id of the thread inference occurred in, so you can confirm that predictions were handled in different threads
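Hypothetically, publishing a test issue might look roughly like this with the Pub/Sub publisher client; the project, topic, and payload fields here are placeholders, so check the developer guide for the real names and schema:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

# Payload fields are illustrative; use whatever schema the developer guide documents.
payload = {"repo_owner": "kubeflow", "repo_name": "code-intelligence", "issue_num": 70}
future = publisher.publish(topic_path, json.dumps(payload).encode("utf-8"))
print("Published message id:", future.result())
```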
ok I will put this on my backlog