
Issues with tutorials/automl-tables-model-export

Open raerevvo opened this issue 4 years ago • 18 comments

Hi!

Thank you for creating the tutorial "Export a custom AutoML Tables model and serve it with Cloud Run"; it has been super helpful! However, I am receiving this error when attempting to retrieve predictions from Cloud Run: {"error": "failed to connect to all addresses"}.

Steps taken (followed the tutorial from the beginning, up to the Docker upload):

  1. Built the docker image
  2. Verified that the path in the Dockerfile is correct
  3. Deployed this docker image locally and verified we are able to obtain predictions from our model
  4. Deployed the same docker image to a Cloud Run service, which results in an error when attempting to curl it (the request is sketched below):
     a. We are providing an Authorization header with a valid token
     b. We checked that my user's IAM permissions grant admin access to Cloud Run
     c. We disabled authentication on the Cloud Run service (just in case)
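
For reference, the authenticated request was roughly of the following form (a sketch only: the service URL and request file are placeholders, and the exact way the token is obtained may differ):

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  --data @request.json \
  https://[YOUR_SERVICE].a.run.app/predict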

Extra Notes:

  1. When running the docker image locally, we obtained predictions in 15-30 s; however, the Cloud Run deployment returns a 200 response within 7 seconds.
  2. We executed the curl request within a VM on Google compute in addition to running the curl command locally; both result in the same response from Cloud Run.
  3. The Cloud Run logs are as follows:
Default: 2020-12-22 17:50:30.387 PST  2020-12-23 01:50:30.387078: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: default version: 1}
Default: 2020-12-22 17:50:31.686 PST  2020-12-23 01:50:31.486896: I tensorflow_serving/model_servers/server.cc:353] Running gRPC ModelServer at 0.0.0.0:8500 ...
Default: 2020-12-22 17:50:32.502 PST  2020-12-23 01:50:32.486720: I tensorflow_serving/model_servers/server.cc:373] Exporting HTTP/REST API at:localhost:8501 ...
Default: 2020-12-22 17:50:32.586 PST  [evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
Default: 2020-12-22 17:52:21.662 PST  INFO:tornado.access:200 POST /predict (XXX.XXX.XXX.XXX) 8.73ms
Info: 2020-12-22 17:52:21.662 PST  POST 200 446 B 13 ms curl/7.64.1  https://xxxx.a.run.app/predict
Default: 2020-12-22 17:53:40.716 PST  INFO:tornado.access:200 POST /predict (XXX.XXX.XXX.XXX) 8.87ms

The Cloud Run logs seem to indicate that the query was successfully served via the 200 message, but the curl command shows {"error": "failed to connect to all addresses"}.

The Dockerfile is using this base image: gcr.io/cloud-automl-tables-public/model_server
My local docker version is: 20.10.0, build 7287ab3
Cloud Run settings: 4 GB memory, 2 cores

Any help or pointers on how to debug this issue would be appreciated.

raerevvo avatar Dec 23 '20 02:12 raerevvo

@amygdala , could you have a look at this report about a document that you submitted?

ToddKopriva avatar Dec 23 '20 17:12 ToddKopriva

Yes, will do (there may be a few days’ delay due to the holidays). I wonder if something has changed about the Cloud Run auth config. @raerevvo , did you happen to test whether you get this same issue with any other Cloud Run deployments?

amygdala avatar Dec 23 '20 22:12 amygdala

Apologies for the delay with the holidays. Yes @amygdala, I have tested a hello world Cloud Run deployment, as outlined in this guide. I was able to deploy it and execute a curl GET request, and received the correct response. Thanks!
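
(For reference, that sanity check was roughly of this form; the service name, image, and region below are placeholders rather than the exact commands from the guide:)

gcloud run deploy hello-test \
  --image gcr.io/[YOUR_PROJECT]/hello \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
curl https://hello-test-[HASH]-uc.a.run.app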

raerevvo avatar Dec 30 '20 22:12 raerevvo

@raerevvo -- To keep you posted: the gcr.io/cloud-automl-tables-public/model_server image has changed since I wrote that tutorial, and the issues are related in some way to how the new image works.

More specifically: with a local run of the new container image, I see log messages like the following, which did not occur with the older model_server image:

INFO:root:connectivity went from ChannelConnectivity.IDLE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE

then a bit later: INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.READY

However, when deployed to Cloud Run, the logs show the first two messages repeated over and over (connecting, then transient failure) but it never reaches the READY state.

INFO:root:connectivity went from ChannelConnectivity.IDLE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE

… etc

And, because that READY state is never reached, prediction requests sent to the Cloud Run instance fail with the error you saw. Something about that 'connection' process (which I'm guessing is grpc-related) is not working in the Cloud Run environment even though it works locally. I've pinged the person who creates these container images to find out more, so I hopefully will be able to get back to you with something more constructive soon. I'm wondering if some additional network config is required.

(I double checked that a build using an older model server image, which does not log the messages above, still works on Cloud Run.)

amygdala avatar Jan 05 '21 23:01 amygdala

Btw -- in the process I noticed that there is a typo in the "Build a container to use for Cloud Run" section -- for the Dockerfile, ADD model-export/tbl/[YOUR_RENAMED_DIRECTORY]/models/default/0000001 should be ADD model-export/tbl/[YOUR_RENAMED_DIRECTORY] /models/default/0000001 (note the space separating the source directory from the /models/default/0000001 destination).
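
With that fix, a minimal Dockerfile for the tutorial would look roughly like this (a sketch; the directory name is a placeholder, and your Dockerfile may contain additional lines):

FROM gcr.io/cloud-automl-tables-public/model_server
# Copy the renamed export directory to the path TF Serving expects: model "default", version 0000001.
ADD model-export/tbl/[YOUR_RENAMED_DIRECTORY] /models/default/0000001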

It sounds like you figured that out though. I'll put in a PR to fix.

amygdala avatar Jan 05 '21 23:01 amygdala

Thanks, @amygdala .

ToddKopriva avatar Jan 05 '21 23:01 ToddKopriva

I've been having this issue for the last half year. It looks like the transient-error loop eventually stabilizes and the model starts serving, but the response is still the error mentioned by @raerevvo.

Thanks for bringing this up

dinigo avatar Jan 07 '21 14:01 dinigo

Thanks for the additional repro. Interesting, you are right -- after about 10 mins the loop does seem to stabilize, though not persistently, as another prediction request a few mins after the successful one gave the error again. We have an internal bug filed to look into this. (It won't work to use the older base image for current models as there have been some changes).

cc @helinwang as FYI

amygdala avatar Jan 07 '21 19:01 amygdala

Interestingly, when I deployed a Cloud Run revision configured to always have 1 instance running, let the deployment sit for 30 mins, and then tried it, prediction seems to work reliably (so far). The min-instances config can be set via a gcloud SDK beta feature, e.g.: gcloud beta run deploy --min-instances 1 ...

So a temporary workaround may be a combination of 1) letting the deployment sit for 'a while' before using it, and 2) configuring the deployment so that at least one instance is always running.
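
A fuller version of that deploy command might look roughly like this (a sketch; the service name, image, and region are placeholders, and the memory/CPU values match the Cloud Run settings mentioned earlier in the thread):

gcloud beta run deploy automl-model-server \
  --image gcr.io/[YOUR_PROJECT]/automl-model-server \
  --platform managed \
  --region us-central1 \
  --memory 4Gi \
  --cpu 2 \
  --min-instances 1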

amygdala avatar Jan 07 '21 21:01 amygdala

I was actually using this beta feature too, but it failed to serve results. In my case it returns the same "failed to connect to all addresses" JSON error response.

I will try again in case they have updated something in the image. But the image itself is kind of obscure: they provide neither a changelog nor an explanation of why you can't load this model as you normally would with any other TF model.

dinigo avatar Jan 09 '21 19:01 dinigo

@amygdala / @dinigo / @ToddKopriva
Hey GCloud guys, any update on this error? I am having exactly the same issue when running the exported AutoML model locally in docker.

So I bring up the docker container and try curl, which returns "failed to connect to all addresses":

[ec2-user@ip-172-31-21-172 AutoML]$ curl -X POST --data @request.json http://localhost:8080/predict
{"error": "failed to connect to all addresses"}

In the docker log I got this; it seems the request reached the server and got a 200, but the curl still failed:

INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE
INFO:tornado.access:200 POST /predict (172.17.0.1) 6.35ms
INFO:tornado.access:200 POST /predict (172.17.0.1) 1.29ms
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE

I start docker with sudo docker run -v `pwd`/model_1/predict/001:/models/default/0000001 -p 8080:8080 -it gcr.io/cloud-automl-tables-public/model_server

jackyzhujiale avatar Feb 04 '22 16:02 jackyzhujiale

btw I'm trying this with a Vertex AI regression model, using the container export. The model works successfully in Cloud console batch prediction. The issue only happens locally, and it won't recover to the READY state on its own even after waiting for hours.

jackyzhujiale avatar Feb 04 '22 16:02 jackyzhujiale

Here's the docker starting log:

[ec2-user@ip-172-31-21-172 AutoML]$ sudo docker run -v `pwd`/model_1/predict/001:/models/default/0000001 -p 8080:8080 -it gcr.io/cloud-automl-tables-public/model_server 
INFO:root:running model server
2022-02-04 16:40:21.089343: I tensorflow_serving/model_servers/server.cc:85] Building single TensorFlow model file config:  model_name: default model_base_path: /models/default
2022-02-04 16:40:21.090224: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2022-02-04 16:40:21.090315: I tensorflow_serving/model_servers/server_core.cc:573]  (Re-)adding model: default
2022-02-04 16:40:21.190633: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: default version: 1}
2022-02-04 16:40:21.191179: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: default version: 1}
2022-02-04 16:40:21.191795: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: default version: 1}
2022-02-04 16:40:21.193187: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/default/0000001
2022-02-04 16:40:21.206710: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2022-02-04 16:40:21.225026: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2022-02-04 16:40:21.256630: E external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1575] OpKernel ('op: "DecodeProtoSparse" device_type: "CPU"') for unknown op: DecodeProtoSparse
2022-02-04 16:40:21.302428: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2022-02-04 16:40:21.504970: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /models/default/0000001
2022-02-04 16:40:21.582055: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 389461 microseconds.
2022-02-04 16:40:21.590440: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:117] Starting to read warmup data for model at /models/default/0000001/assets.extra/tf_serving_warmup_requests with model-warmup-options 
2022-02-04 16:40:22.190581: F external/org_tensorflow/tensorflow/core/framework/tensor_shape.cc:44] Check failed: NDIMS == dims() (1 vs. 2)Asking for tensor of 1 dimensions from a tensor of 2 dimensions
Aborted (core dumped)
INFO:root:connecting to TF serving at localhost:8500
INFO:root:connectivity went from None to ChannelConnectivity.IDLE
INFO:root:connectivity went from ChannelConnectivity.IDLE to ChannelConnectivity.TRANSIENT_FAILURE
INFO:root:server listening on port 8080
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE
INFO:root:connectivity went from ChannelConnectivity.TRANSIENT_FAILURE to ChannelConnectivity.CONNECTING
INFO:root:connectivity went from ChannelConnectivity.CONNECTING to ChannelConnectivity.TRANSIENT_FAILURE

jackyzhujiale avatar Feb 04 '22 16:02 jackyzhujiale

If there is an environment.json file in the model artifact, can you use the docker image URI in the environment.json file instead of gcr.io/cloud-automl-tables-public/model_server? That should fix the problem.
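
A rough sketch of that workaround (the location of environment.json within your export and the exact image URI will differ; the tag shown here is one reported later in this thread):

# Find the serving image URI recorded with the exported model (the container_uri field).
cat [YOUR_EXPORT_DIR]/environment.json

# Run the model with that image instead of gcr.io/cloud-automl-tables-public/model_server.
sudo docker run -v `pwd`/model_1/predict/001:/models/default/0000001 -p 8080:8080 -it \
  us-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server:20220616_1125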

helinwang avatar Feb 04 '22 17:02 helinwang

That works!

I think this workaround should be added to https://cloud.google.com/automl-tables/docs/model-export#export. It cost me half a day to finally reach this GitHub issue page.

jackyzhujiale avatar Feb 04 '22 17:02 jackyzhujiale

Yes, absolutely, sorry about the inconvenience. I will be working on it. Btw, I think you are using Vertex, https://cloud.google.com/vertex-ai/docs/export/export-model-tabular is the right documentation page :)

helinwang avatar Feb 04 '22 17:02 helinwang

Oh I see, the document for Vertex is correct. It's easy to get confused given how similar automl-tables and vertex-ai are.

jackyzhujiale avatar Feb 04 '22 17:02 jackyzhujiale

I followed the instructions at https://cloud.google.com/vertex-ai/docs/export/export-model-tabular and I ran into the same error as https://github.com/GoogleCloudPlatform/community/issues/1556#issuecomment-1030157037. I was using the Docker container image us-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server-v1.

However, I was able to use the container_uri (us-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server:20220616_1125) specified in environment.json and deploy my AutoML classification model in my local environment.

There is clearly a mismatch between the documentation and the edge deployment experience. For the record, it took me 2 hours of research until I found this GitHub issue, which offered the workaround.

kawofong avatar Jul 06 '22 19:07 kawofong

This tutorial has been archived: https://github.com/GoogleCloudPlatform/community/pull/2269

ToddKopriva avatar Nov 07 '22 23:11 ToddKopriva