cloud-sql-proxy icon indicating copy to clipboard operation
cloud-sql-proxy copied to clipboard

Wait for network before starting up

Open bmears-aaxis opened this issue 4 years ago • 17 comments

Question

We seem to be having an issue with the SQL proxy failing to initialize on the first try due to a timing issue with it and the istio sidecar deployed by ASM. Is there a way to delay the SQL proxy from trying to make a connection to cloud SQL so that the istio sidecar can establish itself? Our backup plan is to disable the port for Istio egress for cloud SQL.

bmears-aaxis avatar Sep 28 '21 03:09 bmears-aaxis

When you say failing to initialize, do you mean the initial call to collect instance metadata fails, or are you talking about the first connection attempt failing?

Assuming you're seeing the initial call to collect instance metadata failing, I don't see a great way to delay initialization. I can think of a few things to try though (in order of preference):

  1. If you start the proxy with -fuse using a Unix socket (e.g., cloud_sql_proxy -fuse -dir /tmp/cloudsql), the proxy won't do any initial setup until an application attempts to connect to it.
  2. You could try compiling a custom version of the proxy with a sleep just before this line: https://github.com/GoogleCloudPlatform/cloudsql-proxy/blob/22080999a649eb110513523b889ded2596313970/cmd/cloud_sql_proxy/cloud_sql_proxy.go#L526 Granted that's a hack, but if that works for you, we could discuss a feature to wait on some external event before proceeding with initialization.

enocom avatar Sep 28 '21 15:09 enocom

"When you say failing to initialize, do you mean the initial call to collect instance metadata fails, or are you talking about the first connection attempt failing?" I'm not sure about the specific details. Whatever the initial egress call from the proxy container is is failing. At this point the container is toast and has to re-initialize. We were considering #2 as another option, but would prefer not to have to manage our own container. Let me try your other suggestion. Thanks.

bmears-aaxis avatar Sep 28 '21 16:09 bmears-aaxis

Just a note, since these are both google authorized features, I assume this will come up again with other clients. Maybe a discussion with the ASM team to come up with a solid approach to handle this condition is in order?

bmears-aaxis avatar Sep 28 '21 16:09 bmears-aaxis

Yes. In any case, I suspect there's a general need to sometimes delay the proxy at startup.

enocom avatar Sep 28 '21 16:09 enocom

Sounds good. Thanks again.

bmears-aaxis avatar Sep 28 '21 16:09 bmears-aaxis

Here's additional details:

*url.Error Get "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/scopes": dial tcp 169.254.169.254:80: connect: connection refused | Get "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/scopes": dial tcp 169.254.169.254:80: connect: connection refused

Trying to get SA details for the account used by the proxy. Obviously, adding a port override for istio on 80 would not be a good idea! :)

bmears-aaxis avatar Sep 28 '21 21:09 bmears-aaxis

So just to clarify the issue, you're deploying the proxy as a sidecar along with an Istio sidecar. Before the Istio sidecar starts up, the proxy will fail to run because it can't send network calls. The proxy repeatedly crashes, the network comes up, and then the proxy comes up successfully. Is that what you're seeing?

enocom avatar Sep 30 '21 15:09 enocom

Exactly. We were able to get around the issue with IP filtering on the istio proxy, disabling the metadata service call and the subsequent call to the cloud SQL API. The problem is the SQL API does not appear to have a static IP, so there is no way to provide istio with a filtering range. Essentially, these 2 sidecars do not play well together. I'm not sure if there is a way to check for a mesh proxy and ping it's readiness probe before starting any activity with the SQL proxy? It might make more sense than adding a delay. BTW, I tried using the fuse flag to see if that would squash the initial SQL API call and it hosed the container. I couldn't even get to the logs to see what was going on.

bmears-aaxis avatar Sep 30 '21 15:09 bmears-aaxis

Screen Shot 2021-09-30 at 8 49 53 AM

It's really mostly an annoyance cause the proxy recovers fairly quickly. You can see the restarts here.

bmears-aaxis avatar Sep 30 '21 15:09 bmears-aaxis

OK. That helps. I agree -- there's probably a way to check for network connectivity before continuing initialization.

enocom avatar Sep 30 '21 16:09 enocom

Looks like this is common problem: https://github.com/istio/istio/issues/11130.

enocom avatar Sep 30 '21 16:09 enocom

Possible fix for Istio users: https://github.com/istio/istio/issues/11130#issuecomment-707642638.

In short (available in Istio 1.7 and later):

annotations:
  proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'

enocom avatar Sep 30 '21 16:09 enocom

I guess the question now is, what version do we have? :) That looks promising though.

bmears-aaxis avatar Sep 30 '21 18:09 bmears-aaxis

This does feel like an issue with Istio, but if we have to, we can find a way to work around it. Would you mind reporting back if that annotation solves the issue for you?

enocom avatar Sep 30 '21 19:09 enocom

Most definently. I appreciate your assistance.

bmears-aaxis avatar Sep 30 '21 21:09 bmears-aaxis

@bmears-aaxis any updates?

enocom avatar Dec 02 '21 16:12 enocom

Related to #348.

enocom avatar Dec 02 '21 17:12 enocom

Going to close this given it's a Kubernetes problem and hasn't got much attention.

enocom avatar Mar 17 '23 02:03 enocom