Failed to Establish New Connection Error on GKE

Open bgoodman44 opened this issue 5 years ago • 4 comments

I'm using gcsfs version 0.6.0, and everything was working fine yesterday (so it might be a GKE issue). I'm on a GKE cluster, so my login credentials are already available, and I'm not explicitly providing a token. I've tried the "browser" method as well, with the same result.

Because everything was working yesterday, and it works on my local machine, I'm guessing there is an issue on GKE/GCS, and not with gcsfs...but just in case, I'll raise the issue here as well.

When I "ls" the bucket locally, everything works as expected. When I try on the cluster, I get the following error:

_call out of retries on exception: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/intellifin.net/o/?delimiter=%2F&prefix=MRKT_DATA%2F (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f56982f1e80>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 300, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 169, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f56982f1e80>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/intellifin.net/o/?delimiter=%2F&prefix=MRKT_DATA%2F (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f56982f1e80>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/gcsfs/core.py", line 534, in _call
    timeout=self.requests_timeout,
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/intellifin.net/o/?delimiter=%2F&prefix=MRKT_DATA%2F (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f56982f1e80>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

bgoodman44 · Mar 03 '20 11:03

socket.gaierror: [Errno -3] Temporary failure in name resolution

Looks like a genuine and hopefully temporary network issue. Are you using the "cloud" method to authenticate on GKE? You might need to enable API endpoints in your VPC.
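One way to confirm that this is a name-resolution problem rather than anything in gcsfs is to repeat the failing lookup by hand from inside the pod. The sketch below is only an illustration, not part of gcsfs: the helper name `can_resolve` is made up, and it uses the same `socket.getaddrinfo` call that appears at the bottom of the traceback.

```python
import socket


def can_resolve(host, port=443):
    """Return True if `host` resolves to at least one address.

    `can_resolve` is a hypothetical helper name used for illustration.
    It mirrors the socket.getaddrinfo call that fails in the traceback.
    """
    try:
        results = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Raised for failures such as
        # "[Errno -3] Temporary failure in name resolution".
        return False
    return len(results) > 0


# Example use from a Python shell inside the affected pod:
#   can_resolve("www.googleapis.com")
```

If this returns False inside the pod but True on a machine where "ls" works, the problem sits in the cluster's DNS or network setup, before any request reaches GCS.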

martindurant · Mar 03 '20 14:03

I am using token=None, which I believe falls into the "cloud" method for authentication?

Okay, so I guess I lied a little when I said (or implied) I was using the exact same configuration...the difference was I added a node taint to the master node when it stopped working. Not sure why this would cause gcsfs to stop working?

This works for gcsfs:

gcloud container clusters create --region us-central1 --num-nodes 1 --machine-type n1-standard-2 --cluster-version latest my-cluster

This yields the gcsfs error described in the first post:

gcloud container clusters create --region us-central1 --num-nodes 1 --node-taints masterkey=mastereval:NoSchedule --machine-type n1-standard-2 --cluster-version latest kube-opt-cluster

Any ideas?

bgoodman44 · Mar 04 '20 11:03

I have no concrete idea, but you should probably check what set of pods/services become unschedulable because of that taint. I daresay gcsfs/gauth is trying to connect to the metadata service, but it is not available in the opt version of the cluster.
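To see which pods the taint has left unschedulable, one option is to dump the pod list as JSON with `kubectl get pods --all-namespaces -o json` and filter for pods stuck in the Pending phase. This is a rough sketch under that assumption: the helper name `pending_pods` is made up for illustration, and it relies only on the `metadata.namespace`, `metadata.name` and `status.phase` fields of the pod objects.

```python
import json


def pending_pods(pod_list_json):
    """Return (namespace, name) for every pod whose phase is Pending.

    `pending_pods` is a hypothetical helper. `pod_list_json` is assumed
    to be the text printed by:
        kubectl get pods --all-namespaces -o json
    """
    items = json.loads(pod_list_json).get("items", [])
    return [
        (pod["metadata"].get("namespace", ""), pod["metadata"]["name"])
        for pod in items
        if pod.get("status", {}).get("phase") == "Pending"
    ]


# Example use, assuming the kubectl output was saved to pods.json:
#   with open("pods.json") as f:
#       for namespace, name in pending_pods(f.read()):
#           print(namespace, name)
```

If system pods in kube-system (for example the cluster DNS pods) show up in that list on the tainted cluster but not on the untainted one, that would be consistent with the name-resolution failure in the traceback.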

martindurant · Mar 04 '20 14:03

I wonder if it's because I'm tainting the master node? I'll create a separate tainted pool for my pod and report back.

bgoodman44 · Mar 05 '20 13:03