Allow custom proxy settings with requests sessions

Open maawoo opened this issue 1 year ago • 7 comments

I'm trying to download GEDI data on my university's HPC system. The following sample code results in a ConnectionError:

results = earthaccess.search_data(
    short_name='GEDI02_A',
    bounding_box=(31.52,-25.08,31.64,-24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1
)
ConnectionError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: [/search/granules.umm_json](https://vscode-remote+ssh-002dremote-002bdraco2.vscode-resource.vscode-cdn.net/search/granules.umm_json)?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f69ef2d40b0>: Failed to establish a new connection: [Errno 111] Connection refused'))

My initial thought was that the API is not whitelisted in our HTTP/HTTPS proxies, which are set via environment variables. However, according to our sysadmin, this should not be an issue. I was able to confirm this by requesting the same URL via curl:

>> curl "https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0"
{"hits":92,"took":394,"items":[]}

Any ideas / workarounds would be appreciated!
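
Since curl succeeds with the same URL, one quick check is whether the Python process sees the same proxy environment variables as the shell. The sketch below uses only the standard library; `describe_proxy_env` is a hypothetical helper name, and it reports what `urllib` detects, which is not necessarily what earthaccess or requests ends up using.

```python
import urllib.request


def describe_proxy_env(host: str) -> dict:
    """Report the proxy settings the standard library detects for `host`.

    Hypothetical diagnostic helper, not part of earthaccess or requests.
    """
    # getproxies() picks up variables such as http_proxy / https_proxy
    proxies = urllib.request.getproxies()
    # proxy_bypass() is truthy when no_proxy rules exclude this host
    bypassed = bool(urllib.request.proxy_bypass(host))
    return {"proxies": proxies, "bypassed": bypassed}
```

Calling `describe_proxy_env("cmr.earthdata.nasa.gov")` in the same interpreter that raises the ConnectionError should show whether an `https` entry is present at all.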

maawoo • Mar 26 '24 13:03

It's really hard to say what's going on here without knowing more about your university HPC system. Based on the error, it looks like VSCode is somehow involved?

https://vscode-remote+ssh-002dremote-002bdraco2.vscode-resource.vscode-cdn.net/search/granules.umm_json

Can you provide some more detail on how VSCode is involved in your workflow? host='cmr.earthdata.nasa.gov' indicates that earthaccess is at least attempting to talk to the correct host, and the Requests library seems to agree!

mfisher87 • Mar 26 '24 14:03

Hi @mfisher87, I overlooked that, so thanks for pointing it out. However, I still get an error when executing the code outside of VSCode.

Here is the complete error traceback:
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py:203, in HTTPConnection._new_conn(self)
    202 try:
--> 203     sock = connection.create_connection(
    204         (self._dns_host, self.port),
    205         self.timeout,
    206         source_address=self.source_address,
    207         socket_options=self.socket_options,
    208     )
    209 except socket.gaierror as e:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py:85, in create_connection(address, timeout, source_address, socket_options)
     84 try:
---> 85     raise err
     86 finally:
     87     # Break explicitly a reference cycle

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py:73, in create_connection(address, timeout, source_address, socket_options)
     72     sock.bind(source_address)
---> 73 sock.connect(sa)
     74 # Break explicitly a reference cycle

ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

NewConnectionError                        Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:791, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    790 # Make the request on the HTTPConnection object
--> 791 response = self._make_request(
    792     conn,
    793     method,
    794     url,
    795     timeout=timeout_obj,
    796     body=body,
    797     headers=headers,
    798     chunked=chunked,
    799     retries=retries,
    800     response_conn=response_conn,
    801     preload_content=preload_content,
    802     decode_content=decode_content,
    803     **response_kw,
    804 )
    806 # Everything went great!

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:492, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    491         new_e = _wrap_proxy_error(new_e, conn.proxy.scheme)
--> 492     raise new_e
    494 # conn.request() calls http.client.*.request, not the method in
    495 # urllib3.request. It also calls makefile (recv) on the socket.

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:468, in HTTPConnectionPool._make_request(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)
    467 try:
--> 468     self._validate_conn(conn)
    469 except (SocketTimeout, BaseSSLError) as e:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:1097, in HTTPSConnectionPool._validate_conn(self, conn)
   1096 if conn.is_closed:
-> 1097     conn.connect()
   1099 if not conn.is_verified:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py:611, in HTTPSConnection.connect(self)
    610 sock: socket.socket | ssl.SSLSocket
--> 611 self.sock = sock = self._new_conn()
    612 server_hostname: str = self.host

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py:218, in HTTPConnection._new_conn(self)
    217 except OSError as e:
--> 218     raise NewConnectionError(
    219         self, f"Failed to establish a new connection: {e}"
    220     ) from e
    222 # Audit hooks are only available in Python 3.8+

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f81075a01d0>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connectionpool.py:845, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)
    843     new_e = ProtocolError("Connection aborted.", new_e)
--> 845 retries = retries.increment(
    846     method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    847 )
    848 retries.sleep()

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/retry.py:515, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    514     reason = error or ResponseError(cause)
--> 515     raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    517 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: /search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f81075a01d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[3], line 1
----> 1 results = earthaccess.search_data(
      2     short_name='GEDI02_A',
      3     bounding_box=(31.52,-25.08,31.64,-24.99),
      4     temporal=("2019-01-01", "2024-01-01"),
      5     count=-1
      6 )

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/earthaccess/api.py:120, in search_data(count, **kwargs)
    118 else:
    119     query = DataGranules().parameters(**kwargs)
--> 120 granules_found = query.hits()
    121 print(f"Granules found: {granules_found}")
    122 if count > 0:

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/earthaccess/search.py:388, in DataGranules.hits(self)
    379 """Returns the number of hits the current query will return.
    380 This is done by making a lightweight query to CMR and inspecting the returned headers.
    381
    382 Returns:
    383     The number of results reported by CMR.
    384 """
    386 url = self._build_url()
--> 388 response = self.session.get(url, headers=self.headers, params={"page_size": 0})
    390 try:
    391     response.raise_for_status()

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/sessions.py:602, in Session.get(self, url, **kwargs)
    594 r"""Sends a GET request. Returns :class:`Response` object.
    595
    596 :param url: URL for the new :class:`Request` object.
    597 :param \*\*kwargs: Optional arguments that ``request`` takes.
    598 :rtype: requests.Response
    599 """
    601 kwargs.setdefault("allow_redirects", True)
--> 602 return self.request("GET", url, **kwargs)

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/adapters.py:519, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    515     if isinstance(e.reason, _SSLError):
    516         # This branch is for urllib3 v1.22 and later.
    517         raise SSLError(e, request=request)
--> 519     raise ConnectionError(e, request=request)
    521 except ClosedPoolError as e:
    522     raise ConnectionError(e, request=request)

ConnectionError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: /search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f81075a01d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

Same error also in a clean environment with Python 3.11.8 instead of 3.12.2.

I also tried downgrading the package (to 0.7.0) and noticed that it prints out the number of granules found before the error:

>>> earthaccess.search_data(
...     short_name='GEDI02_A',
...     bounding_box=(31.52,-25.08,31.64,-24.99),
...     temporal=("2019-01-01", "2024-01-01"),
...     count=-1
... )
Granules found: 92
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
...

Any other ideas of what I could do?

maawoo • Mar 27 '24 09:03

Okay, I found the explanation in this icepyx discussion. Ping @betolink 🙂 Any suggestion on using earthaccess.search_data and earthaccess.download with an updated requests session?

maawoo • Mar 27 '24 10:03

Hi @maawoo, I think this could be resolved if we let users pass proxy settings through to requests. In the meantime, you can get a session manually, modify it, and download the files yourself, but that defeats the purpose!

import os
import shutil
from itertools import chain  # to flatten the results

import earthaccess

earthaccess.login()

# Define your proxy (placeholder values; replace with your own)
proxy = {
    'http': 'http://your_proxy_address:port',
    'https': 'https://your_proxy_address:port'
}

results = earthaccess.search_data(
    short_name='GEDI02_A',
    bounding_box=(31.52,-25.08,31.64,-24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1
)

# Flatten the per-granule link lists into a single list of URLs
links = list(chain.from_iterable(r.data_links() for r in results))

# Get an authenticated session and route its traffic through the proxy
session = earthaccess.get_requests_https_session()
session.proxies.update(proxy)

os.makedirs("temp_dir", exist_ok=True)  # the output directory must exist

for url in links:
    local_filename = url.split("/")[-1]
    path = f"temp_dir/{local_filename}"
    # Stream each file to disk in 1 MiB chunks
    with session.get(url, stream=True, allow_redirects=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            shutil.copyfileobj(r.raw, f, length=1024 * 1024)

This is not concurrent, so there is room for improvement. As I said, we should implement the proxy here, but my guess is that it won't be ready in the next week.
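
As a rough illustration of the concurrency mentioned above, the download loop could be handed to a thread pool. This is a sketch under assumptions, not part of earthaccess: `download_all` and the `fetch` callable are hypothetical names, and `fetch(url, path)` stands in for the `session.get(...)` and `shutil.copyfileobj(...)` body of the workaround. Whether a single requests session can safely be shared between threads is not something this thread establishes, so creating one session per worker may be the more cautious choice.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str, Path], None],
    out_dir: str,
    max_workers: int = 4,
) -> List[Path]:
    """Download every URL into `out_dir` using a pool of worker threads.

    `fetch` is a caller-supplied function (hypothetical here) that writes
    the content of one URL to the given path.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def _one(url: str) -> Path:
        # Name the local file after the last URL path segment
        target = out / url.split("/")[-1]
        fetch(url, target)
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order and re-raises worker errors
        return list(pool.map(_one, urls))
```

Keeping `fetch` as a parameter means the proxy-aware session logic stays in one place and can be swapped out or tested without network access.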

betolink • Mar 27 '24 18:03

Thank you for the possible workaround!

my guess is that it won't be ready in the next week

No worries! I already have the data I need. My plan was to integrate earthaccess into some scripts, but that can wait for now.

maawoo • Mar 27 '24 18:03