earthaccess
Document why signed S3 URLs might be giving 400s when called from inside us-west-2
Sometimes when you make a request to a URL behind earthdata login, after a series of redirects, you get sent to a signed S3 URL. This should be transparent to the client, as the URL itself contains all the authentication needed for access.
However, sometimes, in some clients, you get a generic 403 Forbidden here without much explanation. It has something to do with other auth being sent alongside (see https://github.com/nsidc/earthaccess/issues/187 for more vague info).
We should document what this is, and why you get the 403. This documentation would allow developing workarounds for various clients if needed.
You actually get a 400, and here is the smallest sample case:
import asyncio
import aiohttp
import netrc

url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

# Read the Earthdata Login credentials from ~/.netrc
username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
auth = aiohttp.BasicAuth(username, password)

async def main():
    # auth= is set on the session, so it applies to every request the session makes
    async with aiohttp.ClientSession(auth=auth) as session:
        async with session.get(url) as response:
            print(response.status)
            print((await response.read())[:30])

asyncio.run(main())
When running from inside us-west-2, this prints:
400
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidArgument</Code><Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message><ArgumentName>Authorization</ArgumentName><ArgumentValue>Basic eXV2aXBhbmRhOmFpc2hlZTh3b29naGFobmdpZW1vb3Nob0thaXhpaWJl</ArgumentValue><RequestId>XM26KTSJ4X85W6YR</RequestId><HostId>gjjlJGJmgjalTBXzAnnMg4eBl2MCd3k9UD4klvAO3Rjd18TOB3QCgDC3bAMwciPyIRrStqrD4SQ=</HostId></Error>
which is pretty clear and useful!
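Since the error body is plain XML, a client could detect this specific failure programmatically. The following is a minimal sketch using only the standard library; the function names are made up for illustration, and the sample body is a shortened copy of the response above with the credential value replaced.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the S3 error body shown above (IDs dropped, credential redacted).
SAMPLE_ERROR = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<Error><Code>InvalidArgument</Code>"
    b"<Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, "
    b"Signature query string parameter or the Authorization header should be specified</Message>"
    b"<ArgumentName>Authorization</ArgumentName>"
    b"<ArgumentValue>Basic REDACTED</ArgumentValue></Error>"
)


def parse_s3_error(body):
    """Return the child elements of an S3 <Error> document as a dict."""
    root = ET.fromstring(body)
    return {child.tag: (child.text or "") for child in root}


def is_duplicate_auth_error(body):
    """True if S3 rejected the request because an extra Authorization header was sent."""
    info = parse_s3_error(body)
    return (
        info.get("Code") == "InvalidArgument"
        and info.get("ArgumentName") == "Authorization"
    )


print(is_duplicate_auth_error(SAMPLE_ERROR))
```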
And on my laptop, this prints:
200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08\x00\x04\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
So we have a reproducible setup now. fsspec uses aiohttp under the hood, so this is the same issue fsspec is facing
This is likely the aiohttp bug actually: https://github.com/aio-libs/aiohttp/issues/2610
Here is the same code with requests:
import requests
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])
This actually produces the correct output on both my laptop and on us-west-2!
200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08'
This is most likely because requests implemented https://github.com/request/request/pull/1184, while the equivalent bug with aiohttp is still open.
This is amazing news, as this means that fixing https://github.com/aio-libs/aiohttp/issues/2610 should get fsspec to work, which means most of the pangeo stack would work after that. It will still have lower performance than using s3 directly when in us-west-2, so work there still needs to be done. But this will at least make sure regular https URLs work when both inside and outside us-west-2
aiohttp has documented this should not be the case, based on the note here: https://docs.aiohttp.org/en/stable/client_advanced.html?highlight=redirects#custom-request-headers
I also looked at the request being made by aiohttp, and see the following:
RequestInfo(url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf6'), method='GET', headers=<CIMultiDictProxy('Host': 'nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'Python/3.9 aiohttp/3.8.3', 'Authorization': 'Basic <removed>')>, real_url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf
'))
So I think this confirms that the Authorization header is being retained during redirects.
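To make the conflict concrete: the request above carries a query-string signature and an Authorization header at the same time, which is exactly what the S3 error message complains about. A rough check for that combination could look like the sketch below. The function name and the example bucket URL are hypothetical, and the list of query parameters is taken from the error message rather than from any S3 specification.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameters named in the S3 error message as signature carriers.
SIGNATURE_PARAMS = ("X-Amz-Algorithm", "X-Amz-Signature", "Signature")


def conflicting_auth(url, headers):
    """True if a request has both a signed query string and an Authorization header."""
    query = parse_qs(urlsplit(url).query)
    presigned = any(param in query for param in SIGNATURE_PARAMS)
    has_header = any(name.lower() == "authorization" for name in headers)
    return presigned and has_header


signed_url = (
    "https://example-bucket.s3.us-west-2.amazonaws.com/file.h5"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Signature=abc"
)
print(conflicting_auth(signed_url, {"Authorization": "Basic <removed>"}))
```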
I've now found @betolink's comment in https://github.com/aio-libs/aiohttp/issues/5783#issuecomment-981958210, which made me realize that what we want is for the credentials to be forwarded when we are redirected to earthdata login, but then dropped. What we are getting instead is the credentials being sent to every host in the chain.
AHA, so what's actually happening is that we are setting the basic auth on the session, rather than on the request. So it's being sent to every request from the session, including S3! This actually now is unrelated to the aiohttp bug
If I move the auth= from the session to just the request, I get a plain 401 denied, as the Basic auth is dropped during the redirect, which is correct and documented aiohttp behavior.
So the question now really is why does requests work?
Separately, it should be possible for us to subclass aiohttp's ClientSession to pass per-host basicauth so it can provide appropriate auth to different hosts in the chain, and just send basic auth to earthdata.
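As a rough illustration of the per-host idea (not a ClientSession subclass), the sketch below follows redirects by hand and attaches credentials only to hosts listed in a mapping. `fetch` is a hypothetical callable standing in for one request made with redirects disabled (for example with `allow_redirects=False`), and the `*.example` hosts and the fake transport exist only so the sketch runs without network access.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def follow_with_per_host_auth(url, fetch, per_host_auth, max_redirects=10):
    """Follow redirects by hand, attaching credentials only to listed hosts.

    `fetch(url, headers)` performs ONE request without following redirects
    and returns (status, location, body).
    """
    for _ in range(max_redirects + 1):
        host = urlsplit(url).hostname
        headers = {}
        if host in per_host_auth:
            headers["Authorization"] = per_host_auth[host]
        status, location, body = fetch(url, headers)
        if status in REDIRECT_STATUSES and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        return status, body
    raise RuntimeError("too many redirects")


# Stand-in transport: records which hosts received an Authorization header.
seen = []


def fake_fetch(url, headers):
    seen.append((urlsplit(url).hostname, "Authorization" in headers))
    if url.startswith("https://data.example/"):
        return 302, "https://login.example/oauth/authorize", b""
    if url.startswith("https://login.example/"):
        return 302, "https://bucket.example/file.h5?X-Amz-Signature=abc", b""
    return 200, None, b"file contents"


status, body = follow_with_per_host_auth(
    "https://data.example/file.h5",
    fake_fetch,
    {"login.example": "Basic ZGVtbzpkZW1v"},
)
print(status, seen)
```

In this run only the login host sees the Authorization header; the data host and the signed bucket URL do not.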
ok, so I have discovered why it works with requests but not with aiohttp.
It is because requests supports netrc lol!
So at the first redirect, requests drops the Authorization header, but when making the request to EDL, it reads netrc file directly and sends the appropriate credentials! So that is why it works by default with requests, and not with aiohttp.
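The behavior described above can be modelled in a few lines. This is a simplified sketch and not the actual requests implementation: it assumes credentials are dropped whenever the host changes and that a netrc entry for the new host, if present, is then applied. The redirect chain is shortened and assumed, and the netrc file is a throwaway with demo credentials.

```python
import netrc
import os
import tempfile
from urllib.parse import urlsplit


def credentials_per_hop(chain, netrc_path, explicit_auth):
    """Return (host, credentials) for each URL in a redirect chain.

    Simplified model: explicit credentials go with the first request, are
    dropped when the host changes, and are replaced by a netrc entry when
    one exists for the new host.
    """
    entries = netrc.netrc(netrc_path)
    hops = []
    current = explicit_auth
    previous_host = None
    for url in chain:
        host = urlsplit(url).hostname
        if previous_host is not None:
            if host != previous_host:
                current = None  # host changed: drop the old credentials
            found = entries.authenticators(host)
            if found is not None:
                login, _account, password = found
                current = (login, password)
        hops.append((host, current))
        previous_host = host
    return hops


# Throwaway netrc file with demo credentials for the EDL host only.
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as handle:
    handle.write("machine urs.earthdata.nasa.gov login demo-user password demo-pass\n")
    demo_netrc = handle.name

chain = [
    "https://data.nsidc.earthdatacloud.nasa.gov/some/file.h5",
    "https://urs.earthdata.nasa.gov/oauth/authorize",
    "https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/some/file.h5",
]
try:
    hops = credentials_per_hop(chain, demo_netrc, ("demo-user", "demo-pass"))
finally:
    os.remove(demo_netrc)

for host, creds in hops:
    print(host, "->", "credentials sent" if creds else "no credentials")
```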
So to summarize, the current problem is that we pass parameters to fsspec that are set at the ClientSession level, and those are sent with every request. So the Authorization header is also sent when making the request to S3, and it fails. This is validated with the following code:
import asyncio
import aiohttp
import netrc

url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
auth = aiohttp.BasicAuth(username, password)

async def main():
    async with aiohttp.ClientSession() as session:
        # auth= is now set on the request only, not on the whole session
        async with session.get(url, auth=auth) as response:
            print(response.status)
            print((await response.read())[:30])

asyncio.run(main())
This actually will fail with a HTTP Basic request denied error anywhere, which makes sense - the Authorization header is dropped at the first redirect to EDL, and then we get an access denied.
If I recreate this with requests by deleting my netrc file:
import requests
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username = "yuvipanda"
password = "mypassword"
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])
I get the exact same behavior.
WHICH IS GREAT! So the problem now isn't to do with redirects at all, it is really - how do we make sure to send the HTTP Basic Creds just to EDL? Because right now, the reason this works with non-cloud datasets is that we are actually leaking plaintext EDL creds to all of them, completely negating the point of OAuth2 :D
I see trust_env passed along to the aiohttp session, but aiohttp only uses this for proxies, not for authenticating to servers themselves.
So the current issue is really that aiohttp has no way to say 'for this domain, send this authentication information'. requests accidentally provides this with netrc, but otherwise doesn't afaict.
So, netrc support is actually the easiest way to make sure that we can send specific Basic Auth credentials only to specific Hosts. So I made this PR adding it to aiohttp! https://github.com/aio-libs/aiohttp/pull/7131
If merged and released, this should sort of automatically make fsspec work again.
Amazing work @yuvipanda! I'm just catching up with this thread. One thing I'd like to mention is that, if possible, it would be preferable to have a solution/workaround that does not rely on having a .netrc (even though that is what we have been doing for the tutorials).
@betolink so I think these tokens (https://urs.earthdata.nasa.gov/documentation/for_users/user_token) should get rid of the need for netrc completely. I have no idea why people are restricted to just two tokens per user - that makes it definitely harder to use :(
I dug some more into what fsspec would need to do for us to use client tokens.
fsspec currently supports a client_kwargs that allows setting headers and other misc options for all requests. This accidentally works now when making requests behind EDL from outside us-west-2, but doesn't work from inside (for all the reasons outlined in this issue). So we can not use the auth tokens with it either.
What we need is something like request_kwargs (that is passed into places like https://github.com/fsspec/filesystem_spec/blob/45de5b509bacf8a62d99848bb2361cc78733ad09/fsspec/implementations/http.py#L242 and everywhere else requests are constructed). This allows these params to be set just for the originating request, but not for any follow-on redirects from there. This wouldn't help when using username / password for EDL (as the username / password needs to be sent for a request along the redirect path, not the originating request), but would work for using tokens (as they must be only sent to the originating request).
I think this is a fairly well scoped and small change to fsspec that would be extremely useful! I'm super swamped though, I am hoping someone else can implement this?
Opened https://github.com/fsspec/filesystem_spec/issues/1142 to discuss what would help solve the issue from fsspec in allowing us to use tokens!
Turns out this already exists in fsspec - any kwargs you pass in actually get passed directly to the requests, exactly what we wanted!
So the following code works for me :)
from fsspec.implementations.http import HTTPFileSystem

url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
token = 'my-long-token'

# headers= is passed as a plain kwarg (not inside client_kwargs)
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})
with fs.open(url) as f:
    print(f.read()[:30])
yay!
ok, so current summary is:
- https://github.com/aio-libs/aiohttp/pull/7131 adds .netrc support to aiohttp, and hence to fsspec. This is needed for earthdata login access to work consistently in AWS us-west-2 with fsspec the same way it works elsewhere, while using earthdata username / password to login.
- However, I think we should recommend everyone use tokens for actually authenticating programmatically - https://urs.earthdata.nasa.gov/documentation/for_users/user_token. This already works with fsspec - just pass headers as a kwarg as shown in the comment above, rather than as a part of client_kwargs. yay!
Unfortunately, there is a limit of only two tokens per user in earthdata login right now, so you can not just generate a token for each machine you would use it in, like with GitHub Personal Access token. However, the lack of need for specific files means this would also work with dask.
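Because a token is just a string, it can travel to workers through the environment rather than through a file. The sketch below builds the same `headers` kwarg used in the fsspec example from an environment variable. The variable name `EARTHDATA_TOKEN` and the helper name are assumptions made for illustration; nothing in this thread or in Earthdata Login defines them.

```python
import os


def bearer_storage_options(env_var="EARTHDATA_TOKEN", environ=None):
    """Build the `headers` kwarg for fsspec from a token held in the environment.

    `EARTHDATA_TOKEN` is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set")
    return {"headers": {"Authorization": f"Bearer {token}"}}


# A plain dict stands in for os.environ so the sketch runs anywhere.
options = bearer_storage_options(environ={"EARTHDATA_TOKEN": "my-long-token"})
print(options)
```

The resulting dictionary could then be unpacked into `HTTPFileSystem(**options)`, which is equivalent to the `headers=` call shown in the earlier comment.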
Here is an example of it working with xarray!
from fsspec.implementations.http import HTTPFileSystem
import xarray as xr

url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
token = 'my-long-token'

fs = HTTPFileSystem(headers={
    "Authorization": f"bearer {token}"
})
# Open the remote file through fsspec and hand the file object to xarray
ds = xr.open_dataset(fs.open(url))
ds
This is awesome @yuvipanda! I feel like we need to refactor this library to only use CMR tokens everywhere instead of monkey-patching OAuth2 redirects for cloud-hosted data. I wish DAAC hosted data would follow the same behavior with bearer tokens. i.e.
# bearer token for the win with cloud hosted data !!
# url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
# =( bearer token? don't know him.
url = "https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2019.02.21/ATL08_20190221121851_08410203_005_01.h5"
Also, maybe we only need one token even if we use it concurrently from different processes? I haven't tested but I suspect it should work.
@betolink yeah we should only need one token even if it is used concurrently.
So the token only works for some datasets but not all? And works for cloud datasets but not on-prem? Does it work for any on prem thing at all?
I'm afraid it won't work for on-prem data, it may work for some data hosted at the ASF DAAC marked on-prem but actually hosted at AWS.
This is tremendous progress! Now there is a clear path for one of the most common access patterns!
@betolink feels like long term, the right way is to get the access token to work for all data, and support the Earthdata Login folks in this mission. In the meantime, netrc is the more universal solution, once we get the aiohttp PR merged. But that is slightly messy when it comes to dask, because it requires populating a specific file on each dask worker, which is not always easy. Does that sound right?
Me and @briannapagan did another bit of deep dive here, and made some more progress.
There seem to be two primary packages supporting earthdata login on the server side:
- TEA (https://github.com/asfadmin/thin-egress-app/) - this is what does the work for cloud hosted data
- An apache2 module (https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/browse) which seems to be used by most on-prem datacenters. It also seems to provide authentication for most OpenDAP servers (https://opendap.github.io/hyrax_guide/Master_Hyrax_Guide.html#_earthdata_login_oauth2).
We have established that TEA already supports bearer tokens (https://github.com/asfadmin/thin-egress-app/blob/7b0f7110b1694f553af2b71594cc19e40c179ea9/lambda/app.py#L183). But what of the apache2 module?!
As of Sep 2021, it also supports bearer tokens! https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187 is the appropriate merge commit, and we discovered an internal JIRA ticket named URSFOUR-1600 that also tracks this feature.
With some more sleuthing, we discovered https://forum.earthdata.nasa.gov/viewtopic.php?t=3290. We tracked that through looking for URSFOUR-1858, mentioned in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/8c4796c0467a1d5dcb8740fb86f23474db8258e3. That merge was the only further activity on the apache module since the merge for token support. Looking through that earthdata forum post, we see that LPDAAC (which maintains the dataset talked about there) mentions deploying 'some apache change' to help with that. So the hypothesis I had was:
- LPDAAC ran into some other unrelated issue,
- Which required code changes to the apache module, which was done via URSFOUR-1858
- They have deployed this change to their servers
- However, since this change was deployed, it is also likely that LPDAAC has included URSFOUR-1600 (user token support) in the deployment as well. Not necessarily explicitly, but just as a side effect of trying to deploy the more recent URSFOUR-1858.
I tested this hypothesis by trying to send a token to https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5 - a dataset hosted by LPDAAC. And behold, it works! So all data hosted by LPDAAC supports tokens :)
So the pathway to using tokens everywhere, including on-prem, boils down to getting all the DAACs to use the latest version of the official earthdata apache2 module.
This is great news for many reasons:
- No new code needs to be written! This all is already done.
- LPDAAC already deployed this, so it isn't a brand new deployment
- This is the official apache module that DAACs are already using, not some newfangled software.
Also, passing -v to curl will send you back the response headers, which usually contain < Server: Apache to indicate they are using the apache2 server - and hence most likely 'on-prem' (aka not coming from S3)
NSIDC also seems to have the latest version of the apache module - https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL06.005/2020.03.08/ATL06_20200308234154_11190602_005_01.h5 works with the token!
So looks like some (many?) DAACs have this deployed, and some don't.
@betolink in fact, the exact URL you used to test tokens earlier in https://github.com/nsidc/earthaccess/issues/188#issuecomment-1364042546 works now. My suspicion is that NSIDC deployed the latest version of the apache2 module very recently?
ASDC also supports tokens, as tested with https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_VFM-Standard-V4-20/2010/09/CAL_LID_L2_VFM-Standard-V4-20.2010-09-01T00-14-43ZN.hdf.
Again, I'm using the presence of Server: apache to distinguish on-prem vs S3 hosted data. I think it's reasonably accurate.
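That heuristic is easy to write down. The sketch below classifies a response by its `Server` header; the function name and return labels are made up for illustration, and the result is only a guess since a proxy or CDN in front of the origin could report something else.

```python
def hosting_guess(response_headers):
    """Guess where a file is served from, based on the Server response header."""
    server = ""
    for name, value in response_headers.items():
        if name.lower() == "server":  # header names are case-insensitive
            server = value.lower()
    if "apache" in server:
        return "on-prem (Apache)"
    if "amazons3" in server:
        return "S3"
    return "unknown"


print(hosting_guess({"Server": "Apache"}))
print(hosting_guess({"server": "AmazonS3"}))
```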
ORNL also supports it, as tested via https://daac.ornl.gov/daacdata/deltax/DeltaX_Ecogeomorphic_Products/data/DeltaX_EcoGeoCells_2021_TerrebonneEast_std_superpixels.tif.
Note that uppercase Bearer is what I'm using, as that's what the apache module supports (see line 684 in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187#mod_auth_urs.c).