Document common obstore gotchas
Since obstore is a newer library, it'd be useful to provide some documentation about common gotchas when creating an ObjectStore for use in virtualizarr. Here are some things I commonly mess up:
- Don't include the scheme (e.g., s3://) when instantiating a store via a class method like obstore.store.S3Store(bucket=bucket).
  - For example, use bucket="podaac-ops-cumulus-protected", not bucket="s3://podaac-ops-cumulus-protected".
- Include client_options={"allow_http": True} when instantiating a store via a class method.
  - For example, obstore.store.S3Store(bucket=bucket, client_options={"allow_http": True})
- Include virtual_hosted_style_request=False when instantiating a store via a class method.
  - For example, obstore.store.S3Store(bucket=bucket, client_options={"allow_http": True}, virtual_hosted_style_request=False)
- S3 bucket access for most NASA DAACs requires being in-region (i.e., running in AWS us-west-2).
- Obstore access to AWS S3 always requires setting the region. Most non-AWS S3-compatible clouds do not.
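
One way to avoid the scheme gotcha from the list above is to derive the bucket name from the URL rather than typing it by hand. The following is a minimal sketch using only the standard library; split_s3_url is a hypothetical helper name (not part of obstore), and the object key in the example URL is made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket, key), with no scheme in the bucket.

    Hypothetical helper for illustration; not part of obstore.
    """
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got {url!r}")
    # netloc is the bare bucket name; path carries the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# The key below is a made-up example path.
bucket, key = split_s3_url("s3://podaac-ops-cumulus-protected/some/prefix/file.nc")
print(bucket)  # podaac-ops-cumulus-protected
print(key)     # some/prefix/file.nc
```

The resulting bucket value is what would be passed as bucket= to the store constructor.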
Other useful tidbits: To go from earthaccess to an ObjectStore:
import earthaccess
import obstore
from urllib.parse import urlparse

# Find one granule and take its first direct-access (s3://) link
granule_info = earthaccess.search_data(
    short_name="ECCO_L4_TEMP_SALINITY_05DEG_DAILY_V4R4",
    count=1,
)
data_s3links = [g.data_links(access="direct")[0] for g in granule_info]
url = data_s3links[0]
parsed = urlparse(url)

# Log in and request temporary S3 credentials for the DAAC
earthaccess.login()
creds = earthaccess.get_s3_credentials(daac="PODAAC")

store = obstore.store.S3Store(
    bucket=parsed.netloc,  # bucket name only, no s3:// scheme
    region="us-west-2",
    access_key_id=creds['accessKeyId'],
    secret_access_key=creds['secretAccessKey'],
    token=creds['sessionToken'],
    virtual_hosted_style_request=False,
    client_options={"allow_http": True},
)
(a separate issue could be making this simpler; xref https://github.com/nsidc/earthaccess/discussions/1051, https://github.com/nsidc/earthaccess/discussions/956)
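
Until something simpler exists, the mapping from earthaccess credentials to S3Store arguments could be wrapped in a small function. This is only a sketch: s3store_kwargs is a hypothetical name (neither obstore nor earthaccess provides it), the credential values below are placeholders, and the example URL path is made up. It assumes the credential dictionary has the accessKeyId, secretAccessKey, and sessionToken keys used in the snippet above.

```python
from urllib.parse import urlparse


def s3store_kwargs(url: str, creds: dict, region: str = "us-west-2") -> dict:
    """Build keyword arguments for obstore.store.S3Store.

    Hypothetical helper for illustration; not an obstore or earthaccess API.
    """
    return {
        "bucket": urlparse(url).netloc,  # bare bucket name, no s3:// scheme
        "region": region,
        "access_key_id": creds["accessKeyId"],
        "secret_access_key": creds["secretAccessKey"],
        "token": creds["sessionToken"],
        "virtual_hosted_style_request": False,
        "client_options": {"allow_http": True},
    }


# Placeholder values only; real ones would come from
# earthaccess.get_s3_credentials(daac="PODAAC").
fake_creds = {
    "accessKeyId": "EXAMPLE_KEY_ID",
    "secretAccessKey": "EXAMPLE_SECRET",
    "sessionToken": "EXAMPLE_TOKEN",
}
kwargs = s3store_kwargs("s3://podaac-ops-cumulus-protected/example/file.nc", fake_creds)
print(kwargs["bucket"])  # podaac-ops-cumulus-protected

# With obstore installed and real credentials:
# store = obstore.store.S3Store(**kwargs)
```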
This is super helpful @maxrjones!
This is great, but ideally as many of these as possible would be fixed or documented upstream surely?
Several of these are nuances of how virtualizarr and obstore interact, not bugs or upstream issues (e.g., virtual_hosted_style_request=False, allow_http=True). Possibly we can minimize these quirks.
I was trying to replace a glob and ended up with:
import fnmatch

import obstore

prefix = "NEX-GDDP-CMIP6/GISS-E2-1-G/historical/"
file_pattern = f"{prefix}r1i1p1*/tas/*"
store = obstore.store.from_url("s3://nex-gddp-cmip6/", region="us-west-2", skip_signature=True)

# List everything under the prefix, then keep only the paths matching the glob pattern.
# Top-level `async for` works in a notebook; elsewhere it must run inside an async function.
all_files = []
async for files in obstore.list(store, prefix=prefix):
    all_files += [f["path"] for f in files if fnmatch.fnmatch(f["path"], file_pattern)]
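
The filtering step can be checked without touching S3 by pulling it into a plain function and feeding it stand-in listing chunks. A rough sketch: filter_listing is a hypothetical name, and the file names in fake_chunks are invented for illustration. It assumes each chunk is a list of dicts with a "path" key, as in the snippet above. Note that fnmatch's * also matches "/", which is why the trailing * can cover anything under tas/.

```python
import fnmatch
from typing import Iterable


def filter_listing(chunks: Iterable[list[dict]], pattern: str) -> list[str]:
    """Return the "path" of every listed object that matches a glob pattern.

    Hypothetical helper for illustration; `chunks` stands in for the
    batches yielded by the listing loop.
    """
    matched = []
    for chunk in chunks:
        matched += [obj["path"] for obj in chunk if fnmatch.fnmatch(obj["path"], pattern)]
    return matched


prefix = "NEX-GDDP-CMIP6/GISS-E2-1-G/historical/"
file_pattern = f"{prefix}r1i1p1*/tas/*"

# Invented paths that mimic the shape of a listing result.
fake_chunks = [
    [
        {"path": prefix + "r1i1p1f2/tas/tas_day_1950.nc"},
        {"path": prefix + "r1i1p1f2/pr/pr_day_1950.nc"},
    ],
    [{"path": prefix + "r2i1p1f2/tas/tas_day_1950.nc"}],
]

all_files = filter_listing(fake_chunks, file_pattern)
print(all_files)  # only the r1i1p1f2/tas/ path is kept
```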