VirtualiZarr

Document common obstore gotchas

Open maxrjones opened this issue 4 months ago • 4 comments

Since obstore is a newer library, it'd be useful to provide some documentation about common gotchas when creating an ObjectStore for use in virtualizarr. Here are some things I commonly mess up:

  • Don't include the scheme (e.g., s3://) when instantiating a store via a class method like obstore.store.S3Store(bucket=bucket).
    • For example, use bucket="podaac-ops-cumulus-protected", not bucket="s3://podaac-ops-cumulus-protected".
  • Include client_options={"allow_http": True} when instantiating a store via a class method.
    • For example, obstore.store.S3Store(bucket=bucket, client_options={"allow_http": True}).
  • Include virtual_hosted_style_request=False when instantiating a store via a class method.
    • For example, obstore.store.S3Store(bucket=bucket, client_options={"allow_http": True}, virtual_hosted_style_request=False).
  • S3 bucket access for most NASA DAACs requires being in-region (i.e., in AWS us-west-2).
  • Obstore access to AWS S3 always requires setting the region. Most non-AWS S3 compatible clouds do not.
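One way to sidestep the first gotcha is to split a full URL before constructing the store. A minimal sketch using only the standard library; `split_s3_url` is a hypothetical helper name (not part of obstore or VirtualiZarr), and the object key in the usage line is made up for illustration:

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket, key), with no scheme in the bucket."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got {url!r}")
    # netloc is the bare bucket name; path is the object key with a leading slash
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical object key, for illustration only
bucket, key = split_s3_url("s3://podaac-ops-cumulus-protected/some/file.nc")
```

The resulting `bucket` can then be passed as `bucket=bucket` to a store class method.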

Other useful tidbits: To go from earthaccess to an ObjectStore:

from urllib.parse import urlparse

import earthaccess
import obstore

# Log in first so the search and credential requests are authenticated
earthaccess.login()

granule_info = earthaccess.search_data(
    short_name="ECCO_L4_TEMP_SALINITY_05DEG_DAILY_V4R4",
    count=1,
)
# Direct-access links are s3:// URLs
data_s3links = [g.data_links(access="direct")[0] for g in granule_info]
url = data_s3links[0]
parsed = urlparse(url)

# Temporary S3 credentials for the DAAC that hosts the data
creds = earthaccess.get_s3_credentials(daac="PODAAC")
store = obstore.store.S3Store(
    bucket=parsed.netloc,  # bucket name only, no s3:// scheme
    region="us-west-2",
    access_key_id=creds["accessKeyId"],
    secret_access_key=creds["secretAccessKey"],
    token=creds["sessionToken"],
    virtual_hosted_style_request=False,
    client_options={"allow_http": True},
)

(a separate issue could be making this simpler; xref https://github.com/nsidc/earthaccess/discussions/1051, https://github.com/nsidc/earthaccess/discussions/956)
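Until something like that exists, the credential mapping could be wrapped in a small function. This is only a sketch: `s3store_kwargs` is a hypothetical name, the credential values below are placeholders, and the keys are the ones used in the snippet above.

```python
def s3store_kwargs(creds: dict, bucket: str, region: str = "us-west-2") -> dict:
    """Map an earthaccess-style credentials dict to S3Store keyword arguments."""
    return {
        "bucket": bucket,  # bare bucket name, no s3:// scheme
        "region": region,
        "access_key_id": creds["accessKeyId"],
        "secret_access_key": creds["secretAccessKey"],
        "token": creds["sessionToken"],
        "virtual_hosted_style_request": False,
        "client_options": {"allow_http": True},
    }


# Placeholder credentials, for illustration only
fake_creds = {
    "accessKeyId": "AKIA-PLACEHOLDER",
    "secretAccessKey": "secret-placeholder",
    "sessionToken": "token-placeholder",
}
kwargs = s3store_kwargs(fake_creds, "podaac-ops-cumulus-protected")
# With real credentials: store = obstore.store.S3Store(**kwargs)
```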

maxrjones avatar Jul 21 '25 14:07 maxrjones

This is super helpful @maxrjones!

norlandrhagen avatar Jul 21 '25 16:07 norlandrhagen

This is great, but ideally as many of these as possible would be fixed or documented upstream surely?

TomNicholas avatar Jul 21 '25 17:07 TomNicholas

> This is great, but ideally as many of these as possible would be fixed or documented upstream surely?

Several of these are nuances of the intersection of VirtualiZarr and obstore, not bugs or upstream issues (e.g., virtual_hosted_style_request=False, allow_http=True). Possibly we can minimize these quirks.

maxrjones avatar Jul 21 '25 18:07 maxrjones

I was trying to replace a glob and ended up with:

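import fnmatch

import obstore

prefix = "NEX-GDDP-CMIP6/GISS-E2-1-G/historical/"
file_pattern = f"{prefix}r1i1p1*/tas/*"

# Public bucket: skip request signing, but still set the region
store = obstore.store.from_url("s3://nex-gddp-cmip6/", region="us-west-2", skip_signature=True)

all_files = []
# obstore.list yields batches of object metadata. A plain for loop runs in a
# regular script; inside a notebook or coroutine, `async for` also works.
for files in obstore.list(store, prefix=prefix):
    all_files += [f["path"] for f in files if fnmatch.fnmatch(f["path"], file_pattern)]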
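One thing to keep in mind with this approach: unlike a shell glob, `*` in `fnmatch` also matches `/`, so the pattern can pick up objects in nested "directories". A small sketch with made-up object paths (the file names are assumptions, not real keys in the bucket):

```python
import fnmatch

prefix = "NEX-GDDP-CMIP6/GISS-E2-1-G/historical/"
file_pattern = f"{prefix}r1i1p1*/tas/*"

# Hypothetical object paths, for illustration only
paths = [
    f"{prefix}r1i1p1f2/tas/tas_day_1950.nc",
    f"{prefix}r1i1p1f2/tas/nested/extra.nc",
    f"{prefix}r1i1p1f2/pr/pr_day_1950.nc",
]

# fnmatch's "*" crosses "/" boundaries, so the nested path matches too
matches = [p for p in paths if fnmatch.fnmatch(p, file_pattern)]

# Requiring the same number of "/" as the pattern mimics glob-style depth
strict = [p for p in matches if p.count("/") == file_pattern.count("/")]
```

Here `matches` includes the nested path, while `strict` keeps only paths at the same depth as the pattern.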

jsignell avatar Sep 19 '25 19:09 jsignell