
Refactor codebase to support a new simplified Parser->ManifestStore model.

sharkinsspatial opened this pull request 7 months ago • 14 comments

  • [x] Closes https://github.com/zarr-developers/VirtualiZarr/issues/561, https://github.com/zarr-developers/VirtualiZarr/issues/553, https://github.com/zarr-developers/VirtualiZarr/issues/498, https://github.com/zarr-developers/VirtualiZarr/issues/476, https://github.com/zarr-developers/VirtualiZarr/issues/559
  • [x] Tests added
  • [x] Tests passing
  • [x] Full type hint coverage
  • [ ] Changes are documented in docs/releases.rst
  • [ ] New functions/methods are listed in api.rst
  • [ ] New functionality has documentation

sharkinsspatial avatar May 20 '25 17:05 sharkinsspatial

Codecov Report

:x: Patch coverage is 87.82609% with 56 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 88.05%. Comparing base (639e8d4) to head (6d753d9).

Files with missing lines Patch % Lines
virtualizarr/parsers/tiff.py 0.00% 18 Missing :warning:
virtualizarr/xarray.py 82.25% 11 Missing :warning:
virtualizarr/parsers/hdf/hdf.py 95.27% 6 Missing :warning:
virtualizarr/manifests/store.py 80.00% 4 Missing :warning:
virtualizarr/parsers/kerchunk_parquet.py 90.90% 4 Missing :warning:
virtualizarr/parsers/utils.py 72.72% 3 Missing :warning:
virtualizarr/parsers/fits.py 88.88% 2 Missing :warning:
virtualizarr/parsers/netcdf3.py 88.23% 2 Missing :warning:
virtualizarr/translators/kerchunk.py 92.00% 2 Missing :warning:
virtualizarr/parsers/dmrpp.py 97.87% 1 Missing :warning:
... and 3 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #601      +/-   ##
===========================================
- Coverage    89.39%   88.05%   -1.35%     
===========================================
  Files           34       32       -2     
  Lines         1943     1783     -160     
===========================================
- Hits          1737     1570     -167     
- Misses         206      213       +7     
Files with missing lines Coverage Δ
virtualizarr/__init__.py 77.77% <100.00%> (ø)
virtualizarr/parallel.py 84.37% <100.00%> (-15.63%) :arrow_down:
virtualizarr/parsers/__init__.py 100.00% <100.00%> (ø)
virtualizarr/parsers/hdf/filters.py 97.97% <ø> (ø)
virtualizarr/parsers/dmrpp.py 94.40% <97.87%> (ø)
virtualizarr/parsers/kerchunk_json.py 96.29% <96.29%> (ø)
virtualizarr/parsers/zarr.py 95.77% <94.44%> (ø)
virtualizarr/utils.py 67.21% <85.71%> (-7.48%) :arrow_down:
virtualizarr/parsers/fits.py 88.88% <88.88%> (ø)
virtualizarr/parsers/netcdf3.py 88.23% <88.23%> (ø)
... and 7 more

... and 3 files with indirect coverage changes


codecov[bot] avatar May 20 '25 18:05 codecov[bot]

@TomNicholas Not sure how to diagnose this lithops failure https://github.com/zarr-developers/VirtualiZarr/actions/runs/15167505617/job/42649054260?pr=601. It is working on my machine 😆 but the trace looks lithops specific.

sharkinsspatial avatar May 21 '25 17:05 sharkinsspatial

this lithops failure

Umm that's weird. The CI is using lithops 3.6.0 - what are you using on your machine? (Though it worked on mine too ofc and 3.6.0 was released in May, so unlikely to be the problem...)

Is some temporary directory not being properly cleaned up?

TomNicholas avatar May 22 '25 04:05 TomNicholas

@TomNicholas I'm on 3.6 as well 🤔. I'll try to investigate a bit and see why only this preprocess function is failing on the GH CI.

sharkinsspatial avatar May 22 '25 16:05 sharkinsspatial

@maxrjones @TomNicholas I made some initial attempts to replace the kerchunk parser's use of fsspec's LazyReferenceMapper for reading references from parquet with a fully independent obstore implementation, so that our only fsspec dependency would be for the FITS and NetCDF3 parsers that would use kerchunk readers under the hood. But it has been pretty difficult to disentangle the LazyReferenceMapper logic into something simpler. At this point I'm not sure if it is worth the time for this PR, but it could complicate the use case where folks want to read from kerchunk parquet refs stored in object storage that requires fsspec storage options to access. Any thoughts on this?
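For background on what any replacement has to reproduce: kerchunk references (whether stored as JSON or parquet) map chunk keys either to inline data or to [url, offset, length] entries. A stdlib-only illustration of that structure (the values here are made up):

```python
import json

# Minimal kerchunk-style reference set: inline zarr metadata plus one
# chunk entry pointing at a byte range inside a remote file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/0.0": ["s3://bucket/file.nc", 10_000, 4_096],
    },
}

# Whatever backs the lookup (LazyReferenceMapper or an obstore-based
# reader), resolving a chunk key yields the same (url, offset, length).
url, offset, length = refs["refs"]["temp/0.0"]
```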

sharkinsspatial avatar May 26 '25 22:05 sharkinsspatial

I made some initial attempts to replace the kerchunk parser's use of fsspec's LazyReferenceMapper for reading references from parquet with a fully independent obstore implementation [...] Any thoughts on this?

Thanks for your work on this! I agree with limiting the time investment here. What do you think about splitting the kerchunk parser into a KerchunkJSONParser and a KerchunkParquetParser, where the latter has a reader_options configuration that is used to set up the LazyReferenceMapper, and the former benefits from the work you've put into using obstore where possible?

maxrjones avatar May 26 '25 22:05 maxrjones

replace the kerchunk parser's use of fsspec's LazyReferenceMapper for reading references from parquet with a fully independent obstore implementation so that our only fsspec dependency would be for the FITS and NetCDF3 parsers that would use kerchunk readers under the hood.

That's a sweet idea, but this PR already does too much in one go!!

I agree with @maxrjones that instead of completely rewriting yet another part of the codebase in this PR we just want some kind of minimal shim for now.

what do you think about splitting the kerchunk parser into a KerchunkJSONParser and KerchunkParquetParser where the latter has a reader_options configuration that it used to setup the LazyReferenceMapper

This isn't a bad idea for that shim. They are separate formats on disk, it would make a lot of sense for them to have separate parsers. If we changed our minds later we could potentially hide both behind an overarching KerchunkParser anyway. We would potentially need a 3rd KerchunkDictParser though too (or maybe not - I guess the dict is in-memory JSON).
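A minimal sketch of that split, with hypothetical class names and fields (not the final API):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class KerchunkJSONParser:
    """Hypothetical: parses kerchunk JSON references via obstore alone.

    An in-memory dict (the "KerchunkDictParser" case) is just pre-parsed
    JSON, so it could share this code path.
    """


@dataclass
class KerchunkParquetParser:
    """Hypothetical: parses kerchunk parquet references.

    Keeps the fsspec dependency so reader_options can be passed through
    to set up the LazyReferenceMapper for object storage access.
    """

    reader_options: dict[str, Any] = field(default_factory=dict)
```

Splitting along the on-disk format would confine the remaining fsspec surface to the parquet path.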

TomNicholas avatar May 27 '25 02:05 TomNicholas

@TomNicholas I'm still seeing intermittent test failures (and hangs) with the lithops executor that I'm not able to replicate. @maxrjones and I chatted a bit and, in the interest of getting the other docs PRs finished and aligned, we think it might be best to try to get this merged and then investigate the lithops issues separately. I've temporarily xfailed the lithops tests, and @chuckwondo and I will investigate separately, as he was hitting the same issues in some previous work he was doing.
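For reference, xfailing the flaky tests is a one-line marker; a sketch assuming pytest (the test name here is made up):

```python
import pytest


@pytest.mark.xfail(
    reason="Intermittent lithops executor failures/hangs on CI; "
    "tracked in a follow-up issue",
    strict=False,  # still passes when the test happens to succeed
)
def test_lithops_parallel_open():
    ...
```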

sharkinsspatial avatar May 27 '25 20:05 sharkinsspatial

This is the more "real-world" case I'm trying to use to test this PR:

from urllib.parse import urlparse

import earthaccess
from obstore.auth.earthdata import NasaEarthdataCredentialProvider
from obstore.store import S3Store

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

url = 's3://gesdisc-cumulus-prod-protected/GLDAS/GLDAS_NOAH025_3H.2.1/2022/091/GLDAS_NOAH025_3H.A20220401.0000.021.nc4'
auth = earthaccess.login()
credential_provider = NasaEarthdataCredentialProvider(
    username=auth.username,
    password=auth.password,
)
parsed = urlparse(url)
bucket = parsed.netloc
store = S3Store(bucket, credential_provider=credential_provider)
ds = open_virtual_dataset(url, object_store=store, parser=HDFParser())
ds

But it's erroring on the obstore_reader:

obstore.exceptions.PermissionDeniedError: The operation lacked the necessary privileges to complete for path GLDAS_NOAH025_3H.A20220401.0000.021.nc4: Error performing HEAD https://s3.us-west-2.amazonaws.com/gesdisc-cumulus-prod-protected/GLDAS_NOAH025_3H.A20220401.0000.021.nc4 in 367.69725ms - Server returned non-2xx status code: 403 Forbidden: 

Debug source:
PermissionDenied {
    path: "GLDAS_NOAH025_3H.A20220401.0000.021.nc4",
    source: RetryError {
        method: HEAD,
        uri: Some(
            https://s3.us-west-2.amazonaws.com/gesdisc-cumulus-prod-protected/GLDAS_NOAH025_3H.A20220401.0000.021.nc4,
        ),
        retries: 0,
        max_retries: 10,
        elapsed: 367.69725ms,
        retry_timeout: 180s,
        inner: Status {
            status: 403,
            body: Some(
                "",
            ),
        },
    },
}

I'm not sure yet whether this is a regression from the fsspec-based implementation, whether it was never working, or whether I'm just setting up the credentials wrong.

Looking at the path in the request, I think that part of the issue is the lack of support for non-prefix stores. But I still get a forbidden error with prefix store support in https://github.com/zarr-developers/VirtualiZarr/tree/remove-default-store.
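The store/prefix mismatch is visible in the failing HEAD URL above: the store is rooted at the bucket, but only the filename ends up as the key. A stdlib-only illustration of where the URL needs to be split (the helper name is made up):

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket, key-including-prefix)."""
    parsed = urlparse(url)
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url(
    "s3://gesdisc-cumulus-prod-protected/GLDAS/GLDAS_NOAH025_3H.2.1/"
    "2022/091/GLDAS_NOAH025_3H.A20220401.0000.021.nc4"
)
# A store rooted only at `bucket` must be handed the full `key`,
# not just the trailing filename, or the HEAD request goes to the
# wrong path (as in the 403 above).
```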

maxrjones avatar May 28 '25 01:05 maxrjones

Thanks, @sharkinsspatial - that's annoying about Lithops but it definitely makes sense to push that problem to another PR. Would be good to open an issue so we don't forget about it.

@maxrjones ' issue seems like a @kylebarron question?

TomNicholas avatar May 28 '25 02:05 TomNicholas

It's hard to say from that traceback what the error is. Is it from auth? Or are you calling HEAD on a directory?

kylebarron avatar May 28 '25 02:05 kylebarron

@maxrjones, currently, the obstore NasaEarthdataCredentialProvider works only for PODAAC holdings.

However, I have a draft PR I'm currently working on to fix things so that it works for everything, and also adds support for env vars and bearer tokens.

See https://github.com/developmentseed/obstore/pull/472

chuckwondo avatar May 28 '25 10:05 chuckwondo

@maxrjones I think most of your comments have been addressed if you'd like to have another look over this when you have a moment 🙏 .

sharkinsspatial avatar May 30 '25 23:05 sharkinsspatial

@sharkinsspatial, I'll make some time to review as well, from the perspective of someone much less familiar with all of this logic than the rest of you :-) (I'll be on the lookout for missing context managers in particular!)

chuckwondo avatar May 31 '25 11:05 chuckwondo

Amazing work @maxrjones, @sharkinsspatial and @chuckwondo!

@sharkinsspatial and @chuckwondo carried the team here 🙇‍♂️

maxrjones avatar Jun 16 '25 19:06 maxrjones

@maxrjones I think this is ready to merge 🙀 😆 . Do you want to do the honors?

sharkinsspatial avatar Jun 16 '25 20:06 sharkinsspatial