Refactor codebase to support a new simplified Parser->ManifestStore model.
- [x] Closes https://github.com/zarr-developers/VirtualiZarr/issues/561 https://github.com/zarr-developers/VirtualiZarr/issues/553 https://github.com/zarr-developers/VirtualiZarr/issues/498 https://github.com/zarr-developers/VirtualiZarr/issues/476 https://github.com/zarr-developers/VirtualiZarr/issues/559
- [x] Tests added
- [x] Tests passing
- [x] Full type hint coverage
- [ ] Changes are documented in `docs/releases.rst`
- [ ] New functions/methods are listed in `api.rst`
- [ ] New functionality has documentation
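For orientation, a minimal sketch of the new usage pattern, pieced together from the example later in this thread (the `ManifestStore` intermediate is implied by the PR title; the bucket and file below are placeholders, not real data):

```python
from obstore.store import S3Store

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

# Hypothetical placeholders: any readable bucket/file would do.
store = S3Store("example-bucket")

# A parser instance is passed explicitly; per the PR title, the parser
# is assumed to build a ManifestStore that open_virtual_dataset consumes.
vds = open_virtual_dataset(
    "s3://example-bucket/data/file.nc",
    object_store=store,
    parser=HDFParser(),
)
```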
Codecov Report
:x: Patch coverage is 87.82609% with 56 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 88.05%. Comparing base (639e8d4) to head (6d753d9).
Additional details and impacted files
```diff
@@            Coverage Diff            @@
##           develop     #601    +/-   ##
===========================================
- Coverage    89.39%   88.05%   -1.35%
===========================================
  Files           34       32       -2
  Lines         1943     1783     -160
===========================================
- Hits          1737     1570     -167
- Misses         206      213       +7
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| virtualizarr/__init__.py | 77.77% <100.00%> (ø) | |
| virtualizarr/parallel.py | 84.37% <100.00%> (-15.63%) | :arrow_down: |
| virtualizarr/parsers/__init__.py | 100.00% <100.00%> (ø) | |
| virtualizarr/parsers/hdf/filters.py | 97.97% <ø> (ø) | |
| virtualizarr/parsers/dmrpp.py | 94.40% <97.87%> (ø) | |
| virtualizarr/parsers/kerchunk_json.py | 96.29% <96.29%> (ø) | |
| virtualizarr/parsers/zarr.py | 95.77% <94.44%> (ø) | |
| virtualizarr/utils.py | 67.21% <85.71%> (-7.48%) | :arrow_down: |
| virtualizarr/parsers/fits.py | 88.88% <88.88%> (ø) | |
| virtualizarr/parsers/netcdf3.py | 88.23% <88.23%> (ø) | |
| ... and 7 more | | |
@TomNicholas Not sure how to diagnose this lithops failure https://github.com/zarr-developers/VirtualiZarr/actions/runs/15167505617/job/42649054260?pr=601. It is working on my machine 😆 but the trace looks lithops-specific.
> this lithops failure
Umm that's weird. The CI is using lithops 3.6.0 - what are you using on your machine? (Though it worked on mine too ofc and 3.6.0 was released in May, so unlikely to be the problem...)
Is some temporary directory not being properly cleaned up?
@TomNicholas I'm on 3.6 as well 🤔. I'll try to investigate a bit and see why only this preprocess function is failing on the GH CI.
@maxrjones @TomNicholas I made some initial attempts to replace the kerchunk parser's use of fsspec's `LazyReferenceMapper` for reading references from parquet with a fully independent obstore implementation, so that our only fsspec dependency would be for the FITS and NetCDF3 parsers that would use kerchunk readers under the hood. But it has been pretty difficult to disentangle the `LazyReferenceMapper` logic into something more simplified. At this point I'm not sure if it is worth the time for this PR, but it could complicate the use case where folks want to read from kerchunk parquet refs stored in object storage that requires `fsspec` storage options to access. Any thoughts on this?
thanks for your work on this! I agree with limiting the time investment here. what do you think about splitting the kerchunk parser into a `KerchunkJSONParser` and a `KerchunkParquetParser`, where the latter has a `reader_options` configuration that it uses to set up the `LazyReferenceMapper`, and the former benefits from the work you've put into using obstore where possible?
> replace the kerchunk parser's use of fsspec's `LazyReferenceMapper` for reading references from parquet with a fully independent obstore implementation so that our only fsspec dependency would be for the FITS and NetCDF3 parsers that would use kerchunk readers under the hood.
That's a sweet idea, but this PR already does too much in one go!!
I agree with @maxrjones that instead of completely rewriting yet another part of the codebase in this PR we just want some kind of minimal shim for now.
> what do you think about splitting the kerchunk parser into a `KerchunkJSONParser` and a `KerchunkParquetParser`, where the latter has a `reader_options` configuration that it uses to set up the `LazyReferenceMapper`
This isn't a bad idea for that shim. They are separate formats on disk, so it would make a lot of sense for them to have separate parsers. If we changed our minds later we could potentially hide both behind an overarching `KerchunkParser` anyway. We might also need a third `KerchunkDictParser` (or maybe not - I guess the dict is just in-memory JSON). A rough sketch of the split follows below.
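For concreteness, a hypothetical sketch of that split. The class names match the suggestion above, but the signatures and bodies are illustrative, not the final API:

```python
# Illustrative sketch only; the real parser protocol lives in this PR's diff.
from dataclasses import dataclass, field


@dataclass
class KerchunkJSONParser:
    """Parse kerchunk JSON references, fetched through obstore."""

    def __call__(self, url, object_store):
        ...  # read the JSON refs via object_store and build a ManifestStore


@dataclass
class KerchunkParquetParser:
    """Parse kerchunk parquet references via fsspec's LazyReferenceMapper."""

    reader_options: dict = field(default_factory=dict)  # fsspec storage options

    def __call__(self, url, object_store):
        from fsspec.implementations.reference import LazyReferenceMapper

        ...  # hand self.reader_options to fsspec when opening the refs
```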
@TomNicholas I'm still seeing intermittent test failures (and hangs) with the lithops executor that I'm not able to replicate. @maxrjones and I chatted a bit, and in the interest of getting the other docs PRs finished and aligned, we think it might be best to try to get this merged and then investigate the lithops issues separately. I've temporarily xfailed the lithops tests, and @chuckwondo and I will investigate separately, as he was hitting the same issues in some previous work he was doing.
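(For anyone following along, the temporary xfail is just the standard pytest marker, along these lines; the test name and reason string below are illustrative, and the real ones are in the diff:)

```python
import pytest


# Illustrative only; see the PR diff for the actual tests being marked.
@pytest.mark.xfail(reason="intermittent lithops executor failures on CI")
def test_open_virtual_mfdataset_lithops():
    ...
```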
This is the more "real-world" case I'm trying to use to test this PR:
```python
from urllib.parse import urlparse

import earthaccess
from obstore.auth.earthdata import NasaEarthdataCredentialProvider
from obstore.store import S3Store

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

url = "s3://gesdisc-cumulus-prod-protected/GLDAS/GLDAS_NOAH025_3H.2.1/2022/091/GLDAS_NOAH025_3H.A20220401.0000.021.nc4"

# Authenticate with NASA Earthdata and build an obstore credential provider.
auth = earthaccess.login()
credential_provider = NasaEarthdataCredentialProvider(
    username=auth.username,
    password=auth.password,
)

# Point an S3Store at the bucket and open the file as a virtual dataset.
parsed = urlparse(url)
bucket = parsed.netloc
store = S3Store(bucket, credential_provider=credential_provider)

ds = open_virtual_dataset(url, object_store=store, parser=HDFParser())
ds
```
But it's erroring on the `obstore_reader`:

```
obstore.exceptions.PermissionDeniedError: The operation lacked the necessary privileges to complete for path GLDAS_NOAH025_3H.A20220401.0000.021.nc4: Error performing HEAD https://s3.us-west-2.amazonaws.com/gesdisc-cumulus-prod-protected/GLDAS_NOAH025_3H.A20220401.0000.021.nc4 in 367.69725ms - Server returned non-2xx status code: 403 Forbidden:

Debug source:
PermissionDenied {
    path: "GLDAS_NOAH025_3H.A20220401.0000.021.nc4",
    source: RetryError {
        method: HEAD,
        uri: Some(
            https://s3.us-west-2.amazonaws.com/gesdisc-cumulus-prod-protected/GLDAS_NOAH025_3H.A20220401.0000.021.nc4,
        ),
        retries: 0,
        max_retries: 10,
        elapsed: 367.69725ms,
        retry_timeout: 180s,
        inner: Status {
            status: 403,
            body: Some(
                "",
            ),
        },
    },
}
```
I'm not sure yet if this is a regression from the fsspec-based implementation, if it was never working, or if I'm just setting up the credentials wrong.
Looking at the path in the request, I think that part of the issue is the lack of support for non-prefix stores. But I still get a forbidden error with prefix store support in https://github.com/zarr-developers/VirtualiZarr/tree/remove-default-store.
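A rough sketch of what scoping the store to a prefix could look like here (assuming obstore's `S3Store` accepts a `prefix` keyword; check the obstore docs):

```python
from urllib.parse import urlparse

import earthaccess
from obstore.auth.earthdata import NasaEarthdataCredentialProvider
from obstore.store import S3Store

url = "s3://gesdisc-cumulus-prod-protected/GLDAS/GLDAS_NOAH025_3H.2.1/2022/091/GLDAS_NOAH025_3H.A20220401.0000.021.nc4"
parsed = urlparse(url)

auth = earthaccess.login()
credential_provider = NasaEarthdataCredentialProvider(
    username=auth.username,
    password=auth.password,
)

# Scope the store to the object's parent "directory" so relative keys
# resolve to the full object path instead of just the filename.
prefix = parsed.path.rsplit("/", 1)[0].lstrip("/")
store = S3Store(
    parsed.netloc,
    prefix=prefix,  # assumes S3Store accepts a prefix keyword
    credential_provider=credential_provider,
)
```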
Thanks, @sharkinsspatial - that's annoying about Lithops but it definitely makes sense to push that problem to another PR. Would be good to open an issue so we don't forget about it.
@maxrjones' issue seems like a @kylebarron question?
It's hard to say from that traceback what the error is. Is it from auth? Or are you calling HEAD on a directory?
@maxrjones, currently, the obstore NasaEarthdataCredentialProvider works only for PODAAC holdings.
However, I have a draft PR I'm currently working on to fix things so that it works for everything, and also adds support for env vars and bearer tokens.
See https://github.com/developmentseed/obstore/pull/472
@maxrjones I think most of your comments have been addressed if you'd like to have another look over this when you have a moment 🙏 .
@sharkinsspatial, I'll make some time to review as well, from the perspective of someone much less familiar with all of this logic than the rest of you :-) (I'll be on the lookout for missing context managers in particular!)
Amazing work @maxrjones, @sharkinsspatial and @chuckwondo!
@sharkinsspatial and @chuckwondo carried the team here 🙇♂️
@maxrjones I think this is ready to merge 🙀 😆 . Do you want to do the honors?