staged-recipes
NASA SMAP SSS recipe
Draft PR which will close #30 when complete.
@rabernat, @jbusecke, and @hscannell: submitting this (very) rough first pass as a point for conversation around some structural questions (and a suggestion) that I've encountered so far. Interested in feedback regarding any of the below.
- The word "pipeline" appears in a lot of places, including the README for the `staged-recipes` repo, and in the title of this issue (#30).
  - Question: Do I understand correctly that this language is out-of-date, as contributors will no longer be engaging with the `Prefect` layer, but rather contributing `recipe.py`s and `meta.yaml`s only? If so, should we open an issue to re-write the README and associated docs?
- Why do we have so many things labeled `example` in issues? What's the difference between an `example` and just a recipe staged by a maintainer?
  - Related Question: Is my current directory structure correct? I have opted to make a new directory under `recipes/` rather than within `recipes/examples/`.
- Suggestion: I've opted to `pip install jupytext` (https://jupytext.readthedocs.io/en/latest/index.html) into my `staged-recipes` development environment, so that I can execute my `recipe.py` text file line-by-line in Jupyter during development. (Without this dependency, in order to debug the recipe in Jupyter, I would've had to create a separate `recipe-dev.ipynb` file for development, and then copy-and-paste the relevant bits into a `.py` file for the PR.) What do we think about incorporating this dependency as part of the recommended contribution/development workflow?
> Do I understand correctly that this language is out-of-date, as contributors will no longer be engaging with the `Prefect` layer, but rather contributing `recipe.py`s and `meta.yaml`s only? If so, should we open an issue to re-write the README and associated docs?
Yes to all of the above.
> 2. Why do we have so many things labeled `example` in issues? What's the difference between an `example` and just a recipe staged by a maintainer?
We are sort of moving gradually from collecting hypothetical use cases to actual recipes. I would update this label to be "proposed recipe".
> Is my current directory structure correct?
Yes, it's fine. The current CI workflow (#28) will search for `meta.yaml` anywhere in the PR.

Going forward, I think we want to make the repo as simple, bare-bones, and self-explanatory as possible. Feel free to propose changes in this direction.
> 3. What do we think about incorporating this dependency as part of the recommended contribution/development workflow?
:+1:
Why don't we open a new issue to track the improvements needed to the contributor workflow?
The recipe dict in `nasa-smap-sss/recipe.py` now appears to contain valid recipes for all four datasets (JPL and RSS, each at both timescales).
As I move now into the (manual, notebook-based) execution phase, I will echo that the feature(s) discussed in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/97 and https://github.com/pangeo-forge/pangeo-forge-recipes/issues/136 would presumably be useful even in manual execution settings.
My workaround was to estimate the source sizes as follows:
```python
import numpy as np
import xarray as xr

for store in urls:  # `urls` is a dictionary mapping 'store_name' -> list of source urls
    ds = xr.open_dataset(urls[store][10])  # an arbitrary source file from each dataset
    gbs = ds.nbytes / 1e9
    total_gbs = len(urls[store]) * gbs
    print(f"{store} contains approx. {np.trunc(total_gbs)} GBs.")
```
which returns:
```
NASA-SMAP-SSS/JPL/8day contains approx. 81.0 GBs.
NASA-SMAP-SSS/JPL/monthly contains approx. 2.0 GBs.
NASA-SMAP-SSS/RSS/8day contains approx. 110.0 GBs.
NASA-SMAP-SSS/RSS/monthly contains approx. 3.0 GBs.
```
Based on this information, I decided to start by trying to execute only the (considerably smaller) monthly recipes, using as a reference the notebook Ryan used to manually execute an eNATL60 recipe (see https://github.com/pangeo-forge/staged-recipes/pull/24#issuecomment-838757087). The notebook is not currently linkable in full because it contains secrets.
On the execution cell:
```python
for recipe_key, r in recipes.items():
    if 'monthly' in recipe_key:
        try:
            r.open_target()
            print(f"found {recipe_key}")
        except Exception:
            print(f"RUNNING {recipe_key}")
            pl = r.to_pipelines()
            plan = executor.pipelines_to_plan(pl)
            executor.execute_plan(plan)
    else:
        pass
```
I encountered the following errors:
- on the `try` block: `GroupNotFoundError: group not found at path ''`, possibly related to https://github.com/pydata/xarray/issues/2586
- on the `except` block: `ValueError: Got more bytes so far (>15260565) than requested (15242880)`, possibly related to https://github.com/intake/filesystem_spec/issues/160
I do not expect these issues will be diagnosable without the full notebook context, but I'm logging this in outline form here as a touchpoint nonetheless. Ryan and I will be discussing synchronously on Monday, after which I will follow up on this thread with any generalizable takeaways.
Charles, yesterday we boiled this error down to a specific issue with fsspec. Would you mind sharing that code snippet here?
Yes, the error was being thrown by line 40 in `storage.py` here. The minimal example below recreates the error using `fsspec.open()` alone. (The Traceback is included immediately below the example.)

As suggested in https://github.com/intake/filesystem_spec/issues/160#issuecomment-543803897, I was able to resolve this error by setting `fsspec_open_kwargs = {'block_size': 0}` when instantiating the recipe here. (In the minimal example, uncommenting `open_kwargs` achieves the same end.)
@martindurant, my lingering questions are:
- Is setting `fsspec_open_kwargs = {'block_size': 0}` indeed your recommended solution to this problem? Or have I overlooked some disadvantage of this solution?
- You note in the above-linked comment that this error arises when "fsspec would like to be able to random access the file by issuing Range requests, but the server doesn't respect this". Does the Traceback below stem from the same circumstance?
  - If so, is there any way to anticipate which source file servers will struggle in this way?
- Should I link this report to any ongoing fsspec Issues?
```python
from contextlib import contextmanager
from typing import Any, Iterator

import fsspec

# fsspec doesn't provide type hints, so I'm not sure what the right type is for open files
OpenFileType = Any

@contextmanager
def _fsspec_safe_open(fname: str, **kwargs) -> Iterator[OpenFileType]:
    # workaround for inconsistent behavior of fsspec.open
    # https://github.com/intake/filesystem_spec/issues/579
    with fsspec.open(fname, **kwargs) as fp:
        with fp as fp2:
            yield fp2

base = 'https://podaac-opendap.jpl.nasa.gov/opendap/allData/'
fname = base + 'smap/L3/JPL/V5.0/8day_running/2015/120/SMAP_L3_SSS_20150504_8DAYS_V5.0.nc'

# open_kwargs = {'block_size': 0}
input_opener = _fsspec_safe_open(fname, mode="rb")  # , **open_kwargs)

BLOCK_SIZE = 10_000_000
with input_opener as source:
    data = source.read(BLOCK_SIZE)
```
Traceback:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-efd8fb2f2463> in <module>
     25
     26 with input_opener as source:
---> 27     data = source.read(BLOCK_SIZE)

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in read(self, length)
    482         else:
    483             length = min(self.size - self.loc, length)
--> 484         return super().read(length)
    485
    486     async def async_fetch_all(self):

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/spec.py in read(self, length)
   1447             # don't even bother calling fetch
   1448             return b""
-> 1449         out = self.cache._fetch(self.loc, self.loc + length)
   1450         self.loc += len(out)
   1451         return out

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/caching.py in _fetch(self, start, end)
    374         ):
    375             # First read, or extending both before and after
--> 376             self.cache = self.fetcher(start, bend)
    377             self.start = start
    378         elif start < self.start:

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     70     def wrapper(*args, **kwargs):
     71         self = obj or args[0]
---> 72         return sync(self.loop, func, *args, **kwargs)
     73
     74     return wrapper

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     51     event.wait(timeout)
     52     if isinstance(result[0], BaseException):
---> 53         raise result[0]
     54     return result[0]
     55

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     18         coro = asyncio.wait_for(coro, timeout=timeout)
     19     try:
---> 20         result[0] = await coro
     21     except Exception as ex:
     22         result[0] = ex

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in async_fetch_range(self, start, end)
    544                         cl += len(chunk)
    545                         if cl > end - start:
--> 546                             raise ValueError(
    547                                 "Got more bytes so far (>%i) than requested (%i)"
    548                                 % (cl, end - start)

ValueError: Got more bytes so far (>15252381) than requested (15242880)
```
Noting that the PR referenced in the last commit is actually https://github.com/pangeo-forge/roadmap/pull/22, not the one linked in the commit message.
@sharkinsspatial, this is ready to be test-run through the bakery.
I've already manually executed the `copy_pruned()` versions of all the recipes contained in this PR's `dict_object` to Pangeo's OSN bucket. The plot below was created with this code block (credentials omitted, of course) at the bottom of the notebook.
Will there soon be a slash command that allows us to do a "test-bake" on the pruned subsets? (Apologies if the timeline on this was obvious from our other threads, still wrapping my head around all the layers here.)
cc @jbusecke, getting close!

> Is setting `fsspec_open_kwargs = {'block_size': 0}` indeed your recommended solution to this problem?

This is saying "I want to view the whole file as a block" and will work fine. Really, the code should be doing `fs.get` (not open/read), which would always do the right thing and also allow concurrent fetches.
> You note in the above-linked comment that this error arises when "fsspec would like to be able to random access the file by issuing Range requests, but the server doesn't respect this". Does the Traceback below stem from the same circumstance?
Yes, probably. It is marginally possible (but not likely) that the server is not respecting the content encoding. The response header would have more information.
> If so, is there any way to anticipate which source file servers will struggle in this way?
I'm afraid not. The HTTP response to HEAD or GET (before starting to download) might have useful markers, but this already depends on the server being well-behaved. Essentially, none of the header info keys are strictly required.
> Should I link this report to any ongoing fsspec Issues?
There have certainly been ongoing conversations around this kind of thing, and the range of circumstances that fsspec can handle has steadily grown.
> The plot below was created with this code block (credentials omitted, of course)

The OSN bucket is public for read-only access. You can access it over the s3 protocol with `anon=True` (see my OSN guide) or even http via `https://ncsa.osn.xsede.org/Pangeo/...`
> I'm afraid not. The HTTP response to HEAD or GET (before starting to download) might have useful markers, but this already depends on the server being well-behaved. Essentially, none of the header info keys are strictly required.
Then let's try to explicitly catch this error in Pangeo forge and raise a detailed error message with the suggested workaround.
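A sketch of what that could look like (a hypothetical helper, not actual pangeo-forge code): wrap the open call, detect the characteristic fsspec message, and re-raise with the `block_size=0` suggestion attached. The `open_with_hint` and `flaky_opener` names are invented for illustration.

```python
# Hypothetical sketch: translate the known fsspec range-request error
# into an actionable message suggesting the block_size=0 workaround.
# (Not actual pangeo-forge code; names are illustrative.)

def open_with_hint(opener, url, **open_kwargs):
    """Call `opener` (e.g. fsspec.open) and re-raise the 'Got more bytes'
    ValueError with a hint about the suggested workaround."""
    try:
        return opener(url, **open_kwargs)
    except ValueError as e:
        if "Got more bytes so far" in str(e):
            raise ValueError(
                f"{e}. The server for {url!r} may not support HTTP Range "
                "requests; try fsspec_open_kwargs={'block_size': 0}."
            ) from e
        raise

# Simulate the failure with a stand-in opener (no network needed):
def flaky_opener(url, **kwargs):
    raise ValueError("Got more bytes so far (>15252381) than requested (15242880)")

try:
    open_with_hint(flaky_opener, "https://example.com/data.nc")
except ValueError as e:
    print("block_size" in str(e))  # True
```

Chaining with `from e` keeps the original traceback visible, so the underlying fsspec error is still available for debugging.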
/run-recipe-test
@cisaacstern Can you include a `pangeo_notebook_version` at the root of your `meta.yaml`? You can use this as an example: https://github.com/pangeo-forge/staged-recipes/pull/36/files#diff-743ac37f3dbeb14ebdd6b873ade997238195d5652d365a37c52358662b001c6dR4. We use this to pin the image used by our bakery workers.
@cisaacstern As a note: in the short interim while we wait for a release of `pangeo-forge-recipes` including `copy_pruned`, I'll register these recipes with the CI workflow and attempt to run one of the smaller monthly recipes for validation.
/run-recipe-test
/run-recipe-test
/run-recipe-test
/run-recipe-test
@cisaacstern https://github.com/sharkinsspatial/zarr_examples/blob/main/nasa-smap-sss-jpl-monthly.ipynb 🎊
@jbusecke, the first two timesteps of each of the four datasets (two time intervals for each of two algorithms) are available on OSN as follows:
```python
import s3fs

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url})

fs_osn.ls("Pangeo/pangeo-forge/NASA-SMAP-SSS/JPL")
# ['Pangeo/pangeo-forge/NASA-SMAP-SSS/JPL/8day_pruned.zarr',
#  'Pangeo/pangeo-forge/NASA-SMAP-SSS/JPL/monthly_pruned.zarr']

fs_osn.ls("Pangeo/pangeo-forge/NASA-SMAP-SSS/RSS")
# ['Pangeo/pangeo-forge/NASA-SMAP-SSS/RSS/8day_pruned.zarr',
#  'Pangeo/pangeo-forge/NASA-SMAP-SSS/RSS/monthly_pruned.zarr']
```
@sharkinsspatial, were the complete time series ever built by the bakery, and if so are they publicly accessible somewhere?
Could we try re-running this recipe in our latest infrastructure?
Yes, I'll change the bakery in `meta.yaml`, which once committed will signal the bot to create a new recipe run for this.
It looks like your meta.yaml does not conform to the specification.
```
4 validation errors for MetaYaml
recipes -> 0 -> id
  field required (type=value_error.missing)
recipes -> 0 -> object
  field required (type=value_error.missing)
recipes
  value is not a valid dict (type=type_error.dict)
maintainers -> 0 -> orcid
  field required (type=value_error.missing)
```
Please correct your meta.yaml and commit the corrections to this PR.
The bot doesn't understand `dict_object`s yet... I'm going to see if I can quickly fix that...
A-ha. So it remains true that the bot does not understand `dict_object`s, but the validation error we see in https://github.com/pangeo-forge/staged-recipes/pull/31#issuecomment-1058307750 is actually because our `meta.yaml` currently gives

```yaml
recipes:
  - dict_object: "recipe:recipes"
```

when it should be a simple mapping (rather than a list), i.e.

```yaml
recipes:
  dict_object: "recipe:recipes"
```
I've noted this issue in https://github.com/pangeo-forge/roadmap/pull/49
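To make the shape difference concrete: the list form of the YAML parses to a list of one-key dicts, whereas the spec expects `recipes` to be a mapping. A minimal stand-in check (the actual validation uses pydantic's `MetaYaml` model, which is why the error reads `type_error.dict`):

```python
# What the two YAML snippets parse to (the real validation uses pydantic).
# "- dict_object: ..." yields a list of dicts; a bare "dict_object: ..." yields a mapping.
wrong = {"recipes": [{"dict_object": "recipe:recipes"}]}  # list -> type_error.dict
right = {"recipes": {"dict_object": "recipe:recipes"}}    # mapping -> valid

def recipes_field_is_mapping(meta):
    """Stand-in for the spec's requirement that `recipes` be a mapping."""
    return isinstance(meta.get("recipes"), dict)

print(recipes_field_is_mapping(wrong), recipes_field_is_mapping(right))  # False True
```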
pre-commit.ci autofix
Thanks for working on this @andersy005. Let me know if you need any input from my side!
/run NASA-SMAP-SSS/RSS/monthly
:tada: The test run of NASA-SMAP-SSS/RSS/monthly at e10df6b509ff5bf915de32b06f84ecb050783fad succeeded!
```python
import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1366/NASA-SMAP-SSS/RSS/monthly.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds
```
Seems like the data is not properly concatenated in time. There is a time dimension, but the data variables themselves have no time dimension?
/run NASA-SMAP-SSS/JPL/8day