staged-recipes
NASA SMAP SSS recipe
Draft PR which will close #30 when complete.
@rabernat, @jbusecke, and @hscannell: submitting this (very) rough first pass as a point for conversation around some structural questions (and a suggestion) that I've encountered so far. Interested in feedback regarding any of the below.
- The word "pipeline" appears in a lot of places, including the README for the `staged-recipes` repo, and in the title of this issue (#30).
  - Question: Do I understand correctly that this language is out-of-date, as contributors will no longer be engaging with the `Prefect` layer, but rather contributing `recipe.py`s and `meta.yaml`s only? If so, should we open an issue to re-write the README and associated docs?
- Why do we have so many things labeled `example` in issues? What's the difference between an `example` and just a recipe staged by a maintainer?
  - Related Question: Is my current directory structure correct? I have opted to make a new directory under `recipes/` rather than within `recipes/examples/`.
- Suggestion: I've opted to `pip install jupytext` (https://jupytext.readthedocs.io/en/latest/index.html) into my `staged-recipes` development environment, so that I can execute my `recipe.py` text file line-by-line in Jupyter during development. (Without this dependency, in order to debug the recipe in Jupyter, I would've had to create a separate `recipe-dev.ipynb` file for development, and then copy-and-paste the relevant bits into a `.py` file for the PR.) What do we think about incorporating this dependency as part of the recommended contribution/development workflow?
> Do I understand correctly that this language is out-of-date, as contributors will no longer be engaging with the `Prefect` layer, but rather contributing `recipe.py`s and `meta.yaml`s only? If so, should we open an issue to re-write the README and associated docs?
Yes to all of the above.
> 2. Why do we have so many things labeled `example` in issues? What's the difference between an `example` and just a recipe staged by a maintainer?
We are sort of moving gradually from collecting hypothetical use cases to actual recipes. I would update this label to be "proposed recipe".
> Is my current directory structure correct?
Yes, it's fine. The current CI workflow (#28) will search for `meta.yaml` anywhere in the PR.

Going forward, I think we want to make the repo as simple, bare-bones, and self-explanatory as possible. Feel free to propose changes in this direction.
> 3. What do we think about incorporating this dependency as part of the recommended contribution/development workflow?
:+1:
Why don't we open a new issue to track the improvements needed to the contributor workflow?
The recipe dict in `nasa-smap-sss/recipe.py` now appears to contain valid recipes for all four datasets (JPL and RSS, each at both timescales).
As I move now into the (manual, notebook-based) execution phase, I will echo that the feature(s) discussed in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/97 and https://github.com/pangeo-forge/pangeo-forge-recipes/issues/136 would presumably be useful even in manual execution settings.
My workaround was to estimate the source sizes as follows:
```python
import numpy as np
import xarray as xr

for store in urls:  # `urls` is a dictionary mapping 'store_name' -> list of source urls
    ds = xr.open_dataset(urls[store][10])  # an arbitrary source file from each dataset
    gbs = ds.nbytes / 1e9
    total_gbs = len(urls[store]) * gbs
    print(f"{store} contains approx. {np.trunc(total_gbs)} GBs.")
```
which returns:
```
NASA-SMAP-SSS/JPL/8day contains approx. 81.0 GBs.
NASA-SMAP-SSS/JPL/monthly contains approx. 2.0 GBs.
NASA-SMAP-SSS/RSS/8day contains approx. 110.0 GBs.
NASA-SMAP-SSS/RSS/monthly contains approx. 3.0 GBs.
```
Based on this information, I decided to start by trying to execute only the (considerably smaller) monthly recipes, using as a reference the notebook Ryan used to manually execute an eNATL60 recipe (see https://github.com/pangeo-forge/staged-recipes/pull/24#issuecomment-838757087). The notebook is not currently linkable in full because it contains secrets.
On the execution cell:
```python
for recipe_key, r in recipes.items():
    if 'monthly' in recipe_key:
        try:
            r.open_target()
            print(f"found {recipe_key}")
        except Exception:
            print(f"RUNNING {recipe_key}")
            pl = r.to_pipelines()
            plan = executor.pipelines_to_plan(pl)
            executor.execute_plan(plan)
    else:
        pass
```
I encountered the following errors:
- on the `try` block: `GroupNotFoundError: group not found at path ''`, possibly related to https://github.com/pydata/xarray/issues/2586
- on the `except` block: `ValueError: Got more bytes so far (>15260565) than requested (15242880)`, possibly related to https://github.com/intake/filesystem_spec/issues/160
I do not expect these issues will be diagnosable without the full notebook context, but I'm logging this in outline form here as a touchpoint nonetheless. Ryan and I will be discussing synchronously on Monday, after which I will follow up on this thread with any generalizable takeaways.
Charles, yesterday we boiled this error down to a specific issue with fsspec. Would you mind sharing that code snippet here?
Yes, the error was being thrown by line 40 in `storage.py` here. The minimal example below recreates the error using `fsspec.open()` alone. (The Traceback is included immediately below the example.)

As suggested in https://github.com/intake/filesystem_spec/issues/160#issuecomment-543803897, I was able to resolve this error by setting `fsspec_open_kwargs = {'block_size': 0}` when instantiating the recipe here. (In the minimal example, uncommenting `open_kwargs` achieves the same end.)
@martindurant, my lingering questions are:
- Is setting `fsspec_open_kwargs = {'block_size': 0}` indeed your recommended solution to this problem? Or have I overlooked some disadvantage of this solution?
- You note in the above-linked comment that this error arises when "fsspec would like to be able to random access the file by issuing Range requests, but the server doesn't respect this". Does the Traceback below stem from the same circumstance?
  - If so, is there any way to anticipate which source file servers will struggle in this way?
- Should I link this report to any ongoing fsspec Issues?
```python
from contextlib import contextmanager
from typing import Any, Iterator

import fsspec

# fsspec doesn't provide type hints, so I'm not sure what the right type is for open files
OpenFileType = Any

@contextmanager
def _fsspec_safe_open(fname: str, **kwargs) -> Iterator[OpenFileType]:
    # workaround for inconsistent behavior of fsspec.open
    # https://github.com/intake/filesystem_spec/issues/579
    with fsspec.open(fname, **kwargs) as fp:
        with fp as fp2:
            yield fp2

base = 'https://podaac-opendap.jpl.nasa.gov/opendap/allData/'
fname = base + 'smap/L3/JPL/V5.0/8day_running/2015/120/SMAP_L3_SSS_20150504_8DAYS_V5.0.nc'

# open_kwargs = {'block_size': 0}
input_opener = _fsspec_safe_open(fname, mode="rb")  # , **open_kwargs)

BLOCK_SIZE = 10_000_000
with input_opener as source:
    data = source.read(BLOCK_SIZE)
```
Traceback:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-efd8fb2f2463> in <module>
     25
     26 with input_opener as source:
---> 27     data = source.read(BLOCK_SIZE)

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in read(self, length)
    482         else:
    483             length = min(self.size - self.loc, length)
--> 484         return super().read(length)
    485
    486     async def async_fetch_all(self):

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/spec.py in read(self, length)
   1447             # don't even bother calling fetch
   1448             return b""
-> 1449         out = self.cache._fetch(self.loc, self.loc + length)
   1450         self.loc += len(out)
   1451         return out

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/caching.py in _fetch(self, start, end)
    374         ):
    375             # First read, or extending both before and after
--> 376             self.cache = self.fetcher(start, bend)
    377             self.start = start
    378         elif start < self.start:

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     70     def wrapper(*args, **kwargs):
     71         self = obj or args[0]
---> 72         return sync(self.loop, func, *args, **kwargs)
     73
     74     return wrapper

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     51     event.wait(timeout)
     52     if isinstance(result[0], BaseException):
---> 53         raise result[0]
     54     return result[0]
     55

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     18         coro = asyncio.wait_for(coro, timeout=timeout)
     19     try:
---> 20         result[0] = await coro
     21     except Exception as ex:
     22         result[0] = ex

~/.pyenv/versions/anaconda3-2019.10/envs/pangeo-forge3.8/lib/python3.8/site-packages/fsspec/implementations/http.py in async_fetch_range(self, start, end)
    544                         cl += len(chunk)
    545                         if cl > end - start:
--> 546                             raise ValueError(
    547                                 "Got more bytes so far (>%i) than requested (%i)"
    548                                 % (cl, end - start)

ValueError: Got more bytes so far (>15252381) than requested (15242880)
```
Noting that the PR referenced in the last commit is actually https://github.com/pangeo-forge/roadmap/pull/22, not the one linked in the commit message.
@sharkinsspatial, this is ready to be test-run through the bakery.
I've already manually executed the `copy_pruned()` versions of all the recipes contained in this PR's `dict_object` to Pangeo's OSN bucket. The plot below was created with this code block (credentials omitted, of course) at the bottom of the notebook.
Will there soon be a slash command that allows us to do a "test-bake" on the pruned subsets? (Apologies if the timeline on this was obvious from our other threads, still wrapping my head around all the layers here.)
cc @jbusecke, getting close!

> Is setting `fsspec_open_kwargs = {'block_size': 0}` indeed your recommended solution to this problem?

This is saying "I want to view the whole file as a block" and will work fine. Really, the code should be doing `fs.get` (not open/read), which would always do the right thing and also allow concurrent fetches.
> You note in the above-linked comment that this error arises when "fsspec would like to be able to random access the file by issuing Range requests, but the server doesn't respect this". Does the Traceback below stem from the same circumstance?
Yes, probably. It is marginally possible (but not likely) that the server is not respecting the content encoding. The response header would have more information.
> If so, is there any way to anticipate which source file servers will struggle in this way?
I'm afraid not. The HTTP response to HEAD or GET (before starting to download) might have useful markers, but this already depends on the server being well-behaved. Essentially, none of the header info keys are strictly required.
> Should I link this report to any ongoing fsspec Issues?
There have certainly been ongoing conversations around this kind of thing, and the range of circumstances that fsspec can handle has steadily grown.
> The plot below was created with this code block (credentials omitted, of course)

The OSN bucket is public for read-only access. You can access it over the s3 protocol with `anon=True` (see my OSN guide) or even http via `https://ncsa.osn.xsede.org/Pangeo/...`
> I'm afraid not. The HTTP response to HEAD or GET (before starting to download) might have useful markers, but this already depends on the server being well-behaved. Essentially, none of the header info keys are strictly required.
Then let's try to explicitly catch this error in Pangeo forge and raise a detailed error message with the suggested workaround.
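A sketch of what that could look like (a hypothetical helper, not actual pangeo-forge code): wrap the open call, detect the characteristic fsspec message, and re-raise with the `block_size=0` suggestion attached. The `open_with_hint` and `flaky_opener` names are invented for illustration.

```python
# Hypothetical sketch: translate the known fsspec range-request error
# into an actionable message suggesting the block_size=0 workaround.
# (Not actual pangeo-forge code; names are illustrative.)

def open_with_hint(opener, url, **open_kwargs):
    """Call `opener` (e.g. fsspec.open) and re-raise the 'Got more bytes'
    ValueError with a hint about the suggested workaround."""
    try:
        return opener(url, **open_kwargs)
    except ValueError as e:
        if "Got more bytes so far" in str(e):
            raise ValueError(
                f"{e}. The server for {url!r} may not support HTTP Range "
                "requests; try fsspec_open_kwargs={'block_size': 0}."
            ) from e
        raise

# Simulate the failure with a stand-in opener (no network needed):
def flaky_opener(url, **kwargs):
    raise ValueError("Got more bytes so far (>15252381) than requested (15242880)")

try:
    open_with_hint(flaky_opener, "https://example.com/data.nc")
except ValueError as e:
    print("block_size" in str(e))  # True
```

Chaining with `from e` keeps the original traceback visible, so the underlying fsspec error is still available for debugging.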
/run-recipe-test
@cisaacstern Can you include a `pangeo_notebook_version` at the root of your `meta.yaml`? You can use this as an example: https://github.com/pangeo-forge/staged-recipes/pull/36/files#diff-743ac37f3dbeb14ebdd6b873ade997238195d5652d365a37c52358662b001c6dR4. We use this to pin the image used by our bakery workers.
@cisaacstern As a note: in the short interim while we wait for a release of `pangeo-forge-recipes` including `copy_pruned`, I'll register these recipes with the CI workflow and attempt to run one of the smaller monthly recipes for validation.
/run-recipe-test
/run-recipe-test
/run-recipe-test
/run-recipe-test
@cisaacstern https://github.com/sharkinsspatial/zarr_examples/blob/main/nasa-smap-sss-jpl-monthly.ipynb 🎊
@jbusecke, the first two timesteps of each of the four datasets (two time intervals for each of two algorithms) are available on OSN as follows:
```python
import s3fs

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url})

fs_osn.ls("Pangeo/pangeo-forge/NASA-SMAP-SSS/JPL")
# ['Pangeo/pangeo-forge/NASA-SMAP-SSS/JPL/8day_pruned.zarr',
#  'Pangeo/pangeo-forge/NASA-SMAP-SSS/JPL/monthly_pruned.zarr']

fs_osn.ls("Pangeo/pangeo-forge/NASA-SMAP-SSS/RSS")
# ['Pangeo/pangeo-forge/NASA-SMAP-SSS/RSS/8day_pruned.zarr',
#  'Pangeo/pangeo-forge/NASA-SMAP-SSS/RSS/monthly_pruned.zarr']
```
@sharkinsspatial, were the complete time series ever built by the bakery, and if so are they publicly accessible somewhere?
Could we try re-running this recipe in our latest infrastructure?
Yes, I'll change the bakery in `meta.yaml`, which once committed will signal the bot to create a new recipe run for this.
It looks like your meta.yaml does not conform to the specification.
```
4 validation errors for MetaYaml
recipes -> 0 -> id
  field required (type=value_error.missing)
recipes -> 0 -> object
  field required (type=value_error.missing)
recipes
  value is not a valid dict (type=type_error.dict)
maintainers -> 0 -> orcid
  field required (type=value_error.missing)
```
Please correct your meta.yaml and commit the corrections to this PR.
The bot doesn't understand `dict_object`s yet... I'm going to see if I can quickly fix that...
A-ha. So it remains true that the bot does not understand `dict_object`s, but the validation error we see in https://github.com/pangeo-forge/staged-recipes/pull/31#issuecomment-1058307750 is actually because our `meta.yaml` currently gives

```yaml
recipes:
  - dict_object: "recipe:recipes"
```

when it should be a simple mapping (rather than a list), i.e.

```yaml
recipes:
  dict_object: "recipe:recipes"
```
I've noted this issue in https://github.com/pangeo-forge/roadmap/pull/49
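To make the shape difference concrete: the list form of the YAML parses to a list of one-key dicts, whereas the spec expects `recipes` to be a mapping. A minimal stand-in check (the actual validation uses pydantic's `MetaYaml` model, which is why the error reads `type_error.dict`):

```python
# What the two YAML snippets parse to (the real validation uses pydantic).
# "- dict_object: ..." yields a list of dicts; a bare "dict_object: ..." yields a mapping.
wrong = {"recipes": [{"dict_object": "recipe:recipes"}]}  # list -> type_error.dict
right = {"recipes": {"dict_object": "recipe:recipes"}}    # mapping -> valid

def recipes_field_is_mapping(meta):
    """Stand-in for the spec's requirement that `recipes` be a mapping."""
    return isinstance(meta.get("recipes"), dict)

print(recipes_field_is_mapping(wrong), recipes_field_is_mapping(right))  # False True
```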
pre-commit.ci autofix
Thanks for working on this @andersy005. Let me know if you need any input from my side!
/run NASA-SMAP-SSS/RSS/monthly
:tada: The test run of NASA-SMAP-SSS/RSS/monthly at e10df6b509ff5bf915de32b06f84ecb050783fad succeeded!
```python
import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1366/NASA-SMAP-SSS/RSS/monthly.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds
```
Seems like the data is not properly concatenated in time. There is a time dimension, but the data variables themselves have no time dimension?
/run NASA-SMAP-SSS/JPL/8day