Add CESM2-LE pipeline
Closes #51
I added a couple of files which @cisaacstern worked through this morning. This is preliminary for now, and it can only be run within the GLADE filesystem at NCAR since the data live there, but I am hoping this will at least provide an example!
Awesome, thanks @mgrover1! Recapping here for clarity: our plan was to "skip" caching, because you already have access to all of the source files on GLADE. To implement this, we initially instantiated your source file directory as a CacheFSSpecTarget object. This raised the issue that your source filenames do not include the prefix added by pangeo_forge_recipes.storage here.
In https://github.com/pangeo-forge/staged-recipes/pull/53/commits/e377b76f09d079744111e53a7c5d40c358fdbc95, I changed your source file target to an instance of FSSpecTarget. (Btw, I pushed this commit directly to your PR branch.) FSSpecTarget is a parent class of CacheFSSpecTarget, in which the (in this case) problematic prefix is not added to the file paths (as seen here, fwiw).
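To illustrate the difference, here is a minimal sketch against the pangeo_forge_recipes.storage module as used in this thread (note _full_path is internal API, and the paths are hypothetical):

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget

fs = LocalFileSystem()
url = "/glade/campaign/cgd/cesm/CESM2-LE/timeseries/some_file.nc"  # hypothetical

# CacheFSSpecTarget flattens cached files into its root and prepends a hash of
# the source URL, so its paths never line up with files that already exist:
cache = CacheFSSpecTarget(fs, "/glade/scratch/mgrover/cache")
cache._full_path(url)   # '/glade/scratch/mgrover/cache/<hash>-some_file.nc'

# FSSpecTarget joins paths unchanged; since the source URLs are absolute, it
# resolves straight to the original GLADE files, i.e. caching is "skipped":
source = FSSpecTarget(fs, "/glade/scratch/mgrover/cache")
source._full_path(url)  # '/glade/campaign/cgd/cesm/CESM2-LE/timeseries/some_file.nc'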
So I'm curious, if you execute your recipe from this updated execution notebook, do you still get a FileNotFoundError when you call recipe.prepare_target()?
thanks @cisaacstern! Now I am running into this:
AttributeError: 'FSSpecTarget' object has no attribute 'getitems'
when running recipe.prepare_target()
Progress! (I hope :smile:)
Now I am running into this
AttributeError: 'FSSpecTarget' object has no attribute 'getitems'
Can you provide a full Traceback?
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-14-bbcc9bf6cc50> in <module>
----> 1 recipe.prepare_target()
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in prepare_target(self)
264 # Regardless of whether there is an existing dataset or we are creating a new one,
265 # we need to expand the concat_dim to hold the entire expected size of the data
--> 266 input_sequence_lens = self.calculate_sequence_lens()
267 n_sequence = sum(input_sequence_lens)
268 logger.info(f"Expanding target concat dim '{self._concat_dim}' to size {n_sequence}")
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in calculate_sequence_lens(self)
476 # get the sequence length of every file
477 # this line could become problematic for large (> 10_000) lists of files
--> 478 input_meta = self.get_input_meta(*self._inputs_chunks)
479 # use a numpy array to allow reshaping
480 all_lens = np.array([m["dims"][self._concat_dim] for m in input_meta.values()])
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in get_input_meta(self, *input_keys)
462 if self.metadata_cache is None:
463 raise ValueError("metadata_cache is not set.")
--> 464 return self.metadata_cache.getitems([_input_metadata_fname(k) for k in input_keys])
465
466 def input_position(self, input_key):
AttributeError: 'FSSpecTarget' object has no attribute 'getitems'
I think you want pangeo_forge_recipes.storage.MetadataTarget. I hit / am fixing this in the tutorials in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160.
Adding this in instead
import tempfile
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import FSSpecTarget, CacheFSSpecTarget, MetadataTarget

fs_local = LocalFileSystem()

# `direct` and `target_dir` are directory paths defined earlier in the notebook
cache_dir = tempfile.TemporaryDirectory()
# cache_target = CacheFSSpecTarget(fs_local, direct)
cache_target = FSSpecTarget(fs_local, direct)
# target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir)

meta_dir = tempfile.TemporaryDirectory()
meta_store = MetadataTarget(fs_local, meta_dir.name)

recipe.input_cache = cache_target
recipe.target = target
recipe.metadata_cache = meta_store

cache_target.root_path, target.root_path, meta_store.root_path
results in
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/mapping.py in getitems(self, keys, on_error)
89 try:
---> 90 out = self.fs.cat(keys2, on_error=oe)
91 except self.missing_exceptions as e:
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/spec.py in cat(self, path, recursive, on_error, **kwargs)
718 try:
--> 719 out[path] = self.cat_file(path, **kwargs)
720 except Exception as e:
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/spec.py in cat_file(self, path, start, end, **kwargs)
658 # explicitly set buffering off?
--> 659 with self.open(path, "rb", **kwargs) as f:
660 if start is not None:
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/spec.py in open(self, path, mode, block_size, cache_options, **kwargs)
967 cache_options=cache_options,
--> 968 **kwargs,
969 )
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/implementations/local.py in _open(self, path, mode, block_size, **kwargs)
131 self.makedirs(self._parent(path), exist_ok=True)
--> 132 return LocalFileOpener(path, mode, fs=self, **kwargs)
133
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/implementations/local.py in __init__(self, path, mode, autocommit, fs, **kwargs)
219 self.blocksize = io.DEFAULT_BUFFER_SIZE
--> 220 self._open()
221
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/implementations/local.py in _open(self)
224 if self.autocommit or "w" not in self.mode:
--> 225 self.f = open(self.path, mode=self.mode)
226 else:
FileNotFoundError: [Errno 2] No such file or directory: '/glade/scratch/mgrover/tmpnqia56i9/input-meta-0.json'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-4-bbcc9bf6cc50> in <module>
----> 1 recipe.prepare_target()
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in prepare_target(self)
264 # Regardless of whether there is an existing dataset or we are creating a new one,
265 # we need to expand the concat_dim to hold the entire expected size of the data
--> 266 input_sequence_lens = self.calculate_sequence_lens()
267 n_sequence = sum(input_sequence_lens)
268 logger.info(f"Expanding target concat dim '{self._concat_dim}' to size {n_sequence}")
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in calculate_sequence_lens(self)
476 # get the sequence length of every file
477 # this line could become problematic for large (> 10_000) lists of files
--> 478 input_meta = self.get_input_meta(*self._inputs_chunks)
479 # use a numpy array to allow reshaping
480 all_lens = np.array([m["dims"][self._concat_dim] for m in input_meta.values()])
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in get_input_meta(self, *input_keys)
462 if self.metadata_cache is None:
463 raise ValueError("metadata_cache is not set.")
--> 464 return self.metadata_cache.getitems([_input_metadata_fname(k) for k in input_keys])
465
466 def input_position(self, input_key):
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/storage.py in getitems(self, keys)
161 def getitems(self, keys: Sequence[str]) -> dict:
162 mapper = self.get_mapper()
--> 163 all_meta_raw = mapper.getitems(keys)
164 return {k: json.loads(raw_bytes) for k, raw_bytes in all_meta_raw.items()}
165
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/mapping.py in getitems(self, keys, on_error)
90 out = self.fs.cat(keys2, on_error=oe)
91 except self.missing_exceptions as e:
---> 92 raise KeyError from e
93 out = {
94 k: (KeyError() if isinstance(v, self.missing_exceptions) else v)
KeyError:
I think you want pangeo_forge_recipes.storage.MetadataTarget. I hit / am fixing this in the tutorials in pangeo-forge/pangeo-forge-recipes#160.
Amazing catch. And oops! This wouldn't have come up for Max if I had resolved: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/135#issuecomment-840719075
Adding that in instead results in
Yep, that's expected because there is also one other issue here, which is that we haven't actually cached any metadata (because we skipped caching). I am about to push a commit which should address this.
@mgrover1, I don't know if it would've worked to cache metadata to a TemporaryDirectory, but just to be safe I wrote https://github.com/pangeo-forge/staged-recipes/pull/53/commits/e6f62f81811efa31db2b8a5308f7d9b9584e8a30 as if you'd made a new directory called '/glade/scratch/mgrover/cesm2-le-metadata' and then instantiated a MetadataTarget with that path.
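Roughly, the relevant change looks like this (a sketch; recipe is the object from the execution notebook):

import os
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import MetadataTarget

meta_dir = "/glade/scratch/mgrover/cesm2-le-metadata"
os.makedirs(meta_dir, exist_ok=True)  # a persistent directory rather than a TemporaryDirectory

recipe.metadata_cache = MetadataTarget(LocalFileSystem(), meta_dir)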
Then, before preparing the target, I've added:
for input_name in recipe.iter_inputs():
    recipe.cache_input_metadata(input_name)
Can you see where running these changes before the call to recipe.prepare_target() gets us?
@cisaacstern we are in business!

Now the questions are:
- How can I automate this for an entire catalog of output?
- Would there be a good way to separate out the static grid variables (e.g. hyam, hybi, etc.)?
The first question might be solvable within make_full_path, but do I really need that if I already have the full path from the intake-esm catalog?
https://github.com/pangeo-forge/staged-recipes/pull/53/commits/d1193c12d69dfdbd1298ba84c9b8b3420149f982 adds the store_chunks and finalize_target steps. Without these, you just have the first time step (which is written in prepare_target).
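Putting that together, the full manual execution sequence now looks roughly like this (a sketch, using the recipe method names from the pangeo-forge-recipes version in this thread):

# cache metadata only; file caching is skipped since the data are on GLADE
for input_name in recipe.iter_inputs():
    recipe.cache_input_metadata(input_name)

recipe.prepare_target()  # initializes the store and writes the first time step

for chunk in recipe.iter_chunks():
    recipe.store_chunk(chunk)  # writes the remaining chunks

recipe.finalize_target()  # e.g. consolidates the zarr metadata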
When running this, I run into the following warning
/glade/u/home/mgrover/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/xarray/conventions.py:207: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
SerializationWarning,
Is this something to be concerned about when running this on the larger 1 TB+ zarr stores I plan on running this on?
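For reference, the coercion the warning suggests would look something like the sketch below; the helper name is hypothetical, and the real offending variable (reported as None here) would need to be identified first.

import xarray as xr

def coerce_object_vars(ds: xr.Dataset) -> xr.Dataset:
    # Cast any dtype=object variables to a fixed-size dtype before writing,
    # so xarray doesn't have to load them into memory to infer a safe type.
    obj_vars = [name for name, var in ds.variables.items() if var.dtype == object]
    for name in obj_vars:
        ds[name] = ds[name].astype("U64")  # choose a width appropriate to the data
    return ds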
Here is an example of what the make_filename would look like:
def make_filename(component, frequency, variable, experiment, forcing,
                  experiment_number, member_id, stream, time):
    return (
        f"/glade/campaign/cgd/cesm/CESM2-LE/timeseries/{component}/proc/tseries/"
        f"{frequency}/{variable}/b.e21.{experiment}{forcing}.f09_g17."
        f"LE2-{experiment_number}.{member_id}.{stream}.{variable}.{time}.nc"
    )
When running this, I run into the following warning
/glade/u/home/mgrover/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/xarray/conventions.py:207: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
SerializationWarning,
Is this something to be concerned about when running this on the larger 1 TB+ zarr stores I plan on running this on?
I've never seen this before. Seems like a question for @TomAugspurger or @rabernat.
Would there be a good way to separate out the static grid variables (e.g. hyam, hybi, etc.)?
Are these mirrored across every one of the source files? If so, you may be able to create a separate recipe for them and write them only once. Here it's worth noting that your cesm_le2_recipe.py can instantiate as many recipes as you want, as long as you wrap them all in a dictionary at the bottom of the file. So you could do:
# ... define recipes above, then ...
recipes = {
    "historical/atm": historical_atm_recipe,  # each dict value is an XarrayZarrRecipe instance
    "ssp370/atm": ssp370_atm_recipe,
    "grid": grid_recipe,
}
Then in the execution notebook:
from cesm_le2_recipe import recipes
for input_name in recipes["historical/atm"].iter_inputs():
    recipes["historical/atm"].cache_input_metadata(input_name)
# ... etc. ...
How can I automate this for an entire catalog of output? ... [this] may be able to be solved within the make_full_path
Yes! You can add dimensional complexity to your recipe by parameterizing additional components of the path returned from make_full_path. Your mock-up in https://github.com/pangeo-forge/staged-recipes/pull/53#issuecomment-867983934 is on exactly the right track to achieve this.
Then, each of these parameters (aside from time) becomes its own MergeDim as described here.
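For a concrete sense of how that fits together, here is a minimal sketch using a simplified, hypothetical three-parameter version of make_full_path; the key lists are placeholders for values that would come from the intake-esm catalog.

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim

# hypothetical key lists, standing in for values pulled from the intake-esm catalog
variables = ["TEMP", "SALT"]
member_ids = ["1231.001", "1231.002"]
times = ["185001-185912", "186001-186912"]

def make_full_path(variable, member_id, time):
    # simplified stand-in for the full make_filename above
    return (
        "/glade/campaign/cgd/cesm/CESM2-LE/timeseries/ocn/proc/tseries/month_1/"
        f"{variable}/b.e21.BHISTsmbb.f09_g17.LE2-{member_id}.pop.h.{variable}.{time}.nc"
    )

pattern = FilePattern(
    make_full_path,
    ConcatDim("time", keys=times),         # files are concatenated along time
    MergeDim("variable", keys=variables),  # each non-time parameter becomes a MergeDim
    MergeDim("member_id", keys=member_ids),
)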
but do I really need this if I have the full path from the intake-esm catalog?
Yep, you do need to parameterize these in the make_full_path function, just as you've already started to do in your comment above.
@mgrover1, I note in https://github.com/pangeo-forge/staged-recipes/pull/53#issuecomment-867983934 that you've given a frequency argument which also appears in the recipe here.
Assuming this refers to temporal resolution (monthly, daily, etc.), then each frequency will presumably need to be its own separate zarr store. Unless I'm missing something (which is possible), anything you define as a MergeDim will need to be the same length in the time dimension.
Yes - the zarr stores will be separated by component/frequency/cesm2-le.experiment.forcing.variable.zarr
How's this going, @mgrover1? Anything we can troubleshoot or is everything working as desired?
Just pinging this PR. Is this recipe still viable? Could we run it in our bakery?