staged-recipes icon indicating copy to clipboard operation
staged-recipes copied to clipboard

Draft for FIO-COM32 recipe

Open roxyboy opened this issue 4 years ago • 6 comments

This will (probably) be the last model to be added to the regional SWOT-AdAC data set. The outputs are from the ocean model of First Institute of Oceanography, China.

roxyboy avatar Dec 03 '21 16:12 roxyboy

@cisaacstern The sftp server is password protected. Can I send you the password?

roxyboy avatar Dec 03 '21 16:12 roxyboy

Can I send you the password?

Yep! https://keybase.io/cstern

cisaacstern avatar Dec 03 '21 16:12 cisaacstern

Yep! https://keybase.io/cstern

I sent you it on Keybase.

roxyboy avatar Dec 03 '21 17:12 roxyboy

@martindurant, with fsspec 2021.11.1 caching throughput over sftp for this recipe was in the range of 0.06 - 0.1 MB/s.

https://github.com/paramiko/paramiko/issues/175 lead me to try f.prefetch() in SFTPFileSystem._open, which increased throughput ~ 30 fold into the range of 2.7 - 3.4 MB/s. This appears to be more-or-less the upper threshold imposed by the server, as it matches what I saw over an sftp interactive session in the terminal:

sftp> get Region01-hourly-v.FMA.nc
Fetching /shared/SWOT-AdAC/Region01-hourly-v.FMA.nc to Region01-hourly-v.FMA.nc
01:06/shared/SWOT-AdAC/Region01-hourly-v.FMA.nc                       31%  125MB   2.5MB/s   01:49 ETA

Are you open to my submitting this line as an fsspec PR? I'd be curious to know if there's any reason why someone would not want to prefetch. If such cases exist, I suppose we could do

if prefetch:
    f.prefetch()

and users could pass prefetch as an fsspec kwarg (given that paramiko doesn't appear to support a prefetch kwarg itself).

Note, I tried most other fixes I could find on the paramiko Issue board, including manipulating default_window_size, compression, etc., but this was the only change which appeared to make any noticeable difference for this case.

cisaacstern avatar Dec 09 '21 19:12 cisaacstern

@roxyboy, as of https://github.com/pangeo-data/swot_adac_ogcms/pull/10, the FIO-COM32 datasets are now accessible via our project catalog. I've updated the example notebook to demonstrate accessing these data:

https://github.com/pangeo-data/swot_adac_ogcms/blob/main/intake_demo.ipynb

The access pattern is the same as for all of the other datasets in our catalog, so should be quite familiar to you by now, but of course just let me know if any issues arise.

In terms of the recipe itself, I ended up pushing a relatively extensive rewrite of the initial draft in https://github.com/pangeo-forge/staged-recipes/pull/97/commits/4cbe5779a49ff8133cc9b23d591cb494e81fbb5d. In addition to tweaking input subsets and target chunks, the main focus of this edit was to make sure that all of the input variables were included for each dataset. In your original draft of the recipe you used pattern_from_file_sequence to assemble the inputs, however this convenience method is only designed for concatenating inputs, not merging them. To get the merging which, IIUC, is required for this recipe, I rewrote the recipe with the longer-form MergeDim, FilePattern, etc. approach.

For the sake of illustration, note the difference between the inputs in the original and current recipe versions below:

from original import recipes as original_recipes
from recipe import recipes as current_recipes

recipe_key = 'FIO-COM32/Region01/surface_hourly/fma'
original_recipe = original_recipes[recipe_key.replace("fma", "FMA")]
current_recipe = current_recipes[recipe_key]

for version, recipe in dict(original=original_recipe, current=current_recipe).items():
    print(f"\n{version}:")
    for k, v in recipe.file_pattern.items():
        print(k, v)
original:
time-0 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-v.FMA.nc

current:
time-0,variable-0 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-salt.FMA.nc
time-0,variable-1 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-temp.FMA.nc
time-0,variable-2 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-ssh.FMA.nc
time-0,variable-3 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-u.FMA.nc
time-0,variable-4 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-v.FMA.nc

You can see that the input generated by the original approach is missing the full variable scope. The current inputs are reflective of the recipe used to build the Zarr stores in the swot_adac_ogcms project catalog.

Once you gotten a chance to review the FIO-COM32 Zarr stores, could you confirm that these datasets look the way you'd expect them to? (I'm pretty confident this this variable merging rewrite is how the data should look, but I haven't cross-referenced it with you yet, so it's possible I'm overlooking something.)

cisaacstern avatar Dec 21 '21 00:12 cisaacstern

You can see that the input generated by the original approach is missing the full variable scope. The current inputs are reflective of the recipe used to build the Zarr stores in the swot_adac_ogcms project catalog.

Once you gotten a chance to review the FIO-COM32 Zarr stores, could you confirm that these datasets look the way you'd expect them to? (I'm pretty confident this this variable merging rewrite is how the data should look, but I haven't cross-referenced it with you yet, so it's possible I'm overlooking something.)

@cisaacstern Thanks as always for the technical support and fluxing of data. I've been able to read and plot the FIOCOM data! :) I'll keep in mind about the merging files for future reference.

roxyboy avatar Dec 21 '21 17:12 roxyboy