Draft for FIO-COM32 recipe
This will (probably) be the last model to be added to the regional SWOT-AdAC data set. The outputs are from the ocean model of the First Institute of Oceanography, China.
@cisaacstern The sftp server is password protected. Can I send you the password?
Yep! https://keybase.io/cstern
I sent it to you on Keybase.
@martindurant, with fsspec 2021.11.1 caching throughput over sftp for this recipe was in the range of 0.06 - 0.1 MB/s.
https://github.com/paramiko/paramiko/issues/175 led me to try f.prefetch() in SFTPFileSystem._open, which increased throughput roughly 30-fold, into the range of 2.7 - 3.4 MB/s. This appears to be more or less the upper threshold imposed by the server, as it matches what I saw in an interactive sftp session in the terminal:
sftp> get Region01-hourly-v.FMA.nc
Fetching /shared/SWOT-AdAC/Region01-hourly-v.FMA.nc to Region01-hourly-v.FMA.nc
01:06/shared/SWOT-AdAC/Region01-hourly-v.FMA.nc 31% 125MB 2.5MB/s 01:49 ETA
Are you open to my submitting this line as an fsspec PR? I'd be curious to know if there's any reason why someone would not want to prefetch. If such cases exist, I suppose we could do
if prefetch:
    f.prefetch()
and users could pass prefetch as an fsspec kwarg (given that paramiko doesn't appear to support a prefetch kwarg itself).
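For concreteness, here's a minimal sketch of that conditional. The StubSFTPFile class and sftp_open function below are hypothetical stand-ins for paramiko's SFTPFile and fsspec's SFTPFileSystem._open, so the control flow can run without a server:

```python
# Sketch of the proposed change. In the real fsspec code,
# `self.ftp.open(path, mode)` returns a paramiko SFTPFile.

class StubSFTPFile:
    """Stand-in for paramiko.SFTPFile (hypothetical, for illustration)."""

    def __init__(self):
        self.prefetch_called = False

    def prefetch(self):
        # In paramiko, this issues read-ahead requests for the whole file,
        # which is what produced the ~30-fold throughput improvement.
        self.prefetch_called = True


def sftp_open(path, mode="rb", prefetch=True):
    """Simplified analogue of SFTPFileSystem._open with a prefetch kwarg."""
    f = StubSFTPFile()  # real code: self.ftp.open(path, mode)
    if prefetch and "r" in mode:
        f.prefetch()
    return f
```

With prefetch defaulting to True, existing callers get the faster behavior automatically, while anyone with a reason not to read ahead can opt out.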
Note: I tried most of the other fixes I could find on the paramiko issue board, including manipulating default_window_size, compression, etc., but this was the only change that made any noticeable difference in this case.
@roxyboy, as of https://github.com/pangeo-data/swot_adac_ogcms/pull/10, the FIO-COM32 datasets are now accessible via our project catalog. I've updated the example notebook to demonstrate accessing these data:
https://github.com/pangeo-data/swot_adac_ogcms/blob/main/intake_demo.ipynb
The access pattern is the same as for all of the other datasets in our catalog, so should be quite familiar to you by now, but of course just let me know if any issues arise.
In terms of the recipe itself, I ended up pushing a relatively extensive rewrite of the initial draft in https://github.com/pangeo-forge/staged-recipes/pull/97/commits/4cbe5779a49ff8133cc9b23d591cb494e81fbb5d. In addition to tweaking input subsets and target chunks, the main focus of this edit was to make sure that all of the input variables were included for each dataset. Your original draft of the recipe used pattern_from_file_sequence to assemble the inputs; however, this convenience method is designed only for concatenating inputs, not merging them. To get the merging that, IIUC, is required for this recipe, I rewrote it with the longer-form MergeDim, FilePattern, etc. approach.
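As a rough sketch of what the merge dimension adds, the snippet below uses only itertools to emulate how a FilePattern with one ConcatDim ("time") and one MergeDim ("variable") expands into one input per (time, variable) pair. The dimension names and the file-name template are inferred from the paths in this thread, not copied from the recipe:

```python
from itertools import product

# Emulate a FilePattern built from a ConcatDim ("time") and a
# MergeDim ("variable"): the pattern is the Cartesian product of the two.
times = ["FMA"]  # a single seasonal chunk, per the file names in this thread
variables = ["salt", "temp", "ssh", "u", "v"]

def make_url(season, var):
    # Hypothetical template matching the server paths seen in this thread
    return (
        "sftp://[email protected]:/shared/SWOT-AdAC/"
        f"Region01-hourly-{var}.{season}.nc"
    )

inputs = {
    (f"time-{i}", f"variable-{j}"): make_url(season, var)
    for (i, season), (j, var) in product(enumerate(times), enumerate(variables))
}

for key, url in inputs.items():
    print(key, url)
```

A concat-only pattern would collapse the variable axis and pick up just one of the five files, which is exactly the gap the rewrite closes.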
For the sake of illustration, note the difference between the inputs in the original and current recipe versions below:
from original import recipes as original_recipes
from recipe import recipes as current_recipes
recipe_key = 'FIO-COM32/Region01/surface_hourly/fma'
original_recipe = original_recipes[recipe_key.replace("fma", "FMA")]
current_recipe = current_recipes[recipe_key]
for version, recipe in dict(original=original_recipe, current=current_recipe).items():
    print(f"\n{version}:")
    for k, v in recipe.file_pattern.items():
        print(k, v)
original:
time-0 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-v.FMA.nc
current:
time-0,variable-0 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-salt.FMA.nc
time-0,variable-1 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-temp.FMA.nc
time-0,variable-2 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-ssh.FMA.nc
time-0,variable-3 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-u.FMA.nc
time-0,variable-4 sftp://[email protected]:/shared/SWOT-AdAC/Region01-hourly-v.FMA.nc
You can see that the input generated by the original approach is missing the full variable scope. The current inputs are reflective of the recipe used to build the Zarr stores in the swot_adac_ogcms project catalog.
Once you've gotten a chance to review the FIO-COM32 Zarr stores, could you confirm that these datasets look the way you'd expect them to? (I'm fairly confident that this variable-merging rewrite is how the data should look, but I haven't cross-referenced it with you yet, so it's possible I'm overlooking something.)
@cisaacstern Thanks as always for the technical support and for fluxing the data. I've been able to read and plot the FIO-COM data! :) I'll keep the note about merging files in mind for future reference.