intake-xarray
Subsetting before Caching
Question: Is it possible to subset a dataset (via the .sel method) before the data is cached?
Reasoning: My use case is that I would like to cache all the landsat8 data from the s3 repository for a small (~10 km × 10 km) research station. Currently my catalog looks like this (note that it subsets the landsat8 tiffs to 60 in total, but eventually I would want to use the entire timeseries):
plugins:
  source:
    - module: intake_xarray
sources:
  landsat8:
    description: Geotiff image of Landsat8 - TESTING.
    driver: rasterio
    cache:
      - argkey: urlpath
        regex: 'landsat-pds/c1/L8/033/031/'
        type: file
    args:
      urlpath: 's3://landsat-pds/c1/L8/033/031/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1_B{band:d}.TIF'
      chunks:
        band: 1
        x: 1000
        y: 1000
      concat_dim: band
      storage_options: {'anon': True}
Being able to subset before caching would reduce the amount of storage significantly (see example below for my 60 tiff subset):
c = intake.open_catalog('test.yml')
l8 = c.landsat8
l8_r = l8.read_chunked()
l8_r.data.nbytes*1e-9
133.64736048
l8_r_sm = l8_r.sel(x=slice(minx,maxx),y=slice(miny,maxy))
l8_r_sm.nbytes*1e-9
0.19845552000000002
Is there currently a way of doing this? If not, how difficult would it be to add this functionality? Is it adding an extra argument to the intake-xarray driver (which would look like: slice: {x:(xmin,xmax),y:(ymin,ymax)} in the catalog) or would it need to include modifications to the caching mechanisms deeper in intake?
Thanks!
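For illustration, here is a minimal sketch of how a catalog value shaped like the proposed slice: {x:(xmin,xmax),y:(ymin,ymax)} could be turned into keyword arguments for .sel. The helper name slices_from_spec and the spec values are hypothetical. This is not an existing intake-xarray option, and on its own it would only restrict what is read after opening, not what the file cache downloads.

```python
def slices_from_spec(spec):
    """Turn {'x': (xmin, xmax), ...} into {'x': slice(xmin, xmax), ...}.

    Hypothetical helper: `spec` mimics the `slice:` catalog argument
    proposed in the question, which intake-xarray does not define.
    """
    selectors = {}
    for dim, bounds in spec.items():
        lower, upper = bounds  # raises ValueError unless exactly two values
        selectors[dim] = slice(lower, upper)
    return selectors


# Made-up bounds; y runs high-to-low as it often does for north-up rasters.
spec = {"x": (0, 10), "y": (50, 40)}
selectors = slices_from_spec(spec)
print(selectors)
```

With the objects from the example above, the result would be applied as l8_r.sel(**selectors), which is equivalent to the explicit l8_r.sel(x=slice(minx,maxx),y=slice(miny,maxy)) call.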
Sorry for not replying sooner. The current caching mechanism does not cover this case: it operates before handing the paths to xarray, and gets all or none of the files. One option would be to use the experimental caching file system, which keeps local copies of only the files that are actually accessed, and only those parts of them that were accessed. I am not sure if loading to xarray involves touching all of the files, though. Something similar and custom could be written with a little effort, but more typical in the Intake context is to write a driver which handles the subset as an extra parameter, to produce a file-list and xarray of only the files of interest.
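To make "only those parts which were accessed" concrete, here is a toy read-through block cache in plain Python. It is a sketch of the general idea only, not fsspec's implementation. The class name BlockCachedReader, the block size, and the in-memory "remote" file are all assumptions made for the example.

```python
import io


class BlockCachedReader:
    """Toy read-through cache: a block is fetched from `remote` only
    the first time a read touches it, then served from memory."""

    def __init__(self, remote, size, blocksize=4):
        self.remote = remote        # any seekable binary file-like object
        self.size = size            # total length of the remote file in bytes
        self.blocksize = blocksize
        self.blocks = {}            # block index -> cached bytes

    def _block(self, index):
        if index not in self.blocks:
            self.remote.seek(index * self.blocksize)
            self.blocks[index] = self.remote.read(self.blocksize)
        return self.blocks[index]

    def read_range(self, start, end):
        """Return bytes [start, end), fetching only the blocks that overlap."""
        end = min(end, self.size)
        if start >= end:
            return b""
        first = start // self.blocksize
        last = (end - 1) // self.blocksize
        data = b"".join(self._block(i) for i in range(first, last + 1))
        offset = start - first * self.blocksize
        return data[offset:offset + (end - start)]


payload = bytes(range(32))  # stands in for a remote file
reader = BlockCachedReader(io.BytesIO(payload), size=len(payload), blocksize=4)
chunk = reader.read_range(5, 11)
print(chunk == payload[5:11], sorted(reader.blocks))
```

Reading bytes 5 to 10 touches only blocks 1 and 2, so those are the only ones stored. Whether this saves anything in practice depends on the reader requesting small ranges rather than whole files.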
Martin - thanks for your input!
In my use case, the region of interest is much smaller (geographically) than a single geotiff image. Therefore, I am unable to subset based on a file-list. Please let me know if I am misinterpreting your suggestion.
The experimental CachingFileSystem option would be excellent, assuming xarray does not trigger an entire file download... At the moment I am having an issue getting it to work with intake. Below is my current effort, with the traceback. Will the fsspec class be able to "drop in" as a new caching mechanism?
import intake
from fsspec.implementations import cached
CachingFileSystem = cached.CachingFileSystem
intake.source.cache.registry['file_system']=CachingFileSystem
intake.source.cache.registry
{'file': intake.source.cache.FileCache, 'dir': intake.source.cache.DirCache, 'compressed': intake.source.cache.CompressedCache, 'dat': intake.source.cache.DATCache, 'file_system': fsspec.implementations.cached.CachingFileSystem}
c = intake.open_catalog('test.yml')
l8 = c.landsat8
l8 = l8.to_dask()
Traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-655b61aa6a66> in <module>
1 l8 = c.landsat8
----> 2 l8 = l8.to_dask()
/opt/conda/lib/python3.7/site-packages/intake/catalog/entry.py in __getattr__(self, attr)
119 return self.__dict__[attr]
120 else:
--> 121 return getattr(self._get_default_source(), attr)
122
123 def __dir__(self):
/opt/conda/lib/python3.7/site-packages/intake/catalog/entry.py in _get_default_source(self)
91 """Instantiate DataSource with default agruments"""
92 if self._default_source is None:
---> 93 self._default_source = self()
94 return self._default_source
95
/opt/conda/lib/python3.7/site-packages/intake/catalog/entry.py in __call__(self, persist, **kwargs)
76 raise ValueError('Persist value (%s) not understood' % persist)
77 persist = persist or self._pmode
---> 78 s = self.get(**kwargs)
79 if s.has_been_persisted and persist is not 'never':
80 s2 = s.get_persisted()
/opt/conda/lib/python3.7/site-packages/intake/catalog/local.py in get(self, **user_parameters)
274 """Instantiate the DataSource for the given parameters"""
275 plugin, open_args = self._create_open_args(user_parameters)
--> 276 data_source = plugin(**open_args)
277 data_source.catalog_object = self._catalog
278 data_source.name = self.name
/opt/conda/lib/python3.7/site-packages/intake_xarray/raster.py in __init__(self, urlpath, chunks, concat_dim, xarray_kwargs, metadata, path_as_pattern, **kwargs)
51 self._kwargs = xarray_kwargs or {}
52 self._ds = None
---> 53 super(RasterIOSource, self).__init__(metadata=metadata)
54
55 def _open_files(self, files):
/opt/conda/lib/python3.7/site-packages/intake/source/base.py in __init__(self, metadata)
79 catdir=self.metadata.get('catalog_dir',
80 None),
---> 81 storage_options=storage_options)
82 self.datashape = None
83 self.dtype = None
/opt/conda/lib/python3.7/site-packages/intake/source/cache.py in make_caches(driver, specs, catdir, cache_dir, storage_options)
555 driver, spec, catdir=catdir, cache_dir=cache_dir,
556 storage_options=storage_options)
--> 557 for spec in specs]
/opt/conda/lib/python3.7/site-packages/intake/source/cache.py in <listcomp>(.0)
555 driver, spec, catdir=catdir, cache_dir=cache_dir,
556 storage_options=storage_options)
--> 557 for spec in specs]
TypeError: __init__() got an unexpected keyword argument 'catdir'
Catalog:
plugins:
  source:
    - module: intake_xarray
sources:
  landsat8:
    description: Geotiff image of Landsat Surface Reflectance Level-2 Science Product L5.
    driver: rasterio
    cache:
      - argkey: urlpath
        regex: 'landsat-pds/c1/L8/033/031/'
        type: file_system
    args:
      urlpath: 's3://landsat-pds/c1/L8/033/031/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1_B{band:d}.TIF'
      chunks:
        band: 1
        x: 1000
        y: 1000
      concat_dim: band
      storage_options: {'anon': True}
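The TypeError in the traceback can be reproduced without intake or fsspec. The call shape in make_caches below is copied from the traceback frames (driver, spec, catdir=..., cache_dir=..., storage_options=...). The two stand-in classes and their constructor parameters are invented for the example and are not the real signatures of FileCache or CachingFileSystem.

```python
class FileCacheLike:
    """Stand-in for a cache class whose constructor matches the call
    made by make_caches (invented for this example)."""

    def __init__(self, driver, spec, catdir=None, cache_dir=None,
                 storage_options=None):
        self.driver = driver
        self.spec = spec


class FileSystemLike:
    """Stand-in for a class with an unrelated constructor; the
    parameter names here are made up."""

    def __init__(self, protocol=None, storage="TMP"):
        self.protocol = protocol
        self.storage = storage


registry = {"file": FileCacheLike, "file_system": FileSystemLike}


def make_caches(driver, specs, catdir=None, cache_dir=None,
                storage_options=None):
    # Same call shape as the list comprehension shown in the traceback.
    return [
        registry[spec["type"]](
            driver, spec, catdir=catdir, cache_dir=cache_dir,
            storage_options=storage_options)
        for spec in specs
    ]


caches = make_caches("rasterio", [{"type": "file"}])

try:
    make_caches("rasterio", [{"type": "file_system"}])
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

Any class placed in the registry is constructed with those keyword arguments, so a class that was not written against that interface fails at instantiation, before any data is touched.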
The CachingFileSystem is not an Intake cache implementation, but lower-level, providing python file-like objects. It could be made into a type of cache for Intake too, but I haven't got that far yet. Unfortunately, rasterio can't read from python file-like objects; it needs a real, local file, which is exactly why the Intake caching scheme was designed to download whole files.
At the moment, I think the only solution would be to use this filesystem implementation, or s3 directly, via FUSE - which is usually painful and slower than you might think. Maybe worth a go, though?
It does give me an idea: https://github.com/intake/filesystem_spec/issues/102 , which would not be too hard, I think.
I think for zarr only the chunks requested are actually cached. For nc files the whole file is cached. @observingClouds