Coregistration gets stuck using more and more memory
Expected behavior
Expect to be able to coregister an SST dataset (regionally subsetted) with a global cloud dataset.
Actual behavior
@forman @JanisGailis The coregister operation starts and uses more and more memory. Eventually the laptop freezes. The only way to stop it is to cancel the coregistration operation or end the process using Task Manager. The same happens when the global cloud dataset is the master and SST is the slave dataset.
Steps to reproduce the problem
- Start Windows Task Manager, select Processes tab, sort by Memory column in descending order
- Open cate GUI and download following dataset:
  esacci.SST.day.L4.SSTdepth.multi-sensor.multi-platform.OSTIA.1-1.r1
  Time: from 2004-01-01 to 2005-12-31
  Region: lat=[-10,10], lon=[-175,-115]
  No variable constraints
- Download following dataset:
  esacci.CLOUD.mon.L3C.CLD_PRODUCTS.multi-sensor.multi-platform.ATSR2-AATSR.2-0.r1
  Time: from 2004-01-01 to 2005-01-01
  No regional constraints
  No variable constraints
- Select coregister operation.
  ds_master = ds_1 (SST), ds_slave = ds_2 (cloud)
  method_us, method_ds: (use defaults)
- Click "Add Step". Look on Task Manager. python.exe process uses more and more memory.
Note
The SST and cloud datasets are the same ones used by @kjpearson in issue #733. In that case he reported (when using cate-2.0.0-dev.16) that the operation completed, but the data for the whole globe in the cloud dataset had been remapped down to the subregion in the SST dataset.
Specifications
cate-2.0.0-dev.20 Windows 7 Professional
@forman @JanisGailis @kjpearson Same problem is seen when coregistering following cloud datasets (without any regional subsetting):
- esacci.CLOUD.mon.L3C.CLD_PRODUCTS.multi-sensor.multi-platform.ATSR2-AATSR.2-0.r1 [2004-01-01, 2005-01-01]
- esacci.CLOUD.mon.L3C.CLD_PRODUCTS.AVHRR.multi-platform.AVHRR-PM.2-0.r1 [2004-01-01, 2004-05-01]
@forman I have investigated this. The problem is with using the gridtools library. Apparently, dask doesn't work the way we thought it does. You cannot pass a slice to something that makes a new np array and then stitch those together. dask only does out-of-core processing on actual calculations applied to a dask array.
As it currently stands, it looks like this from xarray land:
- There's an xarray dataset using dask as backend.
- A tiny slice of it is loaded into memory as a numpy array and thrown into a black hole (gridtools).
- Out of the black hole comes a new numpy array.
- A new set of xr.DataArrays are constructed from these numpy arrays coming out of the black hole in memory!
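To make the difference concrete, here is a minimal sketch of the two patterns. It is not the cate code: `black_box` is a hypothetical stand-in for a gridtools-style routine, and the array shape and chunking are made up for illustration.

```python
import dask.array as da
import numpy as np


def black_box(block: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a gridtools-style routine: NumPy in, new NumPy out."""
    return block * 2.0


# A lazily evaluated array; no chunk has been computed yet.
lazy = da.ones((8, 100, 100), chunks=(1, 100, 100))

# Pattern described above: pull each slice into memory, push it through the
# black box, and stitch the outputs together. The result is an ordinary
# in-memory NumPy array that is as large as the whole output dataset.
eager_result = np.stack(
    [black_box(np.asarray(lazy[t])) for t in range(lazy.shape[0])]
)

# Alternative: register the same per-chunk call as a dask operation, so the
# result stays a dask array and chunks are only materialised on demand.
lazy_result = lazy.map_blocks(black_box, dtype=lazy.dtype)

print(type(eager_result).__name__)
print(type(lazy_result).__name__)
```

Note that this only works out of the box because `black_box` keeps the chunk shape. A regridding routine changes the shape of each block, so the output chunks would have to be declared to dask explicitly, and anything that needs neighbouring chunks would not fit this pattern at all.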
There are two possible solutions I have come up with:
- Rewrite coregistration without relying on gridtools, e.g., use xarray and dask native capabilities. xarray now has resampling implemented that does nearest_neighbor and bilinear resampling, which could be used for upsampling. For downsampling, aggregated rolling operations across dimensions can be used to do tricky things. I've implemented a preliminary non-weighted mean downsampler with it. There's an exploratory branch jg-799-coreg-memhog.
  1.1. xarray built-in resampling doesn't know how to work across dask chunks. E.g., you cannot upsample a large subset (or an unlucky subset) of a finely grained dataset, such as SST, using this method.
  1.2. Handling nan values is tricky, as many np operations meant for working with masked arrays call np.copy, which would result in an 'in-memory' dataset again.
  1.3. Re-implementing all the functionality we have now due to using gridtools won't be fast, and in some cases will be impossible.
- Use a dirty hack to use np.memmap as the underlying array structure for coregistered datasets: https://stackoverflow.com/questions/44733067/do-xarray-or-dask-really-support-memory-mapping I haven't tried this yet, but it 'could' work.
  2.1. Some time in the future, changes in xarray could easily break the undocumented features needed for this to work.
  2.2. We have to implement additional tempfile handling. What happens when we save the workflow? Do we save the temp file too? What happens when we save the dataset into a netcdf? Get rid of the tempfile? Where do we put the tempfile across platforms, etc.
  2.3. Something might not work, or work in an unexpected way, due to using an undocumented feature.
  2.4. The coregistered dataset will take space on disk. So, coregistering a dataset that spans a long time span to a very fine grid will silently eat away the available disk space.
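For the first option, a non-weighted mean downsampling that stays lazy could look roughly like the sketch below. This is not the code on the exploratory branch. It uses `dask.array.coarsen` rather than rolling operations, and the grid size, chunking, and 2 x 4 block factors are assumptions chosen only to keep the example small.

```python
import dask.array as da
import numpy as np

# Hypothetical fine grid: 2 time steps on an 8 x 16 grid, one chunk per time step.
fine = (
    da.arange(2 * 8 * 16, dtype="float64")
    .reshape((2, 8, 16))
    .rechunk((1, 8, 16))
)

# Non-weighted mean over blocks of 2 x 4 cells along the two spatial axes.
# np.nanmean ignores NaN cells inside a block instead of propagating them.
coarse = da.coarsen(np.nanmean, fine, {1: 2, 2: 4})

# Nothing has been computed so far; compute() evaluates chunk by chunk.
result = coarse.compute()

# Cross-check against a plain NumPy block mean of the same data.
fine_np = np.arange(2 * 8 * 16, dtype="float64").reshape((2, 8, 16))
reference = fine_np.reshape(2, 4, 2, 4, 4).mean(axis=(2, 4))

print(result.shape)
print(np.allclose(result, reference))
```

The block factors have to divide the grid (and chunk) sizes evenly here, and the mean is not area-weighted, so this covers only a small part of what gridtools currently does.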
In either case, fixing this definitively will not be trivial and will be a significant effort.
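For the second option, the NumPy side of the np.memmap idea can be sketched as follows. The file name, grid shape, and per-slice fill are hypothetical, and the sketch deliberately stops before wrapping the array in an xarray object, since that is the part that relies on undocumented behaviour.

```python
import os
import tempfile

import numpy as np

# Hypothetical output grid: 4 time steps on a 180 x 360 grid.
shape = (4, 180, 360)

# Where the temp file lives is one of the open questions listed above;
# a fresh temporary directory is used here only for illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "coregistered.dat")

# Disk-backed output array: pages are written to the file rather than
# having to stay resident in RAM.
out = np.memmap(path, dtype="float32", mode="w+", shape=shape)

for t in range(shape[0]):
    # Stand-in for the regridded result of one time slice.
    out[t] = np.full(shape[1:], t, dtype="float32")

out.flush()

# Reopen read-only; values are read from disk on access.
view = np.memmap(path, dtype="float32", mode="r", shape=shape)
print(view[3, 0, 0])

# The file occupies the full uncompressed array size on disk.
print(os.path.getsize(path))
```

The last line illustrates point 2.4: the file size is the product of the grid dimensions and the item size, so a fine target grid over a long time span grows quickly.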
See also https://github.com/pydata/xarray/issues/486