
Resolution-based chunk sizes for colocated instruments with multiple resolutions

Open mraspaud opened this issue 2 years ago • 8 comments

The following was written by @djhoese as a comment in a PR:

I'm going to summarize what was discussed on slack to the best of my ability:

I generally disagree and disapprove of the idea of not having chunks be explicitly aligned with the file or segment chunk size. For example, AHI 500m data has 2200 rows per segment so I think that chunk size should be 550 or 1100 or 2200 or some other 2200-aligned value. So I'm not in favor of removing previous_chunks from the current PR, but have other long term suggestions too (see below).
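To make "2200-aligned" concrete, here is a hypothetical helper (not Satpy code) that picks a segment-aligned chunk size: a multiple of the segment row count when the request is at least one segment, otherwise the largest divisor of the segment row count that fits.

```python
def segment_aligned_chunk_size(requested, segment_rows):
    """Pick a chunk size aligned with the file/segment row count.

    Hypothetical sketch: returns a multiple of segment_rows when the
    request is at least one segment, otherwise the largest divisor of
    segment_rows that fits within the request.
    """
    if requested >= segment_rows:
        return segment_rows * (requested // segment_rows)
    return max(d for d in range(1, segment_rows + 1)
               if segment_rows % d == 0 and d <= requested)

# AHI 500m data has 2200 rows per segment:
print(segment_aligned_chunk_size(2048, 2200))  # -> 1100
print(segment_aligned_chunk_size(600, 2200))   # -> 550
```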

So far the three of us (@simonrp84, @mraspaud, and me) have all run tests I think, but we're all testing slightly different things too. Simon was doing a true color composite with _nocorr so no rayleigh and no sunz_corrected. Additionally he is doing native resampling to the scn.coarsest_area. Martin I think is doing the true_color (which includes sunz and rayleigh) and likely doing native to the finest area. I have been using my Geo2Grid tool which has default chunk sizes (2200 for AHI) and I believe a different enhancement.

Martin had tracked down a memory usage increase to this commit:

ea980d891ff3284c10d51c73bed0fc5e2487f134 is the first bad commit
commit ea980d891ff3284c10d51c73bed0fc5e2487f134
Author: David Hoese <[email protected]>
Date:   Sun Nov 20 16:54:50 2022 -0600

    Remove unnecessary rechunking of replicated native resampler

 satpy/resample.py | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

This commit was meant to avoid unnecessary rechunking in the cases where the native resampler was replicating pixels from a coarse resolution to a finer resolution. So in terms of tracking down some of the performance issues that were introduced (admittedly by me in this commit), doing native resampling to coarsest_area() won't show anything, as this commit was primarily concerned with replication (when you do native to finest_area()).

Next, I brought up two concerns with regard to chunk sizes in the reader:

  1. Zarr, which is used for caching lon/lat and angle arrays, currently requires chunks to be regularly sized except for the last chunk. In the past I added a workaround for this in the zarr caching to rechunk arrays-to-be-cached before saving them. This works and avoids an error from zarr, but it is a performance hit, as rechunking has to be done when it otherwise would not be if chunk sizes were different.
  2. Native resampling, and any other operations combining different resolutions of bands, generally performs better when the chunk sizes are resolution-based. So if the 500m AHI B03 has a chunk size of 2200 then the 1km bands should have a chunk size of 1100. Then when the native resampler replicates the pixels/arrays they will all be the same chunk size representing the same geographic region. Note, this does not require regular chunking, just a relationship between the bands.
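The relationship in point 2 can be sketched with a hypothetical helper (names are illustrative, not Satpy API): derive each band's chunk size from the finest resolution's chunk size so chunks always cover the same geographic region.

```python
def res_based_chunk_size(base_chunk, base_resolution_m, resolution_m):
    """Hypothetical: scale a chunk size so it covers the same geographic
    region at a coarser resolution, e.g. 2200 @ 500m -> 1100 @ 1km."""
    return int(base_chunk * base_resolution_m / resolution_m)

print(res_based_chunk_size(2200, 500, 1000))  # -> 1100
print(res_based_chunk_size(2200, 500, 2000))  # -> 550
```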

I did some testing with AHI and noted that in the case of main with a chunk size set to something like 2200, you would get the 500m band with chunk sizes of 2200 in both dimensions, but the 1km bands would be 1100 in the row dimension (the size of one segment) and 2200 in the column dimension. This means after native resampling (replication) the column dimensions would be 2200 for 500m (no change) and 4400 for 1km. This means all future operations require dask to hold on to certain arrays and then concatenate/append them together to merge chunks, which is a waste of performance.
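The chunk mismatch can be reproduced with plain dask (array shapes here are illustrative, not exact AHI sizes): replicating a 1km-style array chunked at (1100, 2200) by 2 in each dimension yields (2200, 4400) chunks, which no longer line up with a 500m array chunked at (2200, 2200).

```python
import dask.array as da
import numpy as np

# 1km-style band: one segment (1100 rows) per row-chunk, 2200-column chunks
km1 = da.zeros((11000, 11000), chunks=(1100, 2200), dtype=np.float32)

# Native resampling by replication doubles the extent of every chunk
replicated = da.repeat(da.repeat(km1, 2, axis=0), 2, axis=1)
print(replicated.chunks[0][0], replicated.chunks[1][0])  # 2200 4400
```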

I noted in my tests with different AHI chunk sizes that 2200 would cause a major peak memory usage of ~11GB, but 2048 didn't see a large peak in usage and only a maximum of about 6GB. I then hacked the reader in my local install to do chunking based on resolution (2200 for 500m, 1100 for 1km, 550 for 2km) and memory usage dropped to about a maximum of 5.5GB and much less usage early on (gradual slope rather than sustained 5-6GB). The execution time dropped from ~180s to ~150s. Doing a chunk size starting at 2048 gave similar results.

I then tried doing this resolution-based chunking with ABI. The execution time increased from about 1m54s to 2m04s, but memory usage was cut in half, from ~8GB to ~4GB. It also had a similar gradual increase in memory versus a mostly sustained usage of memory.

Originally posted by @djhoese in https://github.com/pytroll/satpy/issues/2545#issuecomment-1693872852

mraspaud avatar Aug 30 '23 08:08 mraspaud

To this, I would like to add that we need a way or a convention to specify the chunk size using dask standards that still makes sense for multiple resolutions. For example, dask allows specifying chunk size limits in text format, e.g. 32MB. This is really convenient for users imo, however, we need to be careful about how this then translates for multiple resolutions. Is 32MB a min, a max, or an average of the chunk size across the different resolutions?

mraspaud avatar Aug 30 '23 08:08 mraspaud

I would say it should be a max chunk memory.

simonrp84 avatar Aug 30 '23 08:08 simonrp84

Agreed with @simonrp84 and that's how I've been using it in test implementations (I'm not sure I have any readers that actively do this). I think that's how dask approaches the chunk size too. That is, when it chunks an array the max size of a chunk will be that size but there might be smaller chunks off the end. The other main goal would be to control network traffic when running distributed jobs where chunk size could have a major effect on the speed that data gets sent between nodes.
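One way to implement the "max chunk memory" interpretation (a sketch with made-up helper names, not an agreed design): compute a square chunk side that fits the byte cap at the finest resolution, then scale it down for coarser resolutions so the geographic footprint stays constant and the cap remains a true maximum.

```python
import math

def chunk_sizes_from_byte_limit(max_bytes, itemsize, resolutions_m):
    """Hypothetical: per-resolution square chunk side under a byte cap.

    The finest resolution gets the largest square chunk that fits in
    max_bytes; coarser resolutions are scaled so chunks cover the same
    geographic area (and therefore stay under the cap as well).
    """
    finest = min(resolutions_m)
    side = int(math.sqrt(max_bytes / itemsize))
    return {res: max(1, side * finest // res) for res in resolutions_m}

# 32 MB cap, 32-bit floats, AHI-like resolutions
print(chunk_sizes_from_byte_limit(32 * 1024**2, 4, [500, 1000, 2000]))
# -> {500: 2896, 1000: 1448, 2000: 724}
```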

djhoese avatar Aug 30 '23 11:08 djhoese

I need some sanity checking. I could have sworn I commented on some issue or pull request with a diff on the AHI HSD reader and maybe the ABI L1b reader to make them do resolution-based chunking, but now I can't find the comment. Does anyone else know where it might be?

djhoese avatar Oct 02 '23 02:10 djhoese

It was a comment on slack, here is the diff for AHI HSD, I didn't comment about ABI because iirc it was a hack and needed to be more elegant:

Subject: [PATCH] Make AHI HSD use resolution-based chunking
---
Index: satpy/readers/ahi_hsd.py
===================================================================
diff --git a/satpy/readers/ahi_hsd.py b/satpy/readers/ahi_hsd.py
--- a/satpy/readers/ahi_hsd.py	(revision 6326b0358a8941cb509f3cd14f761d64807c9e3d)
+++ b/satpy/readers/ahi_hsd.py	(date 1692974432108)
@@ -616,13 +616,14 @@
 
         return header
 
-    def _read_data(self, fp_, header):
+    def _read_data(self, fp_, header, resolution):
         """Read data block."""
         nlines = int(header["block2"]['number_of_lines'][0])
         ncols = int(header["block2"]['number_of_columns'][0])
+        chunk_size = CHUNK_SIZE * (500 / resolution)
         return da.from_array(np.memmap(self.filename, offset=fp_.tell(),
                                        dtype='<u2', shape=(nlines, ncols), mode='r'),
-                             chunks=CHUNK_SIZE)
+                             chunks=chunk_size)
 
     def _mask_invalid(self, data, header):
         """Mask invalid data."""
@@ -638,7 +639,7 @@
         """Read the data."""
         with open(self.filename, "rb") as fp_:
             self._header = self._read_header(fp_)
-            res = self._read_data(fp_, self._header)
+            res = self._read_data(fp_, self._header, key["resolution"])
         res = self._mask_invalid(data=res, header=self._header)
         res = self.calibrate(res, key['calibration'])
 
@@ -667,7 +668,7 @@
             units=ds_info['units'],
             standard_name=ds_info['standard_name'],
             wavelength=ds_info['wavelength'],
-            resolution='resolution',
+            resolution=ds_info['resolution'],
             id=key,
             name=key['name'],
             platform_name=self.platform_name,

djhoese avatar Oct 02 '23 14:10 djhoese

Here's my experimentation with AHI HSD after changes from #2545. Run on Python 3.10 with my Geo2Grid software (very small differences compared to Satpy default behavior). This is generating just true_color to geotiffs.

Chunk size of 2200

[memory usage plot]

Due to the preferred chunks, this resulted in chunks of (1650, 550), and it performed basically the same except it took a little longer:

Chunk size of 2048

[memory usage plot]

I then forced the chunk size to 2048 (removing the dask auto chunking) and it performed almost exactly like the 2200 plot at the top of this comment.

Then I did a crude resolution-based chunking. The main interesting thing here is the low memory usage in the first half of the processing. Execution time and peak memory are only slightly different. Note that my hypothesis here is that the geotiff writing isn't keeping up with the processing and it is holding on to chunks as GDAL compresses them and writes them to disk; that's where the memory usage is coming from.

Resolution-based chunking at 2200

[memory usage plot]

For my own reference: the remaining tests are using a different dataset (still full disk), but one with a lot more daytime data. Performance seems to be about the same for the above cases. So I then ran a different script which removed the geotiff saving, and this is what I see. About 2GB less memory usage. I'm pretty sure this is exactly what Geo2Grid is running, just without the writing (a "pass thru" destination for the dask arrays). BUT interestingly about the same amount of processing time...not sure how that works.

Resolution-based chunking at 2200 (no writing)

[memory usage plot]

Here is the non-resolution-based chunking again:

Chunk size of 2200 (no writing)

[memory usage plot]

So the above shows a similar response to the geotiff-written version of these plots: 1 to 2GB less memory when resolution-based chunking is used. I assume this covers the native resampling and then it plateaus. In the geotiff-written version we keep increasing as GDAL compresses and writes to disk.

Now, I noticed that we're producing 64-bit floats in the reader and I'm not sure that is necessary. Let's see what happens when we do 32-bit floats...

djhoese avatar Oct 02 '23 19:10 djhoese

Here's 32-bit floats, while also letting dask's "auto" chunking know we're using 32-bit floats. In my previous comments it was assuming 64-bit floats, which results in the 500m band being 2200 in the rows dimension (one segment) and 2200 in the columns. The 32-bit float change means 2200 in the rows and 4400 in the columns. Much larger chunks.

[memory usage plot]

Here is when I go back to the 64-bit-based chunks (2200, 2200), which gives smaller memory usage.

[memory usage plot]
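The dtype dependence of dask's "auto" chunking can be seen directly (illustrative shapes, not exact AHI sizes): dask sizes chunks against its `array.chunk-size` byte budget, so a float32 array gets larger chunk shapes than a float64 array under the same budget.

```python
import dask
import dask.array as da
import numpy as np

with dask.config.set({"array.chunk-size": "128MiB"}):
    f64 = da.zeros((22000, 22000), chunks="auto", dtype=np.float64)
    f32 = da.zeros((22000, 22000), chunks="auto", dtype=np.float32)

# Same byte budget, but float32 chunks hold twice as many elements,
# so their row/column extents come out larger
print(f64.chunks[0][0], f32.chunks[0][0])
```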

Uh, oh. I just realized the final result is 64-bit floats. Great. Now I need to figure out where that is happening.

djhoese avatar Oct 02 '23 19:10 djhoese

Ok, besides a few spots in the reader's calibration, the main jump to 64-bit floats is in the sunz correction. @simonrp84 and I had a conversation in slack (satpy channel). With that fixed: at least 15 seconds faster and at least 1GB less memory usage...oh, based on the top plot in my previous comment, maybe closer to 50s and 4GB better.

[memory usage plot]
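The kind of upcast described above is easy to trigger: in NumPy, dividing a float32 array by a float64 array (e.g. a cosine-of-solar-zenith array created with the default dtype) promotes the result to float64. A minimal illustration, not the actual Satpy correction code:

```python
import numpy as np

data = np.ones((3, 3), dtype=np.float32)  # 32-bit reflectances
cos_sza = np.full((3, 3), 0.5)            # default dtype: float64

corrected = data / cos_sza
print(corrected.dtype)                    # float64 (silent upcast)

# Keeping the correction factor in float32 preserves the input dtype
corrected32 = data / cos_sza.astype(np.float32)
print(corrected32.dtype)                  # float32
```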

djhoese avatar Oct 02 '23 20:10 djhoese