
Experimenting with chunk size in new Zarr

jacobbieker opened this issue 2 years ago · 1 comment

We are switching to a newer, compressed Zarr store for the satellite data. Currently the data is saved as an individual file per timestep, so we are collating it and deciding which chunk sizes work best for loading quickly off disk. Note: memory usage grows substantially as the written chunks get larger; writing HRV as a single chunk per timestep filled up Donatello's 512 GB of RAM and crashed, whereas splitting each timestep into a 4x4 spatial grid lets 16 workers run concurrently without memory issues. The test data is the collated data for a random week in June 2018, chosen because it should have plenty of sunlight and therefore more varied timesteps than other parts of the year.
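For context, the collation step looks roughly like the sketch below. This is a minimal outline rather than the real satip collation code: the per-timestep file list and glob pattern are placeholders, and the JPEG-XL encoding settings are left out.

from glob import glob

import xarray as xr
from satip.jpeg_xl_float_with_nans import JpegXlFloatWithNaNs  # noqa: F401  importing registers the codec

# Placeholder glob: one compressed Zarr per timestep for the week being collated.
timestep_paths = sorted(glob("/mnt/storage_ssd_4tb/per_timestep_zarr/hrv_2018060*.zarr"))

# Lazily open everything and concatenate along time.
ds = xr.open_mfdataset(timestep_paths, engine="zarr", combine="nested", concat_dim="time")

# Drop the per-file chunk encoding, then rechunk to the 4x4 spatial grid,
# i.e. the (1, 1044, 1392, 1) layout discussed above. to_zarr writes each
# dask chunk as its own task, so multiple workers can write concurrently.
ds["data"].encoding.clear()
ds["data"] = ds["data"].chunk((1, 1044, 1392, 1))

# In the real pipeline an encoding dict with the JpegXlFloatWithNaNs compressor
# would be passed here to keep the JPEG-XL compression.
ds.to_zarr("/mnt/storage_ssd_4tb/weekly_zarr/hrv_201806010000_collated_4.zarr", mode="w", consolidated=True)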

I've done some benchmarking on Donatello with no other processes running. collated.zarr has a chunk size of (1, 2088, 2784, 1), collated_4.zarr of (1, 1044, 1392, 1), collated_4_6.zarr of (1, 1044, 928, 1), and collated_8.zarr of (1, 522, 696, 1), i.e. each timestep's HRV frame split into a 2x2, 4x4, 4x6, and 8x8 spatial grid respectively.

import timeit

import xarray as xr
from satip.jpeg_xl_float_with_nans import JpegXlFloatWithNaNs  # noqa: F401  importing registers the JPEG-XL codec so xarray can decode the data

base_url = "/mnt/storage_ssd_4tb/weekly_zarr/"

zarr_paths = [base_url+"hrv_201806010000_collated.zarr", 
              base_url+"hrv_201806010000_collated_4.zarr", 
              base_url+"hrv_201806010000_collated_4_6.zarr", 
              base_url+"hrv_201806010000_collated_8.zarr",] 

def test_open(zarr_path):
    # Lazily open the store; no chunk data is read yet.
    return xr.open_zarr(str(zarr_path))

def test_load(read_ds):
    # Read the full week into memory.
    read_ds.load()

def test_xy_block(read_ds):
    # Read a 512 x 1024 pixel spatial crop across all timesteps.
    ds_rect = read_ds["data"].isel(x_geostationary=slice(1024, 1536), y_geostationary=slice(1024, 2048))
    _ = ds_rect.values

def test_time_block(read_ds):
    # Read 16 consecutive timesteps over the full spatial extent.
    ds_rect = read_ds["data"].isel(time=slice(4, 20))
    _ = ds_rect.values

for zarr_path in zarr_paths:
    print(f"Testing: {zarr_path}")
    print("Opening: " + str(timeit.timeit("test_open(zarr_path)", number=5, globals=globals())))
    read_ds = test_open(zarr_path)
    print("Loading: " + str(timeit.timeit("test_load(read_ds)", number=5, globals=globals())))
    # Re-open so the block reads below hit the disk rather than the already-loaded data.
    read_ds = test_open(zarr_path)
    print("XY Block: " + str(timeit.timeit("test_xy_block(read_ds)", number=5, globals=globals())))
    print("Time Block: " + str(timeit.timeit("test_time_block(read_ds)", number=5, globals=globals())))

Results:

Testing: /mnt/storage_ssd_4tb/weekly_zarr/hrv_201806010000_collated.zarr
Opening: 0.059641293715685606
Loading: 58.61230262508616
XY Block: 42.23793370788917
Time Block: 2.1586094200611115
Testing: /mnt/storage_ssd_4tb/weekly_zarr/hrv_201806010000_collated_4.zarr
Opening: 0.05839049303904176
Loading: 60.35184264695272
XY Block: 45.3515136949718
Time Block: 2.3401300320401788
Testing: /mnt/storage_ssd_4tb/weekly_zarr/hrv_201806010000_collated_4_6.zarr
Opening: 0.050482844933867455
Loading: 74.40115682408214
XY Block: 20.16681743785739
Time Block: 3.0453611770644784
Testing: /mnt/storage_ssd_4tb/weekly_zarr/hrv_201806010000_collated_8.zarr
Opening: 0.050295036751776934
Loading: 193.9114824021235
XY Block: 77.03730340581387
Time Block: 7.912745867855847

So far, it looks like a chunk size of (1, 1044, 1392, 1) might be the best trade-off: it lets plenty of workers write to disk concurrently, loads almost as fast as the larger-chunked data, and is still quite a bit faster to load than the smaller chunk sizes. @peterdudfield and @JackKelly, are you happy with that? If so I can kick off the collation process with that chunk size, or I can run some more tests.
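For anyone reproducing this, a quick way to sanity-check the chunk layout of a written store is sketched below; it assumes the paths above and the data variable being named data.

import zarr

store = zarr.open("/mnt/storage_ssd_4tb/weekly_zarr/hrv_201806010000_collated_4.zarr", mode="r")
print(store["data"].chunks)   # expect (1, 1044, 1392, 1)
print(store["data"].nchunks)  # total number of chunks in the array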

jacobbieker · Aug 19 '22

sounds good.

peterdudfield · Aug 23 '22