
YAXArrays seems to download too much data

Open SimonDanisch opened this issue 1 year ago • 2 comments

I'm trying the example from the docs:

using Zarr, YAXArrays, Dates, DimensionalData

store = "gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/"
g = open_dataset(zopen(store, consolidated=true))
c = g["tas"]
ct = c[Ti=At(Date("2018-08-01"):Day(10):Date("2050-08-01"))]

in_memory = ct.data[:, :, :]

This takes a really long time and fills up all my RAM (32 GB). Some details:

The selected slice:

image

Download speed of the julia process

image

I was expecting it to download only the 328 MB of the selected slice, but judging from the download speed and RAM usage, I suspect it is downloading much more data, which makes it almost impossible to fetch this part of the dataset. Am I missing something, is this a bug, or is it just a limitation of the package?
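As a quick sanity check on the expected size, here is a minimal sketch of the arithmetic. It assumes the 384 x 192 lon/lat grid and Float32 element type shown in the array summary further down this thread; the helper names `n_steps` and `size_bytes` are made up for illustration and are not part of YAXArrays or Zarr.

```python
from datetime import date

# Grid size and element type are assumptions taken from the array summary
# in this thread: 384 x 192 lon/lat points, Float32 = 4 bytes per value.
N_LON, N_LAT, BYTES_PER_VALUE = 384, 192, 4


def n_steps(start: date, stop: date, step_days: int) -> int:
    """Number of dates in start:step_days:stop, with both ends inclusive."""
    return (stop - start).days // step_days + 1


def size_bytes(n_time: int) -> int:
    """Uncompressed in-memory size of an N_LON x N_LAT x n_time array."""
    return N_LON * N_LAT * n_time * BYTES_PER_VALUE


start, stop = date(2018, 8, 1), date(2050, 8, 1)
every_10_days = size_bytes(n_steps(start, stop, 10))
every_day = size_bytes(n_steps(start, stop, 1))

print(f"one value every 10 days: {every_10_days / 2**20:.1f} MiB")
print(f"one value every day:     {every_day / 2**30:.2f} GiB")
```

These are uncompressed in-memory sizes in binary units (MiB, GiB). The number of bytes actually transferred depends on how the store is chunked and compressed, so it can differ from both figures.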

SimonDanisch, Jan 11 '24 15:01

One thought after reading the example (I might be wrong, though).

Depending on how the Zarr store on Google Cloud is chunked, the requested slice may still require downloading the whole dataset between 2018 and 2050, and probably a little more at the 2018 and 2050 edges, because a chunk has to be fetched in full even if only one time step in it is needed. The whole dataset between 2018 and 2050 is 3.21 GB. Is that closer to what you measured?

c = g["tas"]
ct = c[Ti=At(Date("2018-08-01"):Date("2050-08-01"))]
384×192×11689 YAXArray{Float32,3} with dimensions: 
  Dim{:lon} Sampled{Float64} 0.0:0.9375:359.0625 ForwardOrdered Regular Points,
  Dim{:lat} Sampled{Float64} Float64[-89.28422753251364, -88.35700351866494, …, 88.35700351866494, 89.28422753251364] ForwardOrdered Irregular Points,
  Ti Sampled{DateTime} DateTime[2018-08-01T00:00:00, …, 2050-08-01T00:00:00] ForwardOrdered Irregular Points
units: K
name: tas
Total size: 3.21 GB
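To illustrate the chunking argument above, here is a rough sketch of how a strided time selection maps onto chunks. Everything in it is an assumption for illustration: the data is taken to be 3-hourly (going by the "3hr" in the store path), the chunk lengths along time are hypothetical because the real chunk layout of the store was not checked here, and `touched_chunks` and `read_amplification` are made-up helper names, not Zarr or YAXArrays functions.

```python
def touched_chunks(n_time: int, stride: int, chunk_len: int) -> list[int]:
    """Ids of the time chunks hit by indices 0, stride, 2*stride, ... < n_time."""
    return sorted({i // chunk_len for i in range(0, n_time, stride)})


def read_amplification(n_time: int, stride: int, chunk_len: int) -> float:
    """Time steps that must be fetched divided by time steps actually requested."""
    requested = len(range(0, n_time, stride))
    # Every touched chunk is fetched in full; the last chunk may be shorter.
    fetched = sum(
        min(chunk_len, n_time - c * chunk_len)
        for c in touched_chunks(n_time, stride, chunk_len)
    )
    return fetched / requested


# Assumption: 3-hourly steps from 2018-08-01T00 to 2050-08-01T00 inclusive.
STEPS_PER_DAY = 8
n_time = 11688 * STEPS_PER_DAY + 1
stride = 10 * STEPS_PER_DAY  # one value every 10 days

for chunk_len in (1, 40, 80, 600):  # hypothetical chunk lengths along time
    n_touched = len(touched_chunks(n_time, stride, chunk_len))
    factor = read_amplification(n_time, stride, chunk_len)
    print(f"chunk_len={chunk_len}: {n_touched} chunks touched, "
          f"{factor:.1f}x more time steps fetched than requested")
```

Under these assumptions, once the chunk length along time reaches the stride, every chunk in the range is touched, so the strided selection costs about as much as reading the full range. This ignores compression and any chunking along lon/lat.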

Balinus, Jan 12 '24 18:01

Note that I tried the same approach in Python, and it seems to behave similarly.

(In Python, I selected the whole time series between 2018 and 2050 for simplicity.)

import xarray as xr
import zarr

file = 'gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/'
ds = xr.open_dataset(file, engine='zarr')

c = ds.tas
ct = c.sel(time=slice("2018-08-01", "2050-08-01"))
%time ct.values

CPU times: user 3min 19s, sys: 1min 29s, total: 4min 49s
Wall time: 21min 58s
Out[12]:
array([[[216.41226, 216.48257, 216.44742, ..., 216.32828, 216.38297,
         216.40054],



Balinus, Jan 12 '24 19:01