YAXArrays.jl
YAXArrays.jl copied to clipboard
Merging Cube/Dataset in a single "YAXArray"?
Hello!
I play with climate simulations, with Dataset/files being in separate Zarr directory. I'd like the combine them into a single YAXArray, but so far, I have not succeeded.
Here's a MWE:
hist_store = "gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/day/tasmax/gr1/v20180701/"
fut_store = "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM4/ssp245/r1i1p1f1/day/tasmax/gr1/v20180701/"
hist = Cube(open_dataset(zopen(hist_store)))
fut = Cube(open_dataset(zopen(fut_store)))
hist has the following dimensions:
288×180×60225 YAXArray{Float32,3} with dimensions:
Dim{:lon} Sampled{Float64} 0.625:1.25:359.375 ForwardOrdered Regular Points,
Dim{:lat} Sampled{Float64} -89.5:1.0:89.5 ForwardOrdered Regular Points,
Ti Sampled{CFTime.DateTimeNoLeap} CFTime.DateTimeNoLeap[CFTime.DateTimeNoLeap(1850-01-01T12:00:00), …, CFTime.DateTimeNoLeap(2014-12-31T12:00:00)] ForwardOrdered Irregular Points
units: K
name: tasmax
Total size: 11.63 GB
and fut have the following dimensions:
288×180×31390 YAXArray{Float32,3} with dimensions:
Dim{:lon} Sampled{Float64} 0.625:1.25:359.375 ForwardOrdered Regular Points,
Dim{:lat} Sampled{Float64} -89.5:1.0:89.5 ForwardOrdered Regular Points,
Ti Sampled{CFTime.DateTimeNoLeap} CFTime.DateTimeNoLeap[CFTime.DateTimeNoLeap(2015-01-01T12:00:00), …, CFTime.DateTimeNoLeap(2100-12-31T12:00:00)] ForwardOrdered Irregular Points
units: K
name: tasmax
Total size: 6.06 GB
So, they both have the same spatial dimensions sizes but have different number of time-steps. I'd like to have a single YAXArray:
288×180×91615 YAXArray{Float32,3} with dimensions:
Dim{:lon} Sampled{Float64} 0.625:1.25:359.375 ForwardOrdered Regular Points,
Dim{:lat} Sampled{Float64} -89.5:1.0:89.5 ForwardOrdered Regular Points,
Ti Sampled{CFTime.DateTimeNoLeap} CFTime.DateTimeNoLeap[CFTime.DateTimeNoLeap(1850-01-01T12:00:00),, …, CFTime.DateTimeNoLeap(2100-12-31T12:00:00)] ForwardOrdered Irregular Points
units: K
name: tasmax
Total size: 17.69 GB
However, I did not find a command where this can be achieved (tried DimStack from DimensionalData and `concatenatecubes``from YAXArrays)
since the time axis is different, you cannot have a YAXArray, what I would do and normally do, if they need to be in the same file is to put them into the same file as a Dataset, where the common axis will be shared.
https://juliadatacubes.github.io/YAXArrays.jl/dev/examples/generated/UserGuide/saving/#appending-to-a-dataset
be careful at overwriting.
ok, I will try that. The idea though is not to save the resulting YAXArray, but would have liked to do further analysis and manipulation and then save the results. I would have liked to combine them into a single time dimension since they do not overlap by design. Hence, timeseries manipulation would have been easier down the road.
It should be possible to achieve that but there might be no proper api for it. You could try to play around with the to_dataset and merge dataset functions. I try to post an example on Monday.
Playing with it on a computer I realised, that you should be able to use the cat function from Base. Unfortunately that doesn't work at the moment with your data because it is horribly slow. It is no doing the concatenation lazily and therefore tries to download the whole multi GB dataset.
This seems to be coming from a missing implementation of Base._cat for the underlying Zarr Array.
I don't understand the cat code in DimensionalData not well enough to better pinpoint. @rafaqz could you give a hint, what we would need to do to properly have cat for YAXArrays with Zarr data?
As a workaround you can construct the YAXArray yourself by concatenating the Zarr Arrays and then constructing the axes by hand, this is a bit brittle, because it depends on the time axis being on third position
julia> fulldata = cat(hist.data, fut.data, dims=3)
Disk Array with size 288 x 180 x 91615
julia> full = YAXArray((hist.axes[1], hist.axes[2], vcat(hist.axes[3], fut.axes[3])),fulldata)
288×180×91615 YAXArray{Float32,3} with dimensions:
Dim{:lon} Sampled{Float64} 0.625:1.25:359.375 ForwardOrdered Regular Points,
Dim{:lat} Sampled{Float64} -89.5:1.0:89.5 ForwardOrdered Regular Points,
Ti Sampled{CFTime.DateTimeNoLeap} CFTime.DateTimeNoLeap[CFTime.DateTimeNoLeap(1850-01-01T12:00:00), …, CFTime.DateTimeNoLeap(2100-12-31T12:00:00)] ForwardOrdered Irregular Points
Total size: 17.69 GB
The full error I get when I stop the execution of the cat is the following:
julia> @time full = cat(hist, fut, dims=Ti)
^C
ERROR: InterruptException:
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:920
[2] wait()
@ Base ./task.jl:984
[3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
@ Base ./condition.jl:130
[4] wait
@ ./condition.jl:125 [inlined]
[5] take_buffered(c::Channel{Pair{CartesianIndex{3}, Union{Nothing, Vector{UInt8}}}})
@ Base ./channels.jl:457
[6] take!
@ ./channels.jl:451 [inlined]
[7] readblock!(aout::Array{Float32, 3}, z::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, r::CartesianIndices{3, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}})
@ Zarr ~/.julia/packages/Zarr/jgFcc/src/ZArray.jl:172
[8] readblock!(::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, ::Array{Float32, 3}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64}})
@ Zarr ~/.julia/packages/Zarr/jgFcc/src/ZArray.jl:247
[9] getindex_disk(::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64}})
@ DiskArrays ~/.julia/packages/DiskArrays/Tb4bI/src/diskarray.jl:40
[10] getindex(::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::UnitRange{Int64})
@ DiskArrays ~/.julia/packages/DiskArrays/Tb4bI/src/DiskArrays.jl:42
[11] iterate(a::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore})
@ DiskArrays ~/.julia/packages/DiskArrays/Tb4bI/src/DiskArrays.jl:48
[12] macro expansion
@ ./multidimensional.jl:926 [inlined]
[13] _unsafe_setindex!(::IndexLinear, ::Array{Float32, 3}, ::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::UnitRange{Int64})
@ Base ./multidimensional.jl:939
[14] _setindex!
@ ./multidimensional.jl:916 [inlined]
[15] setindex!
@ ./abstractarray.jl:1397 [inlined]
[16] __cat_offset1!
@ ./abstractarray.jl:1811 [inlined]
[17] __cat_offset!(A::Array{Float32, 3}, shape::Tuple{Int64, Int64, Int64}, catdims::Tuple{Bool, Bool, Bool}, offsets::Tuple{Int64, Int64, Int64}, x::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, X::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore})
@ Base ./abstractarray.jl:1801
[18] __cat(::Array{Float32, 3}, ::Tuple{Int64, Int64, Int64}, ::Tuple{Bool, Bool, Bool}, ::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, ::Vararg{ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}})
@ Base ./abstractarray.jl:1797
[19] _cat_t(::Tuple{Int64}, ::Type{Float32}, ::ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, ::Vararg{ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}})
@ Base ./abstractarray.jl:1790
[20] _cat(catdims::Tuple{Ti{DimensionalData.Dimensions.LookupArrays.NoLookup{DimensionalData.Dimensions.LookupArrays.AutoIndex}}}, A1::YAXArray{Float32, 3, ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, Tuple{Dim{:lon, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:lat, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Ti{DimensionalData.Dimensions.LookupArrays.Sampled{CFTime.DateTimeNoLeap, Vector{CFTime.DateTimeNoLeap}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}}, As::YAXArray{Float32, 3, ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, Tuple{Dim{:lon, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:lat, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Ti{DimensionalData.Dimensions.LookupArrays.Sampled{CFTime.DateTimeNoLeap, Vector{CFTime.DateTimeNoLeap}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}})
@ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/methods.jl:267
[21] _cat(_catdims::Tuple{UnionAll}, A1::YAXArray{Float32, 3, ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, Tuple{Dim{:lon, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:lat, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Ti{DimensionalData.Dimensions.LookupArrays.Sampled{CFTime.DateTimeNoLeap, Vector{CFTime.DateTimeNoLeap}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}}, As::YAXArray{Float32, 3, ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, Tuple{Dim{:lon, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:lat, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Ti{DimensionalData.Dimensions.LookupArrays.Sampled{CFTime.DateTimeNoLeap, Vector{CFTime.DateTimeNoLeap}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}})
@ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/methods.jl:222
[22] _cat(::Type{Ti}, ::YAXArray{Float32, 3, ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, Tuple{Dim{:lon, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:lat, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Ti{DimensionalData.Dimensions.LookupArrays.Sampled{CFTime.DateTimeNoLeap, Vector{CFTime.DateTimeNoLeap}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}}, ::YAXArray{Float32, 3, ZArray{Float32, 3, Zarr.BloscCompressor, GCStore}, Tuple{Dim{:lon, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:lat, DimensionalData.Dimensions.LookupArrays.Sampled{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Float64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Ti{DimensionalData.Dimensions.LookupArrays.Sampled{CFTime.DateTimeNoLeap, Vector{CFTime.DateTimeNoLeap}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}})
@ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/methods.jl:225
[23] #cat#153
@ ./abstractarray.jl:1979 [inlined]
[24] top-level scope
@ ./timing.jl:273 [inlined]
[25] top-level scope
@ ./REPL[107]:0
Dont you just need the latest DiskArrays.jl? cat is now lazy. (readblock! should not be called or DD is not propagating cat properly)
Seems I never actually registered the patch bump 😅
Try adding main
DD is not propagating cat properly I just checked it on the DiskArrays main branch and it works for the data level but not when wrapped in a YAXArray.
The problem is, that DD is using _cat instead of cat and _cat is not implemented in DiskArrays. And therefore we hit some abstractarray _cat fallback which uses setindex. See [15] in the error above. The error with DiskArrays#main is basically the same as the one posted above.
Ok right that should be an easy fix, it will be bundled with the next breaking change, which is all about cat anyway.
This is the monthe we fix cat everywhere 😀
As a workaround you can construct the YAXArray yourself by concatenating the Zarr Arrays and then constructing the axes by hand, this is a bit brittle, because it depends on the time axis being on third position
Thanks!
Using the provided code, it does work without loading everything in memory as far as I can tell (note that the data is actually pre-downloaded on our servers, the gs link was provided for the MWE).
I will have to see if every simulation have the same dimension ordering (probably not!).
I will keep an eye on the lazy cat version. Thanks!
It is no doing the concatenation lazily and therefore tries to download the whole multi GB dataset.
Regarding the laziness, my files are on my cluster and I cat and YAXArray 3 datasets for a total of ~53GB. Took about 1 second and RAM consumption on the cluster is about 1GB. So, cat being non-lazy refers mainly to remote Datasets such as S3 and GS stores?
It is no doing the concatenation lazily and therefore tries to download the whole multi GB dataset.
Regarding the laziness, my files are on my cluster and I
catandYAXArray3 datasets for a total of ~53GB. Took about 1 second and RAM consumption on the cluster is about 1GB. So,catbeing non-lazy refers mainly to remote Datasets such as S3 and GS stores?
Did you use cat directly on the YAXArrays? Or did you use cat on the data and then wrap it in a YAXArray afterwards?
The second version should be lazily, because Zarr implements cat but this method of cat is not yet used by DimensionalData and therefore the first variant would not be lazy.
There should not be a difference between zarr files on the cluster and those on the web, but with data to download the lag in indexing is just more pronounced.
ah, I see! I did not understood the difference. Thanks!
I did use cat on the data (hist.data) and then wrapped it into a YAXArray, using the code you provided.
Yeah, it doesnt make sense that you have to do this. In a few days you wont have to when DD and DiskArray fixes are merged :)