Rasters.jl icon indicating copy to clipboard operation
Rasters.jl copied to clipboard

Add ZarrDataset extension

Open felixcremer opened this issue 1 year ago • 23 comments
trafficstars

This is a first draft for opening zarr data with ZarrDatasets in Rasters. I just copied the approach in the GRIBDatasets extension into a new extension.

This would still need some tests.

When I load the same data with this branch or with YAXArrays I get the following dimensions, hereby ds is the data loaded with YAXArrays and rs with Rasters. This is the ESDC tiny cube which is available here "https://s3.bgc-jena.mpg.de:9000/esdl-esdc-v3.0.2/esdc-16d-2.5deg-46x72x1440-3.0.2.zarr":

julia> dims(ds.burnt_area)
↓ lon Sampled{Float64} -178.75:2.5:178.75 ForwardOrdered Regular Points,
→ lat Sampled{Float64} -88.75:2.5:88.75 ForwardOrdered Regular Points,
↗ Ti  Sampled{DateTime} [1979-01-09T00:00:00, …, 2021-12-27T00:00:00] ForwardOrdered Irregular Points

julia> dims(rs.burnt_area)
↓ X  Mapped{Union{Missing, Float64}} [-178.75, -176.25, …, 176.25, 178.75] ForwardOrdered Regular Points,
→ Y  Mapped{Union{Missing, Float64}} [-88.75, -86.25, …, 86.25, 88.75] ForwardOrdered Regular Points,
↗ Ti Sampled{DateTime} [1979-01-09T00:00:00, …, 2021-12-27T00:00:00] ForwardOrdered Irregular Points

and for another example.:

julia> dims(ras)
↓ X  Mapped{Union{Missing, Float64}} [-15.998697916666666, -15.99609375, …, 10.662760416666666, 10.66536458333333] ForwardOrdered Explicit Intervals{Center},
→ Y  Mapped{Union{Missing, Float64}} [62.66536458333333, 62.662760416666664, …, 48.00390625, 48.001302083333336] ReverseOrdered Explicit Intervals{Center},
↗ Ti Sampled{Union{Missing, DateTime}} [DateTime("2017-06-05T12:00:00"), DateTime("2017-06-06T12:00:00"), DateTime("2017-06-07T12:00:00")] ForwardOrdered Explicit Intervals{Center}

julia> dims(yax)
↓ lon Sampled{Float64} -15.998697916666666:0.0026041666666666665:10.66536458333333 ForwardOrdered Regular Points,
→ lat Sampled{Float64} 62.66536458333333:-0.0026041666666666652:48.001302083333336 ReverseOrdered Regular Points,
↗ Ti  Sampled{DateTime} [2017-06-05T12:00:00, …, 2017-06-07T12:00:00] ForwardOrdered Irregular Points

I am not sure what are the expected dimensions for these datasets.

felixcremer avatar Apr 24 '24 15:04 felixcremer

Before we could merge this, we need to register ZarrDatasets

felixcremer avatar Apr 24 '24 18:04 felixcremer

The dimension name difference is because Rasters switches some common x/y/z names into explicit X/Y/Z to standardize algorithms (e.g. rasterize knows to automatically rasterize polygons into X/Y dimensions and not others).

There are some arguments for and against this, its a fairly random decision. But it definitely makes running the other alogorithms easier, and lets us check other things like broadcasts aren't mixing dimensions.

rafaqz avatar Apr 29 '24 16:04 rafaqz

The name differences is something that I would be fine with, but others are a bit more hesitant to have. I got a lot of issues in YAXArrays because with the switch to DimensionalData we automatically converted Time axes to Ti. Maybe this could be a keyword argument when people are very keen on keeping their axes names or we could make the detected dimension a subtype of XDim without changing the name?

The other thing that I saw is that YAXArrays makes points and Rasters makes Intervals. And YAXArrays is using more Ranges instead of vectors also the types of the Axes seem to be off with the Rasters version I would not expect to have Missing values in the dimensions.

I am going to explore this further, but I rebased this today on main and the overhaul keywords PR broke some things which I am going through at the moment.

felixcremer avatar May 02 '24 14:05 felixcremer

It was not your PR but I made a dirty rebase. This is fixed now.

felixcremer avatar May 02 '24 15:05 felixcremer

Codecov Report

Attention: Patch coverage is 44.44444% with 10 lines in your changes are missing coverage. Please review.

Project coverage is 82.43%. Comparing base (a15ebb1) to head (98d837c). Report is 13 commits behind head on main.

:exclamation: Current head 98d837c differs from pull request most recent head ff3fad8

Please upload reports for the commit ff3fad8 to get more accurate results.

Files Patch % Lines
ext/RastersZarrDatasetsExt/zarrdatasets_source.jl 0.00% 10 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #640      +/-   ##
==========================================
+ Coverage   82.32%   82.43%   +0.10%     
==========================================
  Files          60       61       +1     
  Lines        4357     4480     +123     
==========================================
+ Hits         3587     3693     +106     
- Misses        770      787      +17     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 02 '24 20:05 codecov[bot]

Missing values in the dimensions may be a CommonDataModel/NCDatsets bug, I think its forbidden in the CF standard too. Rasters would not introduce that.

Rasters will only give you Intervals if there is a bounds variable or some other metadata indicating intervals. My guess is YAX just doesn't check. But you really want intervals if you have them, so Contains and things that depend on it like extract work. Intervals let you assume the array value is the mean for the whole pixel, but you cant really do that for points.

We could use ranges instead of vectors for regular, but you do introduce the occasional floating point error.

rafaqz avatar May 02 '24 21:05 rafaqz

Oh right my issue for the missing values was closed: https://github.com/Alexander-Barth/NCDatasets.jl/issues/185

We could just run map(identity, lookup) over the vectors before using them to remove Missing.

@Alexander-Barth I think we really more control over missing value handling in CommonDataModel in general. Is there a way to load dim variables to force ignoring the _FillValue ?. It seems people don't want the Missing type in their CF dimension variables. And allocating a vector to add the Missing type, then another one to remove it again seems unnecessary to me.

@felixcremer also note the problem comes from incorrect files that have _FillValue properties for dim variables when CF disallows them.

rafaqz avatar May 02 '24 21:05 rafaqz

#655 fixes the missing problem on the Rasters side for now

rafaqz avatar May 02 '24 21:05 rafaqz

@felixcremer also note the problem comes from incorrect files that have _FillValue properties for dim variables when CF disallows them.

In Zarr the FillValue does not necessarily mean that it is missing. It is the default value that you would get when a chunk is not written. This is often used to mean missing and in Zarr.jl there is also a keyword fillvalueasmissing.

Yes I think, that YAX does not really check for the bounds variables and it only discards them.

I am going to look into properly testing this tomorrow.

felixcremer avatar May 02 '24 21:05 felixcremer

Ok in that case we really need a way to stop CommonDataModel from replacing the FillValue with missing at all.

I'm going to overhaul missing handling and CF standards application soon, I don't like everything from CommonDataModel automatically replacing missing everywhere.

rafaqz avatar May 02 '24 22:05 rafaqz

In Zarr the FillValue does not necessarily mean that it is missing. It is the default value that you would get when a chunk is not written. This is often used to mean missing and in Zarr.jl there is also a keyword fillvalueasmissing.

Is that not analogous to NetCDF with unlimited dimensions? If time is an unlimited dimension and when you write only the second time slice, the first time will be treated with full of fill values.

You can ignore the defined FillValue by setting it to nothing:

fname = tempname()
ds = NCDataset(fname,"c")
defDim(ds,"lon",100)
v = defVar(ds,"lon",Float32,("lon",),fillvalue = 9999.)
close(ds)

ds = NCDataset(fname)
eltype(ds["lon"])
# Union{Missing, Float32}


ds = NCDataset(fname)
eltype(cfvariable(ds,"lon",fillvalue=nothing))
# Float32

(this code works with ZarrDatasets after this change)

As I understand, the Zarr spec, fill_value should not be set when it is not used:

fill_value A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

However, it seems that the Zarr array here set the fill value for the coordinate dimensions (despite not being used).

Alexander-Barth avatar May 03 '24 09:05 Alexander-Barth

You can ignore the defined FillValue by setting it to nothing:

We really need this at the point of loading the file or when getting a specific variable, for example by passing a keyword. Its not really practical to ask people to modify a file on disk to change how it loads.

rafaqz avatar May 03 '24 09:05 rafaqz

In my example, you do not need to modify the data on disk. fillvalue is a keyword argument of the function cfvariable. Setting it to nothing, makes NCDatasets to ignore its value on disk.

Alexander-Barth avatar May 05 '24 05:05 Alexander-Barth

A right I have to stop reading these on mobile. We can allow setting that from Rasters fairly easily.

rafaqz avatar May 05 '24 07:05 rafaqz

In this commit https://github.com/JuliaGeo/ZarrDatasets.jl/commit/a84ef0ae0d5d4d8e2036d763a2375c8256ef0da8 ZarrDatasets no longer considers Zarr fill_value the same as the CF _FillValue for coordinate variables. I think that this should help with the present issue.

(This issue was helpful to me to understand the differences between Zarr fill_value as CF _FillValue: https://github.com/pydata/xarray/issues/5475#issue-922804256 )

Alexander-Barth avatar May 07 '24 16:05 Alexander-Barth

Great! If also added fixes to Rasters to avoid Missing in lookups

rafaqz avatar May 07 '24 17:05 rafaqz

The missing values in the lookups is fixed. But now I am a bit confused by the Locus that is detected. For the GLDAS dataset http://tinyurl.com/GLDAS-NOAH025-3H the time bounds are correctly recognized and the dimensions are set as intervals. The locus of the time axis is set to Center but according to the metadata and also to the values in the time_bnds it should be the end of the interval. Here gldas is the dataset opened as a raster and gldasz is the same data opened as a Zarr.Zarray.

How is the Locus set when opening a dataset?

julia> tmb = gldasz.arrays["time_bnds"][:,:]
2×6 Matrix{Float64}:
 1.15717e7  1.15718e7  1.1572e7   1.15722e7  1.15724e7  1.15726e7
 1.15718e7  1.1572e7   1.15722e7  1.15724e7  1.15726e7  1.15727e7

julia> tm = gldasz.arrays["time"][:]
6-element Vector{Float64}:
 1.157184e7
 1.157202e7
 1.15722e7
 1.157238e7
 1.157256e7
 1.157274e7

julia> d .+ Minute.(tmb[1, :])
6-element Vector{DateTime}:
 2022-01-01T00:00:00
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00

julia> d .+ Minute.(tm[:])
6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00

julia> d .+ Minute.(tmb[2, :])
6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00

julia> dims(gldas, Ti)
Ti Sampled{DateTime} ForwardOrdered Explicit DimensionalData.Dimensions.Lookups.Intervals{DimensionalData.Dimensions.Lookups.Center}
wrapping: 6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00
julia> gldasz.attrs["tavg definition:"]
"past 3-hour average"

felixcremer avatar May 13 '24 13:05 felixcremer

Ok its always assumed to be the center in CF standards, but I guess when a bounds matrix is provided we need to actually check that (and currently we dont). I have no idea what to do if its some other fractional position besides start/center/end...

If you have time to add the check based on bounds matrix values that would help, see

Probably all of this logic needs some kind of overhaul. https://github.com/rafaqz/Rasters.jl/blob/main/src/sources/commondatamodel.jl#L333-L350

One complication is we really don't want to get Explicit bounds matrix spans if we can have Regular because Regular is so much easier to work with.

CF says:

If bounds are not provided, an application might reasonably assume the gridpoints to be at the centers of the cells, but we do not require that in this standard.

So I've forgotten about the corrolary of that - when bounds are provided the gridpoints can be anywhere

rafaqz avatar May 13 '24 14:05 rafaqz

The problem is, with Zarr data we will for now always get explicit, because there is no standard yet that defines how to save the Regular dimensions in the zarr format. See the discussion in the geozarr standard repository https://github.com/zarr-developers/geozarr-spec/issues/17.

Or does Rasters try to convert an explicit Vector to a regular representation? What is the locus information used for when the data bounds are explicit?

felixcremer avatar May 13 '24 14:05 felixcremer

~~Oh that's the same with netcdf, bounds matrix is optional.~~ Oops I misread now for not. That is a pain, but we do try and check is Regular is ok anyway.

Yes we try to convert to Regular. But in hindsight I think it will be wrong in some edge cases that shouldn't exist, but could, and when the locus is not Center.

Butt when there is no bounds matrix according to CF we have to assume Center because as far as I know there is no metadata to indicate that position? Maybe zarr does specify that?

(I'm not sure that thread is relevant here, the bounds matrix is not the same as transformation variables... edit: because we already have this same problem in netcdf and just check for regularity... But also good to see the CF spec is as overwhelming for everyone else as it is for me...)

rafaqz avatar May 13 '24 15:05 rafaqz

I rebased the branch and add a check for the Locus to see whether the index is at one side of the boundaries and to set Start or End accordingly. I am not sure, whether we should really check whether the Center is in the middle between the boundaries, because what would we do with time axis where the index is at the 15th of the month and the end might be 30 or 31 and therefore the exact middle would always be off. I am not sure, how this is handled in other tools. Also I am not sure whether we would need the same check for the regular span or whether we could assume Center there.

felixcremer avatar May 15 '24 10:05 felixcremer

We always have to special case date time, so I think that's ok

rafaqz avatar May 15 '24 17:05 rafaqz

I am currently working on making the writevar! function backend agnostic so that we can use it for writing Zarr and NCDatasets this is currently broken. In the future I would also like to add some round trip tests where we load a netcdf, save it to zarr and open it up again and then save it back to netcdf and this should give the same netcdf file.

The documenter build is failing because of download issues with WorldClim and CI gave some strange malloc failure but I am hoping, that this was just a github action glitch.

felixcremer avatar May 16 '24 15:05 felixcremer

I'm writing some major changes to how cf standards, missing values etc are handled accross all the backends.

The will have pretty major clashes with what you are doing here - like in writevar. Any chance we can get this merged relatively soon before I do too much more of that PR?

rafaqz avatar Jun 16 '24 22:06 rafaqz

Any updates? would be good to merge this

rafaqz avatar Aug 02 '24 07:08 rafaqz

I'll have a look today. I am going to remove the write stuff, and focus just on reading zarr data for now, so that we can merge it soon.

felixcremer avatar Aug 02 '24 07:08 felixcremer

This is currently failing because of this test:

            @testset "non allowed values" begin
                @test_throws ArgumentError write(filename, convert.(Union{Missing,Float16}, ncarray); force=true)
            end

What is the purpose of this test? The Errror changed from an ArgumentError to a MethodError because there is no fillvalue(Float16) from CommonDataModel. I am not sure, whether this is something that fails now because of my changes. Apart from that I removed the write function for ZarrDatasets and we should try to make that in the future.

felixcremer avatar Aug 02 '24 15:08 felixcremer

Its just to make sure there is a sensible argument error rather than a random method error users don't understand.

We need to say "can't write $T" explicitly

rafaqz avatar Aug 03 '24 06:08 rafaqz

Than this is something, that I would address in this PR.

felixcremer avatar Aug 03 '24 08:08 felixcremer

Yeah, I'm not sure where it threw the ArgumentError before but looks like it did

rafaqz avatar Aug 03 '24 09:08 rafaqz