NCDatasets.jl

Specifying the fill value when reading a file

Open sethaxen opened this issue 3 years ago • 7 comments

If the _FillValue attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read with Union{Missing,T} eltypes, i.e. a missing fill value. If writing the NetCDF file, I can change _FillValue to change this behavior, but is there also a way to specify the default fill value when the file is opened in read-only mode?

My main use case is either having the default fill value be NaN or having a type union with Missing only if values are actually missing.

sethaxen avatar Aug 24 '22 09:08 sethaxen

If the _FillValue attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read with Union{Missing,T} eltypes, i.e. a missing fill value.

This should not be the case (unless a missing_value is defined). Can you give a (small) example where you see this behaviour? Currently we look for the _FillValue and missing_value attributes.

julia> using NCDatasets;

julia> # example from xarray
       typeof(NCDataset("/home/abarth/.local/lib/python3.8/site-packages/xarray/tests/data/example_1.nc")["temp"][:])
Array{Float32, 4}

Alexander-Barth avatar Aug 24 '22 11:08 Alexander-Barth

Ah, I don't have an example of this specific issue right now, but here's a perhaps related one. In this dataset, _FillValue is set to NaN, yet the eltype NCDatasets returns still has Missing:

julia> using NCDatasets

julia> f = Base.download("https://ndownloader.figshare.com/files/24067472");

julia> ds = NCDataset(f);

julia> g = ds.group["posterior"]["g"]
g (2 × 500 × 4)
  Datatype:    Float64
  Dimensions:  g_coef × draw × chain
  Attributes:
   _FillValue           = NaN

julia> eltype(Array(g))
Union{Missing, Float64}

julia> close(ds);

sethaxen avatar Aug 24 '22 11:08 sethaxen

OK, but in this example _FillValue is indeed set. We use missing because NaN does not work for integers, for example.

In your case, what you can do is one of the following:

julia> g = cfvariable(ds.group["posterior"],"g",fillvalue = nothing)
g (2 × 500 × 4)
  Datatype:    Float64
  Dimensions:  g_coef × draw × chain
  Attributes:
   _FillValue           = NaN

julia> eltype(g)
Float64

julia> g = ds.group["posterior"]["g"].var
g (2 × 500 × 4)
  Datatype:    Float64
  Dimensions:  g_coef × draw × chain
  Attributes:
   _FillValue           = NaN

julia> eltype(g)
Float64

The second approach ignores all CF conversions (add_offset, scale_factor and time conversion).
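For context, the CF conversions mentioned here amount to roughly the following transformation of the raw stored values. This is a minimal sketch in plain Julia, not NCDatasets' actual implementation; the helper name `decode_cf` and the sample numbers are hypothetical, and time conversion is left out.

```julia
# Hypothetical helper illustrating CF-style unpacking; not part of NCDatasets.
function decode_cf(raw::AbstractArray; fillvalue=nothing, scale_factor=1.0, add_offset=0.0)
    map(raw) do x
        if fillvalue !== nothing && isequal(x, fillvalue)
            # a raw value equal to the fill value is reported as missing
            missing
        else
            # packed value -> physical value
            x * scale_factor + add_offset
        end
    end
end

raw = Int16[100, -999, 250]   # made-up packed values as they might be stored on disk
decoded = decode_cf(raw; fillvalue=Int16(-999), scale_factor=0.5, add_offset=10.0)
```

Reading through `.var` corresponds to working with `raw` directly, which is why the element type stays free of Missing in that case.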

Alexander-Barth avatar Aug 24 '22 11:08 Alexander-Barth

fillvalue = nothing ignores the fillvalue that is set, right? What would be a good way to mimic the xarray behavior of converting missing data to NaN instead of missing, regardless of whether _FillValue is NaN or, say, -1? This would only make sense for floating-point data.

For the case where _FillValue is NaN, like your example, that would amount to the same thing as cfvariable(ds.group["posterior"],"g",fillvalue = nothing)

visr avatar Aug 24 '22 12:08 visr

Why choose Union{T,Missing} instead of Union{T,Float64} when _FillValue=NaN? For T<:Float64, there is then no type union, and for Int arrays, one avoids injecting missings when the file specified NaNs?

sethaxen avatar Aug 24 '22 12:08 sethaxen

fillvalue = nothing ignores the fillvalue that is set, right?

That is correct. If somebody wanted to extend cfvariable to support, e.g.,

cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)

where every occurrence of _FillValue gets replaced by sentinelvalue, that would be nice. (It is unlikely that I will find the time myself in the near term, and I am not sure whether this approach would be very convenient to use.)
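Until such a keyword exists, the idea can be approximated by hand on the raw (unconverted) floating-point values. The following is only a sketch under that assumption; `with_sentinel` is a hypothetical helper, not an NCDatasets function, and it does not apply add_offset or scale_factor.

```julia
# Hypothetical helper: replace the fill value in raw floating-point data by a sentinel.
function with_sentinel(raw::AbstractArray{T}, fillvalue, sentinel) where {T<:AbstractFloat}
    s = convert(T, sentinel)   # keep the element type of the data
    # isequal (rather than ==) also matches a fill value that is itself NaN
    return map(x -> isequal(x, fillvalue) ? s : x, raw)
end

raw = Float32[1.5, -1.0, 3.0]          # made-up data in which -1 marks a missing value
v = with_sentinel(raw, -1.0f0, NaN)    # the fill value is replaced by NaN32
```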

for Int arrays, one avoids injecting missings when the file specified NaNs?

I don't think that you can specify a _FillValue of NaN for an Int array. The error message would be "Not a valid data type or _FillValue type mismatch". Also, Julia quickly promotes Ints to floats when they are combined in an array with NaNs:

julia> vcat([1,2],[NaN])
3-element Vector{Float64}:
   1.0
   2.0
 NaN

This leads to issues similar to this one: https://github.com/pydata/xarray/issues/1194

If we used Union{T,Float64} when _FillValue=NaN, then a user would constantly need to check with both ismissing and isnan whether a value is valid, or keep the _FillValue from the NetCDF file around.
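To illustrate the double check, here is a small sketch in plain Julia with made-up data in which both kinds of marker occur; `isvalid_value` is a hypothetical name.

```julia
# Made-up vector that mixes both kinds of "invalid" markers
v = Union{Missing,Float64}[1.0, NaN, missing, 4.0]

# A value is usable only if it is neither missing nor NaN.
# ismissing is tested first so that isnan is only applied to actual numbers.
isvalid_value(x) = !ismissing(x) && !isnan(x)

valid = filter(isvalid_value, v)   # keeps only the usable entries
```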

Originally NCDatasets used DataArrays.jl, which got deprecated in favor of Union{T,Missing}. This blog post explains the rationale of this approach well. Missing is also used in other packages like DataFrames.jl.

We also have the function nomissing if you want to use a different sentinel value:

v_with_nan = nomissing(ds["var"][:],NaN)
v_with_nan = nomissing(ds["var"][:]) # error if there is a missing value
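For arrays that are already in memory, similar results can be sketched with Base functions only. The array `A` below is a stand-in for data read from a file, and `narrow` is a hypothetical helper for the use case of keeping Missing in the element type only when values are actually missing.

```julia
A = Union{Missing,Float64}[1.0, missing, 3.0]   # stand-in for values read from a file

# Replace missing by NaN; the result no longer has Missing in its eltype
B = coalesce.(A, NaN)

# Hypothetical helper: drop Missing from the eltype only when nothing is missing
narrow(A) = any(ismissing, A) ? A : convert(Array{nonmissingtype(eltype(A))}, A)

C = narrow(Union{Missing,Float64}[1.0, 2.0])   # no missing values present
D = narrow(A)                                  # contains a missing value, returned unchanged
```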

The type, which signals that an array may contain missing values, can also be used for dispatch:

method(a::Vector{Union{Missing,Float64}}) = fast_method(fill_missing(a))
method(a::Vector{Float64}) = fast_method(a)
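A runnable version of this pattern might look as follows. The definitions of `fill_missing` and `fast_method` are stand-ins chosen for illustration (replacing missing by 0.0 and summing), since the two helpers are not defined in this thread.

```julia
# Stand-ins (assumptions) for the two helpers named above
fill_missing(a) = coalesce.(a, 0.0)        # replace missing by 0.0
fast_method(a::Vector{Float64}) = sum(a)   # code path that requires plain floats

# Dispatch on the element type: only arrays that may contain missing get filled
method(a::Vector{Union{Missing,Float64}}) = fast_method(fill_missing(a))
method(a::Vector{Float64}) = fast_method(a)

r1 = method(Union{Missing,Float64}[1.0, missing, 2.0])
r2 = method([1.0, 2.0])
```

With this layout, arrays that cannot contain missing values skip the filling step entirely.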

If we used NaN as the only missing value (and substituted a different value when writing to a file), we would also need to consider its impact on memory size:

julia> Base.summarysize(Vector{Union{Int8,Missing}}(undef,100))
240

julia> Base.summarysize(Vector{Union{Int8,Float32}}(undef,100)) # for NaN32
540

julia> Base.summarysize(Vector{Union{Int8,Float64}}(undef,100))
940

Besides integers and floats, NCDatasets can also return arrays of Char, String, and DateTime. Having a NaN as the missing value among those looks weird to me.

That being said, in other packages that I wrote, I also use NaN as the missing value, because I can be sure to deal only with floating-point numbers. However, for NCDatasets, I think a more generic approach is better.

Previous discussion: https://github.com/Alexander-Barth/NCDatasets.jl/issues/132

Alexander-Barth avatar Aug 25 '22 11:08 Alexander-Barth

Thanks for the detailed answer. I agree that missing makes most sense as the default missing value.

Good to hear you'd be open to something like

cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)

Though I suspect for most cases this is also fine:

v_with_nan = nomissing(ds["var"][:],NaN)

visr avatar Aug 25 '22 11:08 visr