Specifying the fill value when reading a file
If the _FillValue attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read with Union{Missing,T} eltypes, i.e. a missing fill value. If writing the NetCDF file, I can change _FillValue to change this behavior, but is there also a way to specify the default fill value when the file is opened in read-only mode?
My main use case is either having the default fill value be NaN, or having a type union with Missing only if values are actually missing.
If the _FillValue attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read with Union{Missing,T} eltypes, i.e. a missing fill value.
This should not be the case (unless a missing_value is defined). Can you give a (small) example where you see this behaviour? Currently we look for the _FillValue and missing_value attributes.
julia> using NCDatasets;
# example from xarray
typeof(NCDataset("/home/abarth/.local/lib/python3.8/site-packages/xarray/tests/data/example_1.nc")["temp"][:])
Array{Float32, 4}
Ah, I don't have an example of this specific issue right now, but here's a perhaps related one. In this dataset, _FillValue is set to NaN, yet the eltype NCDatasets returns still has Missing:
julia> using NCDatasets
julia> f = Base.download("https://ndownloader.figshare.com/files/24067472");
julia> ds = NCDataset(f);
julia> g = ds.group["posterior"]["g"]
g (2 × 500 × 4)
Datatype: Float64
Dimensions: g_coef × draw × chain
Attributes:
_FillValue = NaN
julia> eltype(Array(g))
Union{Missing, Float64}
julia> close(ds);
OK, but in this example _FillValue is indeed set. We use missing because NaN does not work for integers, for example.
In your case, what you can do is one of the following:
julia> g = cfvariable(ds.group["posterior"],"g",fillvalue = nothing)
g (2 × 500 × 4)
Datatype: Float64
Dimensions: g_coef × draw × chain
Attributes:
_FillValue = NaN
julia> eltype(g)
Float64
julia> g = ds.group["posterior"]["g"].var
g (2 × 500 × 4)
Datatype: Float64
Dimensions: g_coef × draw × chain
Attributes:
_FillValue = NaN
julia> eltype(g)
Float64
The second approach ignores all CF conversions (add_offset, scale_factor and time conversion).
fillvalue = nothing ignores the fillvalue that is set, right? What would be a good way to mimic xarray behavior, to convert the missing data to NaN instead of missing? Regardless of whether _FillValue is NaN or say -1? This would only make sense for floating point data.
For the case where _FillValue is NaN, like your example, that would amount to the same thing as cfvariable(ds.group["posterior"],"g",fillvalue = nothing)
Why choose Union{T,Missing} instead of Union{T,Float64} when _FillValue=NaN? For T<:Float64, there is then no type union, and for Int arrays, one avoids injecting missings when the file specified NaNs?
fillvalue = nothing ignores the fillvalue that is set, right?
That is correct. If somebody wanted to extend cfvariable to support e.g.
cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)
where every _FillValue is replaced by sentinelvalue, that would be nice (it is unlikely that I will find the time myself in the near term, and I am not sure this approach would be very convenient to use).
for Int arrays, one avoids injecting missings when the file specified NaNs?
I don't think you can specify a _FillValue of NaN for an Int array. The error message would be Not a valid data type or _FillValue type mismatch. Also, Julia quickly promotes Ints to floats when they are combined in an array with NaNs:
julia> vcat([1,2],[NaN])
3-element Vector{Float64}:
1.0
2.0
NaN
This leads to issues similar to these: https://github.com/pydata/xarray/issues/1194
If we used Union{T,Float64} when _FillValue=NaN, a user would constantly need to check with both ismissing and isnan whether a value is valid, or keep the _FillValue from the NetCDF file around.
Originally NCDatasets used DataArrays.jl, which was deprecated in favor of Union{T,Missing}. This blog post explains the rationale of this approach well.
Missing is also used in other packages like DataFrames.jl.
We also have the function nomissing if you want to use a different sentinel value:
v_with_nan = nomissing(ds["var"][:],NaN)
v_with_nan = nomissing(ds["var"][:]) # error if there is a missing value
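For readers who want to try the idea without a NetCDF file at hand, here is a minimal sketch in plain Julia. It does not use nomissing; it relies on Base's coalesce instead, which is only an assumed stand-in for the same substitution. The array contents are made up for illustration.

```julia
# Stand-in for data read from a variable with a fill value (made-up values).
v = Union{Missing,Float64}[1.5, missing, 3.0]

# Replace every `missing` with NaN. Broadcasting narrows the element type,
# so the result no longer carries a Union with Missing.
v_with_nan = coalesce.(v, NaN)

eltype(v_with_nan)  # Float64
```

Note that this only makes sense for floating-point data, since NaN cannot be stored in an integer array.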
The type signaling that an array may contain missing values can also be used for dispatch.
method(a::Vector{Union{Missing,Float64}}) = fast_method(fill_missing(a))
method(a::Vector{Float64}) = fast_method(a)
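A runnable sketch of this dispatch pattern follows. The names total and fast_sum are hypothetical and stand in for method and fast_method above; dropping the missing entries with skipmissing is just one possible choice for what fill_missing could do.

```julia
# Hypothetical fast kernel that only accepts plain Float64 data.
fast_sum(a::Vector{Float64}) = sum(a)

# Arrays that may contain missing values: drop them first, then call the kernel.
total(a::Vector{Union{Missing,Float64}}) = fast_sum(collect(skipmissing(a)))

# Arrays that cannot contain missing values go straight to the kernel.
total(a::Vector{Float64}) = fast_sum(a)

total([1.0, 2.0])                                  # uses the Float64 method
total(Union{Missing,Float64}[1.0, missing, 2.0])   # uses the Missing-aware method
```

The element type alone selects the method, so no runtime scan for sentinel values is needed to decide which path to take.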
If we used NaN as the only missing value (and substituted it with a different value when writing to a file), we would also need to consider the impact on memory size:
julia> Base.summarysize(Vector{Union{Int8,Missing}}(undef,100))
240
julia> Base.summarysize(Vector{Union{Int8,Float32}}(undef,100)) # for NaN32
540
julia> Base.summarysize(Vector{Union{Int8,Float64}}(undef,100))
940
Besides integers and floats, NCDatasets can also return arrays of Char, String, and DateTime. Having NaN as the missing value among those looks weird to me.
That being said: in other packages that I wrote, I also use NaN as missing value, because I can be sure to deal only with floating point numbers. However, for NCDatasets, I think a more generic approach is better.
Previous discussion: https://github.com/Alexander-Barth/NCDatasets.jl/issues/132
Thanks for the detailed answer. I agree that missing makes most sense as the default missing value.
Good to hear you'd be open to something like
cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)
Though I suspect for most cases this is also fine:
v_with_nan = nomissing(ds["var"][:],NaN)