NCDatasets.jl
Is it possible to not use `missing`?
I've been looking through the code but can't quite find where this happens.
Always replacing _FillValue with missing is mostly useful, but sometimes you don't want missing in an array (e.g. on GPU). Is it possible to still get the other CF transforms but not replace the _FillValue?
Would this not lead to ambiguities? For example, suppose the data on disk is [10,100], the scale factor is 10 and the _FillValue is 100. If you do not replace the _FillValue, you would get [100,100], if I understand your proposal correctly.
For your information, here is the place where the substitution is done: https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/src/cfvariable.jl#L505
Ok that makes sense.
The _FillValue would probably need to be transformed as well, I guess, which should remove that ambiguity?
Another option is choosing a fill value, like NaN, etc.
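To make the ambiguity concrete, here is a minimal sketch in plain Julia (no NCDatasets calls) using the numbers from the example above. The variable names are made up for illustration, and the arithmetic assumes only a scale factor with no offset.

```julia
# Hypothetical packed values as stored on disk, with their CF attributes
raw = [10, 100]
scale_factor = 10
fill_value = 100   # _FillValue, expressed in packed (on-disk) units

# Scale the valid entries but leave fill entries untouched: ambiguous result
untransformed = [x == fill_value ? x : x * scale_factor for x in raw]

# Scale the fill value as well: fill entries stay distinguishable
transformed_fill = fill_value * scale_factor
transformed = [x == fill_value ? transformed_fill : x * scale_factor for x in raw]

# Substitute a user-chosen value such as NaN instead
with_nan = [x == fill_value ? NaN : float(x * scale_factor) for x in raw]
```

In the first case both entries end up as 100, so the valid value and the fill value can no longer be told apart. In the other two cases the fill entry stays recognizable.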
It would be good to be able to set the missing value before a file is loaded, for a few reasons. Other values can be much faster than Missing on CPUs in a modelling context, and being able to copy directly to GPU without preprocessing is useful.
When you are using a stream of thousands of files, overall much larger than memory, being able to apply things like this lazily at load time is useful.
If you want to propose an interface for using a different value for _FillValue, this would be helpful. Currently one needs to do this:
using NCDatasets
ds = NCDataset("file.nc");
float_var = nomissing(ds["float_var"][:,:],9999.) # or coalesce
int_var = nomissing(ds["int_var"][:,:],123)
close(ds)
I am not sure if you are aware of the NCDatasets.nomissing function.
Thanks, I wasn't aware of nomissing. But I was hoping this could happen without both conversion steps and the extra memory allocations that requires.
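For reference, the two-step pattern can be illustrated with plain Julia arrays. This sketch uses only Base `coalesce` on a hand-written array. It does not call NCDatasets, and the values are invented.

```julia
# Array as it comes back with the default behaviour: fill values read as `missing`
with_missing = [1.0, missing, 3.0]   # Vector{Union{Missing, Float64}}

# Second pass: substitute a sentinel, which allocates a new Vector{Float64}
no_missing = coalesce.(with_missing, 9999.0)
```

The second array is a separate allocation, which is the overhead I would like to avoid when streaming many files.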
We could use a keyword argument like fillvalue to a method like read:
read(ds["float_var"]; fillvalue=NaN)
And the default fillvalue would be missing.
And for reading a subset, should this be like read(ds["float_var"],:,:,1; fillvalue=NaN), which is a bit similar to view(array,:,:,1)?
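A rough sketch of how such a keyword could behave, written against a made-up stand-in type rather than the real NCDatasets variable. `PackedVar` and `read_sketch` are hypothetical names, and only the `fillvalue` keyword with its `missing` default comes from the proposal above.

```julia
# Hypothetical stand-in for a packed on-disk variable (not an NCDatasets type)
struct PackedVar{T,N}
    raw::Array{T,N}        # packed values as stored on disk
    fill::T                # _FillValue in packed units
    scale_factor::Float64
    add_offset::Float64
end

# Hypothetical reader: unpack and substitute `fillvalue` in a single pass
function read_sketch(v::PackedVar, I...; fillvalue=missing)
    sub = isempty(I) ? v.raw : v.raw[I...]
    return [x == v.fill ? fillvalue : x * v.scale_factor + v.add_offset for x in sub]
end

v = PackedVar(Int16[1 2; -999 4], Int16(-999), 0.5, 0.0)
full_nan = read_sketch(v; fillvalue=NaN)        # whole array, NaN for fill entries
column   = read_sketch(v, :, 1; fillvalue=NaN)  # subset, like view(array, :, 1)
default  = read_sketch(v)                       # default keeps `missing`
```

With `fillvalue=NaN` the result is a plain `Float64` array, while the default still gives a `Union{Missing,Float64}` element type.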
For information, there is also the experimental/un-exported in-place NCDatasets.load! function:
# NCDatasets.load!(ncvar::Variable, data, indices)
ds = Dataset("file.nc")
ncv = ds["vgos"].var;
# data must have the right shape and type
data = zeros(eltype(ncv),size(ncv));
NCDatasets.load!(ncv,data,:,:,:)
# loading a subset
data = zeros(5); # must have the right shape and type
NCDatasets.load!(ds["temp"].var,data,:,1) # loads the 1st column
close(ds) # close only after the last read
Maybe load! should have been called read!...
Yes something like that.
I forgot it was load for NCDatasets. ArchGDAL and GeoData use read, as do HDF5 and Base Julia. But some other packages seem to use load as well, so it's not totally clear cut. Whichever it is, it could also have an argument like cf to specify whether the CF transformations are applied?
load(var, I...; cf=true, fillvalue=NaN, some_other_kw=x)
load!(var, data, I...; cf=true, fillvalue=NaN, some_other_kw=x)
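As a sketch of what the in-place variant might do, here is a version on plain arrays. `load_sketch!` and the `fill`, `scale_factor` and `add_offset` keywords are hypothetical. In a real implementation those values would presumably come from the variable's attributes rather than from keywords.

```julia
# Hypothetical in-place loader; `raw` stands in for the packed on-disk data
function load_sketch!(data::AbstractArray, raw::AbstractArray, I...;
                      cf::Bool=true, fillvalue=NaN,
                      fill=-999, scale_factor=1.0, add_offset=0.0)
    src = isempty(I) ? raw : view(raw, I...)
    if cf
        # Fused broadcast: unpack and substitute the fill value without temporaries
        data .= ifelse.(src .== fill, fillvalue, src .* scale_factor .+ add_offset)
    else
        # cf=false: copy the packed values unchanged
        data .= src
    end
    return data
end

raw = Int16[1 2; -999 4]
data = zeros(2)
load_sketch!(data, raw, :, 1; scale_factor=0.5)   # first column, CF applied
unpacked = copy(data)
load_sketch!(data, raw, :, 1; cf=false)           # first column, raw values
packed = copy(data)
```

Since `data` is preallocated by the caller, nothing with a `Missing` element type is ever created.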
But thinking about it, this isn't actually a complete solution. With DiskArrays.jl chunking we may want this information stored as fields of the object, so that each chunk is loaded with the right transformation, with a struct like this:
struct NCDiskArray{HC,EC,FV}
    haschunks::HC
    eachchunk::EC
    fillvalue::FV
    cf::Bool
end
Or something.
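To illustrate why storing the settings on the object helps, here is a toy version that does not depend on DiskArrays.jl. `LazyCFArray` and `readchunk` are hypothetical names, the chunks are hand-written index ranges, and an in-memory array stands in for the file.

```julia
# Hypothetical wrapper, not a DiskArrays.jl or NCDatasets.jl type
struct LazyCFArray{T,N,FV}
    raw::Array{T,N}        # stands in for the data on disk
    fill::T                # _FillValue in packed units
    scale_factor::Float64
    fillvalue::FV          # replacement chosen once, when the array is opened
    cf::Bool               # whether CF transformations are applied
end

# Read one chunk (a tuple of index ranges) using the stored settings
function readchunk(a::LazyCFArray, chunk)
    block = a.raw[chunk...]
    a.cf || return block
    return [x == a.fill ? a.fillvalue : x * a.scale_factor for x in block]
end

a = LazyCFArray(Int16[1 2 3 4; 5 -999 7 8], Int16(-999), 0.1, NaN, true)
chunks = [(1:2, 1:2), (1:2, 3:4)]
results = [readchunk(a, c) for c in chunks]
```

Every chunk is transformed the same way without the caller repeating the keywords on each read.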
The proposed interface, where one specifies the missing value via a keyword, seems totally fine to me (provided that the default remains missing)!
This is now implemented as described here: https://alexander-barth.github.io/NCDatasets.jl/dev/other/#Fill-values-and-missing-values
Thank you for your suggestions and feedback.