NCDatasets.jl Is it possible to not use `missing`?

I've been looking through the code but can't quite find where this happens.

Always replacing _FillValue with missing is mostly useful, but sometimes you don't want missing in an array (e.g. on GPU). Is it possible to still get the other CF transforms but not replace the _FillValue?

Jun 05 '21 22:06 rafaqz

Would this not lead to ambiguities? For example, suppose the data on disk is [10,100], the scale factor is 10 and the _FillValue is 100. If you do not replace the _FillValue, you would get [100,100], if I understand your proposal correctly.

For your information, here is the place where the substitution is done: https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/src/cfvariable.jl#L505
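The ambiguity can be reproduced in plain Julia without touching a file. This is only a sketch of the arithmetic in the example; the variable names are made up for illustration and are not NCDatasets internals.

```julia
# Values as stored on disk, with the attributes from the example above
raw = [10, 100]
scale_factor = 10
fill_value = 100   # the on-disk _FillValue

# Current behaviour: fill values become `missing`, everything else is scaled
with_missing = [x == fill_value ? missing : x * scale_factor for x in raw]

# Hypothetical behaviour: scale the data but leave the fill value untouched
kept_fill = [x == fill_value ? x : x * scale_factor for x in raw]

println(with_missing)  # [100, missing]
println(kept_fill)     # [100, 100]
```

In the second result the valid value 10 * 10 = 100 can no longer be told apart from the untouched fill value 100.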

Jun 06 '21 19:06 Alexander-Barth

Ok that makes sense.

Probably the FillValue would need to be transformed as well I guess, which should remove that ambiguity?

Another option is choosing a fill value, like NaN, etc.

It would be good to be able to set the missing value before a file is loaded, for a few reasons. Other values can be much faster than Missing on CPUs in a modelling context, and being able to copy directly to GPU without preprocessing is useful.

When you are using a stream of thousands of files, overall much larger than memory, being able to apply things like this lazily at load time is useful.

Jun 06 '21 23:06 rafaqz

If you want to propose an interface for using a different value for _FillValue, this would be helpful. Currently one needs to do this:

using NCDatasets
ds = NCDataset("file.nc")
float_var = nomissing(ds["float_var"][:,:], 9999.) # or coalesce
int_var = nomissing(ds["int_var"][:,:], 123)
close(ds)

I am not sure if you are aware of the NCDatasets.nomissing function.
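As a rough illustration of that two-step pattern, the sketch below uses Base's coalesce on an in-memory array standing in for data already read from a file. The array contents are invented for the example.

```julia
# Stand-in for what indexing a variable returns: an array that may hold `missing`
data = Union{Missing,Float64}[1.0, missing, 3.0]

# Second step: replace `missing` by a chosen value. This allocates a new array.
filled = coalesce.(data, 9999.0)

println(filled)          # [1.0, 9999.0, 3.0]
println(eltype(filled))  # Float64
```

The result no longer has a Union element type, but it is a second array, separate from the one that was read.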

Jul 02 '21 13:07 Alexander-Barth

Thanks, I wasn't aware of nomissing. But I was hoping this could happen without both conversion steps and the extra memory allocations that requires.

We could use a keyword argument like fillvalue to a method like read:

read(ds["float_var"]; fillvalue=NaN)

And the default fillvalue would be missing.
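To make the intended semantics concrete, here is a minimal sketch in plain Julia. The function `cf_transform` and its keywords `disk_fill` and `fillvalue` are hypothetical and only illustrate the proposal; they are not part of NCDatasets.

```julia
# Hypothetical helper: applies scale/offset and substitutes `fillvalue`
# wherever the raw value equals the on-disk fill value `disk_fill`.
function cf_transform(raw; scale_factor=1, add_offset=0,
                      disk_fill=nothing, fillvalue=missing)
    map(raw) do x
        if disk_fill !== nothing && x == disk_fill
            fillvalue                      # replaced in the same pass
        else
            x * scale_factor + add_offset  # usual CF scaling
        end
    end
end

raw = Int16[10, 100]

# Default: fill values become `missing`
a = cf_transform(raw; scale_factor=0.5, disk_fill=100)

# With fillvalue=NaN the result is a plain Float64 array
b = cf_transform(raw; scale_factor=0.5, disk_fill=100, fillvalue=NaN)

println(a)          # [5.0, missing]
println(eltype(b))  # Float64
```

Because the substitution happens in the same pass as the scaling, no intermediate array with missing is created.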

Jul 02 '21 14:07 rafaqz

And for reading a subset, should this look like read(ds["float_var"],:,:,1; fillvalue=NaN), which is a bit similar to view(array,:,:,1)?

For information, there is also the experimental/un-exported in-place NCDatasets.load! function:

# NCDatasets.load!(ncvar::Variable, data, indices)
using NCDatasets
ds = NCDataset("file.nc")
ncv = ds["vgos"].var
# data must have the right shape and type
data = zeros(eltype(ncv), size(ncv))
NCDatasets.load!(ncv, data, :, :, :)

# loading a subset
data = zeros(5) # must have the right shape and type
NCDatasets.load!(ds["temp"].var, data, :, 1) # loads the 1st column
close(ds) # close only after all reads are done

Maybe load! should have been called read!...

Jul 02 '21 20:07 Alexander-Barth

Yes something like that.

I forgot it was load for NCDatasets. ArchGDAL and GeoData use read, as do HDF5 and Base Julia. But some other packages seem to use load as well, so it's not totally clear cut. Whichever it is, it could also take an argument like cf to specify whether CF transformations are applied?

load(var, I...; cf=true, fillvalue=NaN, some_other_kw=x)
load!(var, data, I...; cf=true, fillvalue=NaN, some_other_kw=x)

But, thinking about it, this isn't actually a complete solution. With DiskArrays.jl chunking we may want this information stored as fields of the object, so each chunk will be loaded with the right transformation.

For example, with a struct like this:

struct NCDiskArray{HC,EC,FV}
    haschunks::HC
    eachchunk::EC
    fillvalue::FV
    cf::Bool
end

Or something.

Jul 02 '21 23:07 rafaqz

The proposed interface, where one specifies the missing value via a keyword, seems totally fine to me (provided that the default remains missing)!

Feb 22 '22 11:02 Datseris

This is now implemented, as described here: https://alexander-barth.github.io/NCDatasets.jl/dev/other/#Fill-values-and-missing-values

Thank you for your suggestions and feedback.

Feb 01 '24 09:02 Alexander-Barth
