NCDatasets.jl
Use DiskArrays for indexing, broadcasting etc
I am opening this issue to discuss whether this package should use DiskArrays.jl to implement efficient indexing, reductions and broadcasting for NCDatasets. This would enable workflows like the sketch below without loading all data into memory first.
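For illustration, this is the kind of workflow it would enable (a sketch only; the file name and variable name are placeholders):

```julia
using NCDatasets

ds = NCDataset("large_file.nc")   # placeholder file
v = ds["temperature"]             # placeholder variable
# broadcast + reduction evaluated chunk by chunk, never materializing v fully:
m = maximum(v .- 273.15)
close(ds)
```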
There would be 3 ways to combine the two packages:

- Make `Variable` a direct subtype of `AbstractDiskArray`, implement the necessary methods (analogous to this PR) and delete the old getindex/setindex. This would add a hard dependency on DiskArrays.
- Use the `implement_diskarray` macro to directly overwrite the Base methods. This could happen with `Requires`, so that users only get this behavior when they load the DiskArrays package. This would add a soft dependency on DiskArrays.
- Simply show in the Readme/docs how to wrap a `Variable` into a DiskArray, e.g. like in the unit tests https://github.com/meggart/DiskArrays.jl/blob/master/test/runtests.jl#L6-L38 (see the sketch below). Then the user can decide how to use a `Variable`, but would have to add some boilerplate code.
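As an illustration of the third option, a minimal wrapper along the lines of the linked unit tests could look like the following sketch (`DiskVariable` is an illustrative name, not an existing type):

```julia
using DiskArrays

# Thin AbstractDiskArray wrapper around an array-like variable.
struct DiskVariable{T,N,A<:AbstractArray{T,N}} <: DiskArrays.AbstractDiskArray{T,N}
    parent::A
end

Base.size(v::DiskVariable) = size(v.parent)

# DiskArrays hands readblock!/writeblock! already-resolved unit ranges,
# so the wrapper only needs to copy data in and out.
function DiskArrays.readblock!(v::DiskVariable, aout, r::AbstractUnitRange...)
    aout .= v.parent[r...]
end

function DiskArrays.writeblock!(v::DiskVariable, ain, r::AbstractUnitRange...)
    v.parent[r...] .= ain
end
```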
Let me know if you have questions or which option you would prefer in general, and I can help with a PR.
I just tried the master version of your NetCDF.jl package and the DiskArrays integration works very nicely and naturally! Do I understand it correctly that the disk access will be done in chunks of data?

It seems that `getchunksize` / `haschunks` return the chunk size / chunk mode as defined in the NetCDF file. Is this correct? Can you explain how NetCDF.jl / DiskArrays.jl behave if they are not defined?

Thanks!
> Do I understand it correctly, that the disk access will be done in chunks of data?

Yes.

> It seems that getchunksize / haschunks are the chunk size / chunk mode as defined in the NetCDF file. Is this correct?

Yes.

In case the chunks are not defined, DiskArrays still operates on chunks, but it falls back to this function https://github.com/meggart/DiskArrays.jl/blob/master/src/chunks.jl#L65-L77 which simply guesses a chunk size. This is controlled by the parameter `default_chunk_size`, which sets the approximate amount of data you want to hold in memory at a time (defaults to 100 MB):
```julia
julia> using DiskArrays

julia> DiskArrays.estimate_chunksize((1000,1000,1000), sizeof(Float64))
(1000, 1000, 12)

julia> DiskArrays.default_chunk_size[] = 1  # Set chunks to 1 MB only
1

julia> DiskArrays.estimate_chunksize((1000,1000,1000), sizeof(Float64))
(1000, 125, 1)
```
And just to add one thing: the chunks are only used for broadcast and reductions. Simple calls to `getindex` and `setindex!` will directly be translated to a `readblock!` call.
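Roughly, the translation works like this (an illustrative sketch only; the real `interpret_indices_disk` also resolves trailing dimensions, linear indices and Colons):

```julia
using DiskArrays

# Illustrative only: a getindex with unit ranges becomes a single block read.
function getindex_disk_sketch(a, i::AbstractUnitRange...)
    aout = Array{eltype(a)}(undef, length.(i)...)
    DiskArrays.readblock!(a, aout, i...)  # one contiguous read from disk
    return aout
end
```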
> [...] DickArrays.jl [...]

We need built-in auto-spelling for this package 😄
Nice work with DiskArrays! Combined with ESDL and built-in distributed workloads, I feel that the JuliaClimate stack is way better now than it was when I began developing ClimateTools 3-4 years ago. I will look into how I can use those packages and develop more climate analysis on top of them.
I made an attempt in a branch `DiskArrays`, but I am hitting this error:

```
Got exception outside of a @test
ArgumentError: Indices of type Tuple{StepRange{Int64,Int64},StepRange{Int64,Int64}} are not yet
Stacktrace:
 [1] interpret_indices_disk(::NCDatasets.Variable{Float64,2}, ::Tuple{StepRange{Int64,Int64},StepRange{Int64,Int64}}) at /Users/travis/.julia/packages/DiskArrays/RJ0eT/src/DiskArrays.jl:61
```
I guess it comes from here:
https://github.com/meggart/DiskArrays.jl/blob/f84f40afe9b0d381732103b692133d641d30b7e8/src/DiskArrays.jl#L61
It seems that DiskArrays only supports indexing with a stride (step) equal to one. Is this correct? Would it be possible to also support Julia 1.0.5 (I still support this version in my latest release)? But I admit, it is quite annoying to do so. Maybe it is better if I report such questions in the DiskArrays issue tracker; let me know what you prefer.
Having issues collected here in this thread is fine. Regarding the Julia compatibility, I just created a branch that checks for the Julia version and for Julia 1.0 only implements getindex/setindex! but omits the other niceties like reductions and broadcasting, which need at least Julia 1.2. Would this work for you? https://github.com/meggart/DiskArrays.jl/pull/10
Regarding the strided reading, I am not sure what to do. The simplest solution would be to add a method that simply reads the whole array and then returns only the desired values. The positive thing is that it would work out of the box for all DiskArray backends.

An alternative would be to extend the DiskArray interface with an optional `readblock_strided` function, which takes the strides as a second argument. Then backends like NetCDF or HDF5, which provide a specialized function for strided reading, could implement this function, while other backends like Zarr could fall back to a default solution as described above.

How common is it for you to read StepRanges along a dimension? I actually never needed it, but if it is important in other fields, we can make support for this first-class.
> Having issues collected here in this thread is fine. Regarding the Julia compatibility, I just created a branch that checks for the Julia version and for Julia 1.0 only implements getindex/setindex! but omits the other niceties like reductions and broadcasting, which need at least Julia 1.2. Would this work for you? meggart/DiskArrays.jl#10
Yes, this would be great. Let's try this. But if it gets too complicated we will have to drop version 1.0.
> Regarding the strided reading, I am not sure what to do. The simplest solution would be to add a method that simply reads the whole array and then returns only the desired values. The positive thing is that it would work out of the box for all DiskArray backends.
>
> An alternative would be to extend the DiskArray interface with an optional `readblock_strided` function, which takes the strides as a second argument. Then backends like NetCDF or HDF5, which provide a specialized function for strided reading, could implement this function, while other backends like Zarr could fall back to a default solution as described above.
>
> How common is it for you to read StepRanges along a dimension? I actually never needed it, but if it is important in other fields, we can make support for this first-class.
I actually do use strided reads sometimes, e.g. to make a quick plot of a high-resolution dataset. Another use case could be computing e.g. an average over 3x3 adjacent grid cells when the dataset is too large to hold in memory (there are of course multiple ways to do it, but reading with a stride of 3 seems to me the most natural).

As far as I can tell, this is implemented in Python, Octave and MATLAB. It would be sad to lose feature parity here (when dealing with very large datasets where reading the whole array is not possible).
The `readblock_strided` approach seems to me the best solution.
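For reference, the generic fallback (read the covering unit ranges, then subsample in memory) could be as small as this sketch (`readblock_strided_fallback!` is a hypothetical name):

```julia
using DiskArrays

# Assumes aout has size length.(r); works for positive and negative steps.
function readblock_strided_fallback!(a, aout, r::StepRange...)
    ur = map(s -> min(first(s), last(s)):max(first(s), last(s)), r)  # covering ranges
    tmp = Array{eltype(a)}(undef, length.(ur)...)
    DiskArrays.readblock!(a, tmp, ur...)             # one unstrided read
    sel = map((s, u) -> s .- (first(u) - 1), r, ur)  # strided positions within tmp
    aout .= view(tmp, sel...)
end
```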
Maybe multiple dispatch can help here?
The NetCDF/HDF5/Zarr package defines what type of indices it supports:

```julia
readblock!(v::ZarrVar, aout, r::AbstractUnitRange...)
readblock!(v::NetCDFVar, aout, r::Colon...)             # calls nc_get_var (all the data)
readblock!(v::NetCDFVar, aout, r::AbstractUnitRange...) # calls nc_get_vara (stride = 1)
readblock!(v::NetCDFVar, aout, r::StepRange...)         # calls nc_get_vars (arbitrary stride)
```

In DiskArrays we define generic, less-specific implementations:

```julia
readblock!(v, aout, r::StepRange...)  # call readblock! with unit ranges and then subset
```
> Maybe multiple dispatch can help here?
Funny, I was experimenting with exactly the same idea yesterday on this branch: https://github.com/meggart/DiskArrays.jl/pull/11 The only caveat here was that it is easy to run into method ambiguities when you are too lax on the types of `r`; e.g. `readblock!(v::ZarrVar, aout, r...)` would be ambiguous. However, I still like this approach: a package can decide whether it wants to support `OrdinalRange` or `AbstractUnitRange`, and we have the fallback solution.
So, if no one objects, I would merge the two PRs to DiskArrays soon and make a release afterwards. Then you can give your branch another try.
Great, thanks Fabian! I will test the updated version of DiskArrays. I will also make some benchmarks; so far NCDatasets is about 2x faster than Python and R. I hope we can keep this advantage...
Hi @Alexander-Barth I just wanted to let you know that the DiskArray branches are merged and tagged. What did your performance shootout result in?
Thank you for letting me know! I am still trying to update the branch so that it passes the test suite. I have some problems with NetCDF scalars (arrays with 0 dimensions), but at this stage I am not sure what kind of changes are required where.

@meggart, do you have an opinion on whether a NetCDF scalar should be a 0D array or a regular Julia scalar?
When I index the `PseudoDiskArray` example wrapping a 0D array, I get the following error:
```julia
julia> A = randn(())
0-dimensional Array{Float64,0}:
0.19403206384705804

julia> A[1]
0.19403206384705804

julia> PDA = PseudoDiskArray(A)
Disk Array with size

julia> PDA[1]
ERROR: BoundsError: attempt to access ()
  at index [1]
Stacktrace:
 [1] indexed_iterate at ./tuple.jl:63 [inlined] (repeats 2 times)
 [2] interpret_indices_disk(::PseudoDiskArray{Float64,0,Array{Float64,0}}, ::Tuple{Int64}) at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:111
 [3] getindex_disk(::PseudoDiskArray{Float64,0,Array{Float64,0}}, ::Int64) at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:58
 [4] getindex(::PseudoDiskArray{Float64,0,Array{Float64,0}}, ::Int64) at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:182
 [5] top-level scope at REPL[41]:1
 [6] eval(::Module, ::Any) at ./boot.jl:330
 [7] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:86
 [8] run_backend(::REPL.REPLBackend) at /home/abarth/.julia/packages/Revise/Pcs5V/src/Revise.jl:1073
 [9] top-level scope at none:0
```
But this works:
```julia
julia> PDA[:]
Reading at index
1-element Array{Float64,1}:
 0.19403206384705804
```
I am using https://github.com/meggart/DiskArrays.jl/tree/1b69b350d320d671e27c13ef5de9aed7e23f17fc

It is indeed quite a strange use case of DiskArrays and NCDatasets to use scalar NetCDF variables, but some ocean models I work with (ROMS) use them.
This is a good catch and definitely a bug in DiskArrays. I will fix this.
https://github.com/meggart/DiskArrays.jl/pull/14
The bugfix is merged and tagged. Could you try this again?
Thank you for the fix! I confirm that 0D arrays work now. The next issue is arrays with unlimited dimensions (in NetCDF 3.x you can have one, and in NetCDF 4.x you can have several). The idea is that the variable grows as needed. It seems that this is not supported in DiskArrays. I guess that an additional function is needed to tell DiskArrays whether a given dimension is unlimited or not.

For your information, this is the type of code that I would like to run:
```julia
filename = tempname();
ds = NCDataset(filename, "c")
defDim(ds, "x", 10)
defDim(ds, "Time", Inf)
defVar(ds, "w", Float64, ("x", "Time"))
ds["w"][:,:] = ones(10,10)
# w should grow along the unlimited dimension
ds["w"][:,:] = ones(10,15)
```
I get this kind of error:

```
ERROR: DimensionMismatch("new dimensions (10, 0) must be consistent with array size 100")
Stacktrace:
 [1] (::Base.var"#throw_dmrsa#196")(::Tuple{Int64,Int64}, ::Int64) at ./reshapedarray.jl:41
 [2] reshape(::Array{Float64,2}, ::Tuple{Int64,Int64}) at ./reshapedarray.jl:45
 [3] setindex_disk!(::NCDatasets.Variable{Float64,2}, ::Array{Float64,2}, ::Function, ::Vararg{Function,N} where N) at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:65
 [4] setindex! at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:189 [inlined]
 [5] setindex!(::NCDatasets.CFVariable{Float64,2,NCDatasets.Variable{Float64,2},NCDatasets.Attributes,NamedTuple{(:fillvalue, :scale_factor, :add_offset, :calendar, :time_or...
```
OK, this is of course a trickier use case. Note that `ds["w"][1:10,1:15] = ones(10,15)` should already work with the current implementation. I am a bit hesitant to generally allow colon indexing to do everything. For example, do you think that for a disk array `a` of size 10x10, `a[:,:] = ones(4,4)` should be equivalent to `a[1:4,1:4] = ones(4,4)`, i.e. let the RHS determine the resolution of the Colon? Seems a bit counterintuitive to me; maybe others could comment as well.
> For example, do you think that for a disk array a of size 10x10 this a[:,:] = ones(4,4) should be equivalent to a[1:4,1:4] = ones(4,4), i.e. let the RHS determine the resolution of the Colon? Seems a bit counterintuitive to me, maybe others could comment as well.

I agree this shouldn't work.

> ds["w"][:,:] = ones(10,10) # w should grow along the unlimited dimension ds["w"][:,:] = ones(10,15)

It seems like a strange use of `:` to me too... Could you instead define a method to do that? Like `expand!` or something, so it's clearly labelled as something new that doesn't exist in Base?
In Python's netCDF4, these statements work in the same way (as currently in NCDatasets):

```python
import numpy as np
import netCDF4

f = netCDF4.Dataset('test.nc', 'w')
dim_t = f.createDimension('time', None)
nctime = f.createVariable('time', np.float32, ('time'))
nctime[:] = np.arange(0, 5)
nctime[:] = np.arange(0, 10)
f.close()
```
which produces:

```
netcdf test {
dimensions:
    time = UNLIMITED ; // (10 currently)
variables:
    float time(time) ;
data:
    time = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ;
}
```
This is quite a reasonable choice to me, as the NetCDF variable is declared as "grow-able", and it is certainly more convenient for the user than writing:

```julia
nctime[1:length(data1)] = data1
```
In NetCDF 4 there can be multiple unlimited dimensions, for instance if the 2nd and 3rd dimensions are unlimited. This

```julia
ncdata[:, 1:size(data,2), 1:size(data,3)] = data
```

is quite unwieldy compared to just:

```julia
ncdata[:,:,:] = data
```

It reminds me of my Fortran 77 days, when I had to specify the data and the length of the dimensions. 😉 (and hell breaks loose if there is an inconsistency)
I totally agree that the functionality is good and useful. But using `Colon`/`setindex!` syntax like that is not Julian - you can't do that with `Array`. It will create problems for standardising packages like DiskArrays.jl and GeoData.jl.

This irregularity is already holding back implementing a DiskArrays.jl backend here, which is holding back using DiskArrays.jl in GeoData.jl (which will be a game-changing standardisation).

Why not implement your own more flexible `setindex!`-style method:

```julia
grow!(ncdata, data)
grow!(ncdata, data, 2)
```

etc... Then you can do anything you like with it. You do lose `[]` bracket `setindex!` syntax, but people use `view`, so it's not unfamiliar.
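For what it's worth, a minimal sketch of what such a `grow!` could do (hypothetical, not an existing API):

```julia
using DiskArrays

# The explicit function, not setindex!, takes the new extent from the RHS,
# so setindex! itself can stay strictly bounds-checked.
function grow!(v, data::AbstractArray)
    inds = map(n -> 1:n, size(data))          # extent determined by the RHS
    DiskArrays.writeblock!(v, data, inds...)  # backend grows unlimited dims as needed
    return v
end
```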
FWIW, for testing purposes I have implemented a new branch at DiskArrays https://github.com/meggart/DiskArrays.jl/commit/11680a6a6453961582338ae97a7840e62a7bf50e which adds a `resizable_indices` trait, where data types can decide which indices get expanded in the way @Alexander-Barth describes. This means that data types which don't explicitly enable this behavior will not change in behavior. However, I still think this syntax would not be very Julia-like, so I was holding back merging this until there is a bit more feedback.
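Roughly, the idea of the trait is the following (a sketch; `unlimited_dimension_numbers` is a hypothetical helper, and the exact API on that branch may differ):

```julia
# Default: no dimension may grow on setindex!.
DiskArrays.resizable_indices(::AbstractArray) = ()

# A backend opts in by listing the dimensions that are allowed to grow,
# e.g. the unlimited dimensions of a NetCDF variable.
DiskArrays.resizable_indices(v::NCDatasets.Variable) = unlimited_dimension_numbers(v)
```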
Thanks @meggart, I can confirm that the code in https://github.com/Alexander-Barth/NCDatasets.jl/issues/79#issuecomment-610598730 now works.

For this test case, https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/test/test_append2.jl#L30, I had to make a small change in `DiskArrays.get_size2`, because the NetCDF variable has three dimensions (but is sliced with `[:,:,1]`) while the data (i.e. the RHS) has just two. Therefore, your variable `a` has 3 elements but `v` only two, which leads to the out-of-bounds error copied below.

```julia
function get_size2(ri, a, v)
    ntuple(i -> (in(i, ri) && (i <= length(v))) ? v[i] : a[i], length(a))
end
```

Do you agree with this change in DiskArrays?
```julia
julia> tempvar[:,:,itime] = [i+j for i = 1:length(lons), j = 1:length(lats)]
ERROR: BoundsError: attempt to access (22, 13)
  at index [3]
Stacktrace:
 [1] getindex(::Tuple, ::Int64) at ./tuple.jl:24
 [2] #19 at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:189 [inlined]
 [3] ntuple(::DiskArrays.var"#19#20"{Tuple{Int64},Tuple{Int64,Int64,Int64},Tuple{Int64,Int64}}, ::Int64) at ./ntuple.jl:18
 [4] get_size2(::Tuple{Int64}, ::Tuple{Int64,Int64,Int64}, ::Tuple{Int64,Int64}) at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:189
 [5] get_size2(::NCDatasets.Variable{Float32,3,NCDataset}, ::Array{Float32,2}) at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:188
 [6] setindex_disk!(::NCDatasets.Variable{Float32,3,NCDataset}, ::Array{Float32,2}, ::Function, ::Vararg{Any,N} where N) at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:72
 [7] setindex! at /home/abarth/.julia/dev/DiskArrays/src/DiskArrays.jl:201 [inlined]
 [8] setindex!(::NCDatasets.CFVariable{Float32,3,NCDatasets.Variable{Float32,3,NCDataset},NCDatasets.Attributes{NCDataset{Nothing}},NamedTuple{(:fillvalue, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor),NTuple{6,Nothing}}}, ::Array{Int64,2}, ::Colon, ::Colon, ::Int64) at /home/abarth/.julia/dev/NCDatasets/src/cfvariable.jl:548
 [9] top-level scope at REPL[41]:1
 [10] eval(::Module, ::Any) at ./boot.jl:331
 [11] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [12] run_backend(::REPL.REPLBackend) at /home/abarth/.julia/packages/Revise/Pcs5V/src/Revise.jl:1073
 [13] top-level scope at none:0
```
Do you also support arrays of strings? They were introduced in NetCDF 4. I am getting this error message:
```julia
julia> include("/home/abarth/.julia/dev/NCDatasets/test/test_variable.jl");
ERROR: LoadError: Type String does not have a definite size.
Stacktrace:
 [1] sizeof at ./essentials.jl:451 [inlined]
 [2] estimate_chunksize at /home/abarth/.julia/dev/DiskArrays/src/chunks.jl:65 [inlined]
 [3] eachchunk_view(::DiskArrays.Unchunked, ::SubArray{String,2,NCDatasets.Variable{String,2,NCDataset},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},false}) at /home/abarth/.julia/dev/DiskArrays/src/subarrays.jl:35
 [4] eachchunk(::DiskArrays.SubDiskArray{String,2}) at /home/abarth/.julia/dev/DiskArrays/src/subarrays.jl:25
 [5] fill!(::DiskArrays.SubDiskArray{String,2}, ::String) at /home/abarth/.julia/dev/DiskArrays/src/ops.jl:140
 [6] copyto! at ./broadcast.jl:872 [inlined]
 [7] materialize!(::DiskArrays.SubDiskArray{String,2}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Base.RefValue{String}}}) at ./broadcast.jl:823
 [8] (::var"#15#18")(::NCDataset{Nothing}) at /home/abarth/.julia/dev/NCDatasets/test/test_variable.jl:61
 [9] NCDataset(::var"#15#18", ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/abarth/.julia/dev/NCDatasets/src/dataset.jl:286
 [10] NCDataset(::Function, ::String, ::Vararg{String,N} where N) at /home/abarth/.julia/dev/NCDatasets/src/dataset.jl:284
 [11] top-level scope at /home/abarth/.julia/dev/NCDatasets/test/test_variable.jl:14
 [12] include(::String) at ./client.jl:439
 [13] top-level scope at REPL[3]:1
 [14] eval(::Module, ::Any) at ./boot.jl:331
 [15] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [16] run_backend(::REPL.REPLBackend) at /home/abarth/.julia/packages/Revise/Pcs5V/src/Revise.jl:1073
 [17] top-level scope at none:0
in expression starting at /home/abarth/.julia/dev/NCDatasets/test/test_variable.jl:14
```
I guess that in

```julia
estimate_chunksize(a::AbstractArray) = estimate_chunksize(size(a), sizeof(eltype(a)))
```

we need to assume a typical string length to estimate the chunk size.
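One possible workaround (my assumption: treat non-bitstypes as a fixed per-element size, here 32 bytes, which only affects the chunking heuristic):

```julia
# Fall back to a guessed element size for types like String with no definite size.
approx_elsize(::Type{T}) where {T} = isbitstype(T) ? sizeof(T) : 32
estimate_chunksize(a::AbstractArray) = estimate_chunksize(size(a), approx_elsize(eltype(a)))
```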
Similarly, in NetCDF 4 there are also variable-length arrays, which are loaded as `Array{Vector{T},N}` in NCDatasets. Can you confirm that they are within the scope of DiskArrays?
@meggart:

> However, I still think this syntax would not be very Julia-like, so I was holding back merging this until there is a bit more feedback.
This syntax means GeoData.jl and DimensionalData.jl will have to check the index size on every `setindex!` call to fully support DiskArrays.jl, and will need a method for rebuilding every dimension index when it grows. Otherwise the dimension index can be the wrong size after `setindex!`, which is usually not possible in Julia.

The same goes for NamedDims, AxisArrays and every other package that has a dimension index - they will need to handle this unusual behaviour to support DiskArrays.jl. I'm pretty concerned about it being merged because of these consequences.

DiskArrays.jl could define another method for this that is optionally supported, which I think would be better than having a broken `setindex!` everywhere.
Even in the current master version of DiskArrays, disk arrays can grow (https://github.com/Alexander-Barth/NCDatasets.jl/issues/79#issuecomment-611495930). The question is just whether you need to write it like this

```julia
ds["w"][1:10,1:15] = ones(10,15)
```

or whether you can use the shorter way (as in Python's netCDF4 and in the current version of NCDatasets):

```julia
ds["w"][:,:] = ones(10,15)
```

In both cases `setindex!` does change the size of the array.

I use unlimited time dimensions quite a lot, and I do not see myself writing the first variant all the time.
In the current case we at least know the size from the indices passed to `setindex!`. With the proposed changes we need to check the array bounds. But I also think the current behaviour is going to be a problem.

It means that strange errors will occur in packages wrapping DiskArrays.jl if they assume standard `Array` behaviour of `setindex!`. Usually you can rely on Julia throwing a bounds error if indices are larger than the array, so you would never check that the array size has changed.

To deal with this, all wrapper packages will need to depend on DiskArrays or Requires/DiskArrays to add specific dispatch methods to handle this. Otherwise they will need to check either the indices (current behaviour) or the indices and array size (proposed behaviour) for every type of `AbstractArray`.

To me it's preferable that methods like `setindex!` can be supported or not supported at all, than that some arguments will be fine and some that look basically identical will cause strange errors. I'll have to support this behaviour either way in GeoData.jl, but it would be better in a method outside of `setindex!`, so we don't have to do those checks every time, and so that `setindex!` is just vanilla Julia.
> I use unlimited time dimensions quite a lot and I do not see myself writing the first variant all the time.

To me this discussion is about weighing up the utility of this bracket syntax against the amount of trouble it will cause in the ecosystem, not just about whether the syntax is good (on which I totally agree, it's the simplest syntax for this).

My main question is whether this syntax is so much worse:

```julia
grow!(ds["w"], ones(10,15))
# vs
ds["w"][:,:] = ones(10,15)
```

It could also take arguments if it needs to:

```julia
grow!(ds["w"], ones(10,15), 20:30, 10:25)
# vs
ds["w"][20:30, 10:25] = ones(10,15)
```
I do understand both sides of the argument here. When I started `DiskArrays`, I tried to be as un-intrusive as possible in the underlying package's decisions. So all the package does is translate a `getindex` or `setindex!` call into a respective `readblock!` call, resolving trailing/missing indices and Colons. The bounds check should be done by the underlying package if necessary (or it just decides to resize the array).
So, at least so far, I think DiskArrays will not help you ensure that array sizes do not change in the meantime. I think it would be good to get some feedback from the Julia community on whether `setindex!` in itself should always keep the size of an AbstractArray. At least for dictionaries there is one example of behavior where the data structure grows on `setindex!`, although it looks like the new Dictionaries.jl (https://github.com/andyferris/Dictionaries.jl) broke with this grow-on-setindex behavior.
Maybe it would help if I added a keyword argument `checkbounds=false` to the DiskArrays `setindex!` implementation. Then, when you call `setindex!` in your DimensionalData code, you can explicitly enforce bounds checking. In addition, we can still discuss what the default behavior for any DiskArray should be...
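Something along these lines (a sketch; the function name and keyword are hypothetical, not part of any released DiskArrays API):

```julia
using DiskArrays

function setindex_disk_checked!(a, v::AbstractArray, i::AbstractUnitRange...;
                                checkbounds = false)
    # With checkbounds = true, out-of-range indices throw instead of growing the array.
    checkbounds && Base.checkbounds(a, i...)
    DiskArrays.writeblock!(a, v, i...)
    return v
end
```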