NCDatasets.jl icon indicating copy to clipboard operation
NCDatasets.jl copied to clipboard

Performance and array semantics

Open simonbyrne opened this issue 6 years ago • 3 comments

Calling functions which iterate over array elements (e.g. Statistics.mean) on netcdf datasets can be very slow. It would be useful to have some performance tips to e.g. first convert a dataset to an Array.

On that note, I noticed that simply calling Array(dataset) is also slow. I take it from the examples in the manual that the suggested way to convert is to call dataset[:]. However this has different behaviour from ordinary Julia multidimensional arrays, which return a vector:

julia> X = rand(3,3);

julia> X[:]
9-element Array{Float64,1}:
 0.1443316789104081
 0.8768272466421989
 0.01240381950170022
 0.6732249425627772
 0.4128628021781906
 0.2112718766221251
 0.9459472658879167
 0.8996044010964837
 0.47806340748050236

simonbyrne avatar May 29 '19 16:05 simonbyrne

Thank you for your insightful comments!

Indeed, Statistics.mean (and friends) are not overloaded and thus they are fetching every element individually from the netCDF file which is slow. I added a section in the documentation about performance as suggested.

It never occurred to me, to load an array in RAM with the Array constructor, but it makes sense and I added a method for this which give a speed-up as expected. If myvar is a 100x100 Float64 Matrix, before we had:

julia> @btime Array(ds["myvar"]) 
  115.359 ms (170042 allocations: 8.93 MiB)

Now after my recent commit:

julia> @btime Array(ds["myvar"]);
  69.404 μs (49 allocations: 170.43 KiB)

It is true that ncvar[:] is not consistent with the behaviour of regular Julia arrays, but it was a price that I was willing to pay for a concise syntax. In fact, the first NetCDF library that I used was the netcdf matlab toolbox (now abandoned) which implemented exactly the same (slightly inconsistent) behaviour in matlab.

What alternatives could be implemented?

ds["myvar"][:,:]  # does not work so well if the dimensionality of myvar is not known a priori
Array(ds["myvar"]) # works now, but it feels a bit wordy to me
ds["myvar"][] # like references, but it looks a bit obscure to most people I guess

Alexander-Barth avatar May 29 '19 20:05 Alexander-Barth

Great, thanks!

I don't think there's anything wrong with Array(ds["myvar"]). If you want something that would work for scalars as well, it would probably work to overload copy?

simonbyrne avatar May 29 '19 20:05 simonbyrne

I had a look at HDF5.jl (https://github.com/JuliaIO/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data) and they use read, for example:

variable = read(ds["myvar"])

Alexander-Barth avatar Jun 05 '19 09:06 Alexander-Barth