Rasters.jl
Rasters.jl copied to clipboard
Where selection for Netcdf raster slower than for in Memory raster with same dimensions
When I am doing a Where Selector it seems to be quite slow compared to other selection methods.
I expect that the Where selector is slower because it has to check the function for every element of the axis.
But in my case the selection takes 2 seconds while the selection function takes 30 nanoseconds and when I multiply it with the number of elements I get 10 microseconds.
This selection is much faster when I do it with random data that I generate compared to the selection on a netcdf based Raster.
When I construct a Raster with the same dimensions as ras but a random array the timing is similar to the fully random based Raster.
# Selection on random data
julia> r = Raster(rand(800,800, 289), (X(1:800), Y(1:800), Ti(DateTime(1995):Month(1):DateTime(2019))))
800×800×289 Raster{Float64,3} with dimensions:
X Sampled{Int64} 1:800 ForwardOrdered Regular Points,
Y Sampled{Int64} 1:800 ForwardOrdered Regular Points,
Ti Sampled{DateTime} DateTime("1995-01-01T00:00:00"):Month(1):DateTime("2019-01-01T00:00:00") ForwardOrdered Regular Points
extent: Extent(X = (1, 800), Y = (1, 800), Ti = (DateTime("1995-01-01T00:00:00"), DateTime("2019-01-01T00:00:00")))
missingval: missing
parent:
julia> @btime r[Ti=Where(x->year(x)==1995)]
29.809 ms (19 allocations: 58.60 MiB)
800×800×12 Raster{Float64,3} with dimensions:
X Sampled{Int64} 1:800 ForwardOrdered Regular Points,
Y Sampled{Int64} 1:800 ForwardOrdered Regular Points,
Ti Sampled{DateTime} DateTime[1995-01-01T00:00:00, …, 1995-12-01T00:00:00] ForwardOrdered Irregular Points
extent: Extent(X = (1, 800), Y = (1, 800), Ti = (DateTime("1995-01-01T00:00:00"), DateTime("1995-12-01T00:00:00")))
missingval: missing
parent:
# results of a NetCDF backed raster
julia> @btime ras[Ti=Where(x->year(x)==1995)]
1.994 s (643532 allocations: 240.25 MiB)
848×824×12 Raster{Union{Missing, Float32},3} with dimensions:
Dim{:rlon} Sampled{Float64} -28.4029998779297:0.0550011811599821:18.18300056457514 ForwardOrdered Regular Points,
Dim{:rlat} Sampled{Float64} -23.4029998779297:0.0550012159753107:21.863000869751005 ForwardOrdered Regular Points,
Ti Sampled{DateTime} DateTime[1995-01-16T12:00:00, …, 1995-12-16T12:00:00] ForwardOrdered Irregular Points
extent: Extent(rlon = (-28.4029998779297, 18.18300056457514), rlat = (-23.4029998779297, 21.863000869751005), Ti = (DateTime("1995-01-16T12:00:00"), DateTime("1995-12-16T12:00:00")))
missingval: missing
# Results of a random array with the same dimensions as ras
ulia> ra = Raster(rand(size(ras)...), dims(ras))
848×824×288 Raster{Float64,3} with dimensions:
Dim{:rlon} Sampled{Float64} -28.4029998779297:0.0550011811599821:18.18300056457514 ForwardOrdered Regular Points,
Dim{:rlat} Sampled{Float64} -23.4029998779297:0.0550012159753107:21.863000869751005 ForwardOrdered Regular Points,
Ti Sampled{DateTime} DateTime[1995-01-16T12:00:00, …, 2018-12-16T12:00:00] ForwardOrdered Irregular Points
extent: Extent(rlon = (-28.4029998779297, 18.18300056457514), rlat = (-23.4029998779297, 21.863000869751005), Ti = (DateTime("1995-01-16T12:00:00"), DateTime("2018-12-16T12:00:00")))
missingval: missing
julia> @btime ra[Ti=Where(x->year(x)==1995)]
32.535 ms (19 allocations: 63.97 MiB)
848×824×12 Raster{Float64,3} with dimensions:
Dim{:rlon} Sampled{Float64} -28.4029998779297:0.0550011811599821:18.18300056457514 ForwardOrdered Regular Points,
Dim{:rlat} Sampled{Float64} -23.4029998779297:0.0550012159753107:21.863000869751005 ForwardOrdered Regular Points,
Ti Sampled{DateTime} DateTime[1995-01-16T12:00:00, …, 1995-12-16T12:00:00] ForwardOrdered Irregular Points
extent: Extent(rlon = (-28.4029998779297, 18.18300056457514), rlat = (-23.4029998779297, 21.863000869751005), Ti = (DateTime("1995-01-16T12:00:00"), DateTime("1995-12-16T12:00:00")))
missingval: missing
[and 11 more slices...]
This might be a Rasters issue in the end. There is a difference between a YAXArrays and a Raster on the same data.
The above timings are from a Raster wrapping a YAXArray and this might contribute to the long timings.
But when I do the selection on a Raster compared to a YAXArray the selection is a factor four roughly slower. And the YAXArrays selection is also a factor two slower than the selection on an in memory based Raster.
urls = Dict(y=> "http://esgf1.dkrz.de/thredds/fileServer/cosmo-rea/reanalysis/EUR-6km/DWD/ECMWF-ERAINT/REA6/r1i1p1f1/COSMO/v1/mon/atmos/tas/v20230314/tas_EUR-6km_ECMWF-ERAINT_REA6_r1i1p1f1_COSMO_v1_mon_$(y)01-$(y)12.nc" for y in 1995:2018)
filenames = Dict(y=> joinpath("data", split(url, "/")[end]) for (y,url) in urls)
mkpath("data/")
#pathdict = Dict(y => download(url, filenames[y]) for (y, url) in urls)
# After download this should be this:
pathdict = [glob("tas_EUR*$y*.nc", "/home/fcremer/Documents/NFDI4Earth/deliverables/data")[1] for y in 1995:2018]
rlist = Raster.(pathdict, lazy=true, key="tas");
r = cat(rlist..., dims=Ti)
julia> dslist = open_dataset.(pathdict)
taslist = getproperty.(dslist, :tas)
c = cat(taslist..., dims=Ti)
julia> @btime c[Ti=Where(x->year(x)==1995)];
66.370 ms (26354 allocations: 1.98 MiB)
julia> @btime r[Ti=Where(x->year(x)==1995)];
232.956 ms (7357 allocations: 232.82 MiB)
This looks like the problem:
(643532 allocations: 240.25 MiB)
Which is probably because youre using lazy=true with NCDatasets.jl... maybe try main ?
Is this still true? the files dont seem to be there anymore to test it.
The data was moved. I try to test it this week.
The location of the files have been moved. The new position should be this:
path = "http://esgf1.dkrz.de/thredds/fileServer/cosmo-rea/reanalysis/EUR-6km/DWD/ECMWF-ERAINT/REA6/r1i1p1f1/COSMO/v1/mon/atmos/tas/v20230918/tas_EUR-6km_ECMWF-ERAINT_REA6_r1i1p1f1_COSMO_v1_mon_199501-199512.nc"
I will try it again in the next days.
I tested it again and these are the new timings:
julia> @btime c[Ti=Where(x->year(x)==1995)]; # YAXArrays with NetCDF backend
71.301 ms (26373 allocations: 1.98 MiB)
julia> @btime r[Ti=Where(x->year(x)==1995)]; # Rasters with NCDatasets backend
252.776 ms (6246 allocations: 232.31 MiB)
This might also be some difference between the backends.
This is with these package versions:
(Rasters) pkg> st
Project Rasters v0.11.1
Status `~/.julia/dev/Rasters/Project.toml`
[79e6a3ab] Adapt v4.0.4
[3da002f7] ColorTypes v0.11.5
[1fbeeb36] CommonDataModel v0.3.6
[187b0558] ConstructionBase v1.5.5
⌃ [0703355e] DimensionalData v0.27.2
⌅ [3c3547ce] DiskArrays v0.3.23
[411431e0] Extents v0.1.2
[1a297f60] FillArrays v1.11.0
[4c728ea3] Flatten v0.4.3
[68eda718] GeoFormatTypes v0.4.2
[cf35fbd7] GeoInterface v1.3.4
[e1d29d7a] Missings v1.2.0
[30363a11] NetCDF v0.11.8
[6fe1bfb0] OffsetArrays v1.14.0
[92933f4c] ProgressMeter v1.10.0
[3cdcf5f2] RecipesBase v1.3.4
[189a3867] Reexport v1.2.2
[efcf1570] Setfield v1.1.1
[c21b50f5] YAXArrays v0.5.6
[ade2ca70] Dates
[a63ad114] Mmap
Ok this not just the time to load the data?
getindex is not lazy in Rasters like in YAXArrays, it moves data to memory. But view is lazy!
Ah yeah right. I forgot that. So this is a useless comparison. I will test the view but most likely this can be closed
It is kind of annoying that the lazy syntax is harder to write in julia generally.