hdf5r
Benchmark with h5 package
I noticed that reading HDF5 files with the hdf5r package is much slower than with the deprecated h5 package (16 times slower in the example below). Am I doing something wrong?
   test replications elapsed relative
1    h5           10   0.041    1.000
2 hdf5r           10   0.662   16.146
MWE:
# create hdf5 file (6 vectors with 10k random numbers each)
h5file <- hdf5r::H5File$new("testdata.h5", "w")
for (i in paste0("vector", 1:6)) {
  h5file[[i]] <- runif(10000)
}
h5file$close_all()
# compare read speed when using h5 and hdf5r package
read_h5 <- function(file) {
  h5file <- h5::h5file(file, "r")
  sets <- h5::list.datasets(h5file)
  result <- lapply(sets, function(i) h5file[i][])
  h5::h5close(h5file)
  result
}
read_hdf5r <- function(file) {
  h5file <- hdf5r::H5File$new(file, "r")
  sets <- h5file$ls()$name
  result <- lapply(sets, function(i) h5file[[i]][])
  h5file$close_all()
  result
}
rbenchmark::benchmark(
  replications = 10,
  "h5" = read_h5("testdata.h5"),
  "hdf5r" = read_hdf5r("testdata.h5"))[, 1:4]
Hi, no, you are not doing anything wrong. It is a known issue that I unfortunately haven't had time to dig into recently.
Hi Hans-Peter,
a quick profiling exercise suggests that most of the time (80%) is spent in the $close_all() call:
Rprof("issue-92-hdf5r.out", line.profiling=TRUE)
h5file <- hdf5r::H5File$new("testdata.h5", "r")
sets <- h5file$ls()$name
result <- lapply(sets, function(i) h5file[[i]][])
h5file$close_all()
Rprof(NULL)
summaryRprof("issue-92-hdf5r.out", lines = "show")
$by.total
                       total.time total.pct self.time self.pct
R6Classes_H5File.R#260       0.08        80      0.08       80
R6Classes.R#43               0.02        20      0.02       20
#1                           0.02        20      0.00        0
Common_functions.R#99        0.02        20      0.00        0
high_level_UI.R#81           0.02        20      0.00        0
R6Classes_H5D.R#109          0.02        20      0.00        0
R6Classes_H5Group.R#95       0.02        20      0.00        0
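To isolate where the time goes without the profiler, one can also time the reads and the cleanup separately. A rough check, assuming the testdata.h5 file from the MWE above:
h5file <- hdf5r::H5File$new("testdata.h5", "r")
sets <- h5file$ls()$name
# time the actual reads
read_time <- system.time(result <- lapply(sets, function(i) h5file[[i]][]))
# time only the cleanup, which includes the gc() inside close_all()
close_time <- system.time(h5file$close_all())
read_time["elapsed"]
close_time["elapsed"]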
"most of the time (80%) is spent in the $close_all() call"
Yes, exactly.
When I commented out close_all() in a real-life script where hundreds of such HDF5 files are read and processed, hdf5r took about twice as long as h5. In the example above, after commenting out close_all() the time for hdf5r drops to about a quarter (but is still roughly 5 times slower than h5).
(Spurred by #86 I tried an hdf5r branch where the gc in close_all had been commented out. I got an error, though, and stopped, since I don't have much low-level HDF5 background.)
Yep, I observed similar issues and posted issue #85. I put up some benchmarks in that issue too.
This is an old issue, I know, but I found similar results regarding close_all()'s gc() call. I was using hdf5r as part of Seurat (a biological data package) to load a medium-sized number of h5 files, and found it was spending much more time gc()'ing than loading data.
I removed that line in close_all and saw demonstrable speed-ups. This aligns with what I've seen when using gc in general: oftentimes it's very fast, but some deeply nested, large objects (30 GB+) can really slow gc and object.size down, which is exacerbated by opening many h5 files. I haven't yet seen any corruption or errors with the gc commented out, so it could be worth revisiting and/or putting it behind a flag.
I don't fully understand what the gc does, but going off of 'If not all objects in a file are closed, the file remains open and cannot be re-opened the regular way.', maybe removing gc would lead to errors if I tried to reopen a file? In which case, letting the user open many files at once and then close them all at once could solve both problems, with some added complexity.