hdf5r
Benchmark with h5 package
I noticed that reading HDF5 files with the hdf5r package is much slower than with the deprecated h5 package (16 times slower in the example below). Am I doing something wrong?
   test replications elapsed relative
1    h5           10   0.041    1.000
2 hdf5r           10   0.662   16.146
MWE:
# create hdf5 file (6 vectors with 10k random numbers each)
h5file <- hdf5r::H5File$new("testdata.h5", "w")
for (i in paste0("vector", 1:6)) {
  h5file[[i]] <- runif(10000)
}
h5file$close_all()
# compare read speed when using h5 and hdf5r package
read_h5 <- function(file) {
  h5file <- h5::h5file(file, "r")
  sets <- h5::list.datasets(h5file)
  result <- lapply(sets, function(i) h5file[i][])
  h5::h5close(h5file)
  result
}
read_hdf5r <- function(file) {
  h5file <- hdf5r::H5File$new(file, "r")
  sets <- h5file$ls()$name
  result <- lapply(sets, function(i) h5file[[i]][])
  h5file$close_all()
  result
}
rbenchmark::benchmark(
  replications = 10,
  "h5" = read_h5("testdata.h5"),
  "hdf5r" = read_hdf5r("testdata.h5"))[, 1:4]
Hi, no, you are not doing anything wrong. It is a known issue that I unfortunately haven't had time to dig into recently.
Hi Hans-Peter,
a quick profiling exercise suggests that most of the time (80%) is spent in the $close_all() call:
Rprof("issue-92-hdf5r.out", line.profiling=TRUE)
h5file <- hdf5r::H5File$new("testdata.h5", "r")
sets <- h5file$ls()$name
result <- lapply(sets, function(i) h5file[[i]][])
h5file$close_all()
Rprof(NULL)
summaryRprof("issue-92-hdf5r.out", lines = "show")
$by.total
                       total.time total.pct self.time self.pct
R6Classes_H5File.R#260       0.08        80      0.08       80
R6Classes.R#43               0.02        20      0.02       20
#1                           0.02        20      0.00        0
Common_functions.R#99        0.02        20      0.00        0
high_level_UI.R#81           0.02        20      0.00        0
R6Classes_H5D.R#109          0.02        20      0.00        0
R6Classes_H5Group.R#95       0.02        20      0.00        0
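To isolate where the time goes without the profiler, one can also time the reads and the cleanup separately. A rough check, assuming the testdata.h5 file from the MWE above:
h5file <- hdf5r::H5File$new("testdata.h5", "r")
sets <- h5file$ls()$name
# time the actual reads
read_time <- system.time(result <- lapply(sets, function(i) h5file[[i]][]))
# time only the cleanup, which includes the gc() inside close_all()
close_time <- system.time(h5file$close_all())
read_time["elapsed"]
close_time["elapsed"]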
"most of the time (80%) is spent in the $close_all() call"
Yes, exactly.
When I commented out close_all() in a real-life script where hundreds of such HDF5 files are read and processed, hdf5r took about twice as long as h5. In the example above, after commenting out close_all() the time for hdf5r drops to about a quarter (but is still roughly 5 times slower than h5).
(Spurred by #86 I tried an hdf5r branch where the gc in close_all had been commented out. I got an error, though, and stopped, since I don't have much low-level HDF5 background.)
Yep, I observed similar issues and posted issue #85. I put up some benchmarks in that issue too.
This is an old issue, I know, but I found similar results regarding close_all()'s gc() call. I was using hdf5r as part of Seurat (a biological data package) to load a medium-sized number of h5 files, and found it was spending much more time gc()'ing than loading data.
I removed that line in close_all and saw demonstrable speed-ups. This aligns with what I've seen when using gc in general: oftentimes it's very fast, but some deeply nested, large objects (30 GB+) can really slow gc and object.size down, which is exacerbated by opening many h5 files. I haven't yet seen any corruption or errors with the gc commented out, so it could be worth revisiting and/or putting it behind a flag.
I don't fully understand what the gc does, but going off of 'If not all objects in a file are closed, the file remains open and cannot be re-opened the regular way.', maybe removing gc would lead to errors if I tried to reopen a file? In which case, letting the user open many files at once and then close them all at once could solve both problems, with some added complexity.