hdf5r
Significant performance loss after transitioning from the `h5` package
I'm seeing a significant performance loss after switching from `h5` to `hdf5r`. With `h5`, I was able to use the lower-level `openDataSet` and `readDataSet` commands, which were significantly faster than accessing data with `[[`. Using the `hdf5r` object methods `open()` and `read()` is faster than `[[`, but still much slower than `h5`.
Consider the following microbenchmark test (sample HDF file available here):
```r
get_dataset_hdf5r = function(f, table.path) {
  x = hdf5r::H5File$new(f)
  g = x$open(table.path)
  res = g$read()
  g$close()
  x$close()
  res
}

get_dataset_h5 = function(f, table.path, type = "double") {
  x = h5::h5file(f)
  g = h5::openDataSet(x, table.path, type)
  res = h5::readDataSet(g)
  h5::h5close(g)
  h5::h5close(x)
  res
}

myfile = system.file("sample-data/SampleQuasiUnsteady.hdf", package = "RAStestR")
mytable = "Results/Sediment/Output Blocks/Sediment/Sediment Time Series/Cross Sections/Vol Bed Change Cum"

library(microbenchmark)
microbenchmark(
  get_dataset_hdf5r(myfile, mytable),
  get_dataset_h5(myfile, mytable)
)
```
My results:
```
Unit: milliseconds
                               expr      min       lq      mean   median       uq       max neval
 get_dataset_hdf5r(myfile, mytable) 4.053642 4.197638 11.082077 4.313332 4.538343 552.76297   100
    get_dataset_h5(myfile, mytable) 1.606342 1.670565  2.271489 1.738831 1.833065  46.95775   100
```
This ignores the additional cost of transposing the result of the `hdf5r` method to match the structure of `h5` outputs.

Actually, using `[[` in `hdf5r` appears to be slightly faster than the `open`/`read` methods, but still slower than `h5`:
```r
get_dataset_hdf5r_b = function(f, table.path) {
  x = hdf5r::H5File$new(f)
  res = x[[table.path]][, ]
  x$close()
  res
}

microbenchmark(
  get_dataset_hdf5r_b(myfile, mytable),
  get_dataset_hdf5r(myfile, mytable),
  get_dataset_h5(myfile, mytable)
)
```
```
Unit: milliseconds
                                 expr      min       lq     mean   median       uq      max neval
 get_dataset_hdf5r_b(myfile, mytable) 3.162613 3.364921 3.664380 3.446250 3.592112 8.197476   100
   get_dataset_hdf5r(myfile, mytable) 3.240986 3.455113 3.905648 3.569875 3.758032 8.045706   100
      get_dataset_h5(myfile, mytable) 1.596080 1.676941 1.799975 1.733855 1.822180 5.037663   100
```
Hi Michael,

I assume the performance penalty comes from creating and destroying the R6 objects that represent the datasets and the file. If you open very many small datasets in different files thousands of times, there will indeed be a significant performance penalty. The same can happen if you open datasets very often and then discard the corresponding R6 pointer to the dataset. The package was designed for opening large datasets and reading large amounts of data from them.

There is a way to handle this by working with the raw ids, but it is a bit tricky, as you have to take care of closing them yourself.
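The raw-id route itself is not shown in this thread. A minimal sketch of a related workaround, using only the exported methods already appearing in the benchmarks above: create the dataset's R6 handle once and reuse it across reads, so the instantiation cost is paid a single time.

```r
# Sketch: keep one R6 dataset handle alive and reuse it, rather than
# paying the R6 creation/destruction cost on every access.
# `myfile` and `mytable` are the paths defined in the benchmarks above.
f = hdf5r::H5File$new(myfile, mode = "r")
dset = f[[mytable]]                                # instantiate the R6 object once
res_list = lapply(1:100, function(i) dset$read())  # reuse the same handle
dset$close()
f$close()
```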
I assume this is an artificial example? Can you explain your use case a little more, i.e. where this performance penalty becomes noticeable?

Thanks
Another benchmark; it doesn't look like opening/closing the file is the bottleneck:
```r
read_dataset_hdf5r = function(x, table.path) {
  x[[table.path]][, ]
}

read_dataset_h5 = function(x, table.path, type = "double") {
  g = h5::openDataSet(x, table.path, type)
  res = h5::readDataSet(g)
  h5::h5close(g)
  res
}

file_h5 = h5::h5file(myfile)
file_hdf5r = hdf5r::H5File$new(myfile)

microbenchmark(
  read_dataset_hdf5r(file_hdf5r, mytable),
  read_dataset_h5(file_h5, mytable)
)

file_hdf5r$close()
h5::h5close(file_h5)
```
Results:
```
Unit: microseconds
                                    expr      min       lq      mean   median        uq       max neval
 read_dataset_hdf5r(file_hdf5r, mytable) 1513.663 1571.355 1816.8625 1616.139 1666.3670 10526.595   100
       read_dataset_h5(file_h5, mytable)  290.791  317.848  381.4583  350.193  375.3845  2499.861   100
```
I have a package for reading HDF5 outputs from the HEC-RAS software package. The main usage modes are (1) dynamically accessing multiple tables from a single HDF file for exploratory analysis of results, and (2) comparing outputs of multiple similarly-structured HDF files. In general, this results in reading a few small tables (for output metadata) and a few larger tables (for actual results). I expect my users to be fairly unfamiliar with working with HDF files, so I have structured the package so that HDF file connections are managed for the user (i.e. users pass in filenames rather than HDF5 objects, and the files are opened/closed within the function).
I initially assumed that opening/closing was a bottleneck as well and have already put some time into optimizing my package to limit the number of open/close actions, but the performance hit is still much larger than expected (and frankly, too large for my use case).
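A hedged sketch of what "limiting the number of open/close actions" can amount to (the helper names here are hypothetical illustrations, not actual RAStestR code): cache open file handles in an environment so repeated reads from the same file skip the open/close cost.

```r
# Hypothetical helpers: cache open hdf5r file handles keyed by filename.
.file_cache = new.env(parent = emptyenv())

get_file = function(f) {
  if (!exists(f, envir = .file_cache))
    assign(f, hdf5r::H5File$new(f, mode = "r"), envir = .file_cache)
  get(f, envir = .file_cache)
}

close_all_files = function() {
  for (f in ls(.file_cache)) {
    get(f, envir = .file_cache)$close_all()  # close the file and its open objects
    rm(list = f, envir = .file_cache)
  }
}
```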
I would be interested in working with the raw IDs if that will actually get performance closer to `h5` standards. However, I do plan to publish my package, which means I need to rely on exported methods from `hdf5r` in order to pass CRAN checks.
Reading this, the benchmarks suggest that both `hdf5r` and `h5` spend roughly 80% of their time opening and closing the file. For `hdf5r`, opening the dataset is another bottleneck.
I will have to look into this use case to see if this can be made faster.
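That share is straightforward to check directly (a sketch, reusing `myfile` from the benchmarks above) by timing a bare file open/close in each package:

```r
# Isolate the cost of opening and closing the file itself.
microbenchmark(
  hdf5r = { f = hdf5r::H5File$new(myfile); f$close() },
  h5    = { f = h5::h5file(myfile); h5::h5close(f) }
)
```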
Some additional benchmarks. First, opening/closing a table in an already-opened file:
```r
library(microbenchmark)

myfile = system.file("sample-data/SampleQuasiUnsteady.hdf", package = "RAStestR")
mytable = "Results/Sediment/Output Blocks/Sediment/Sediment Time Series/Cross Sections/Vol Bed Change Cum"

h5_openclose = function() {
  table_h5 = h5::openDataSet(file_h5, mytable)
  h5::h5close(table_h5)
}

hdf5r_openclose = function() {
  table_hdf5r = file_hdf5r$open(mytable)
  table_hdf5r$close()
}

file_h5 = h5::h5file(myfile)
file_hdf5r = hdf5r::H5File$new(myfile)

microbenchmark(
  h5_openclose(),
  hdf5r_openclose()
)
```
```
Unit: microseconds
              expr      min        lq      mean    median        uq      max neval
    h5_openclose()  176.030  190.4915  242.0152  215.0605  233.8765 2144.997   100
 hdf5r_openclose() 1065.191 1105.1555 1226.5023 1138.4320 1175.4425 3720.234   100
```
Second, benchmarks for reading already-opened tables:
```r
table_h5 = h5::openDataSet(file_h5, mytable)
table_hdf5r = file_hdf5r$open(mytable)

microbenchmark(
  table_hdf5r$read(),
  h5::readDataSet(table_h5)
)
```
```
Unit: microseconds
                      expr     min       lq     mean   median       uq      max neval
        table_hdf5r$read() 569.139 592.1535 611.7184 601.3275 616.2565 1083.851   100
 h5::readDataSet(table_h5)  84.905  97.8120 110.3982 113.8290 117.7160  266.843   100
```
Thanks for the benchmarks. I will have a look at them soon.
Any updates on this?
Hi Michael, I just revisited the issue and found that:

1. Opening datasets in `hdf5r` is significantly slower than in `h5`. As Holger already mentioned, R6 object instantiation in `H5GTD_factory` seems to be the reason: https://github.com/hhoeflin/hdf5r/blob/35074993e414d5920b61f8401feada01c76a82e3/R/Common_functions.R#L99
2. Reading datasets is also slower, but I need to take a closer look to find out what's happening there.

There are no clear solutions yet. To address 1., my ideas would be:

1.1) Create something like dataset collection objects (e.g. through instantiation with a single `[`) which do not instantiate each dataset separately and implement e.g. iterators.
1.2) Switch from R6 to S3.

I would clearly prefer 1.1.

Cheers, mario
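For concreteness, a purely hypothetical sketch of idea 1.1 (none of these names exist in `hdf5r`): a collection object that stores only dataset paths and creates a short-lived R6 dataset object only when an element is actually read.

```r
# Hypothetical sketch of idea 1.1; nothing here is hdf5r API.
dataset_collection = function(file, paths) {
  structure(list(file = file, paths = paths), class = "h5_collection")
}

`[[.h5_collection` = function(x, i) {
  d = x$file$open(x$paths[[i]])  # one transient R6 object, created on demand
  on.exit(d$close())
  d$read()
}

# Usage sketch, with the file/table objects from the benchmarks above:
# coll = dataset_collection(file_hdf5r, mytable)
# coll[[1]]
```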