rhdf5
Bug and performance problems with "wide" data frames
I'm trying to store a data frame in an HDF5 file. Unfortunately, that fails for data frames with more than a certain number of columns. I've narrowed it down: a data frame with 1 row and 1093 columns fails, while one with 1 row and 1092 columns works. Setting a small chunk size does not change the outcome; as far as I understand it, chunking applies along the rows, and there is only one row in this example.
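To make the shape of the test data explicit, here is a quick check in base R (no rhdf5 needed). It only confirms that `data.frame(t(x))` produces a single row with one column per element of `x`, which is what makes these data frames "wide".

```r
# Same construction as in the sample script below
x <- c(1:1093) * 0.5
df_fail <- data.frame(t(x))

# One row, one column per element of x
dim(df_fail)             # 1 1093
head(names(df_fail), 3)  # "X1" "X2" "X3"
```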
Additionally, after the smaller data frame has been written successfully, reading it back in takes more than a full minute.
Here is a small sample script that illustrates both issues:
library(rhdf5)
# cleanup and create test file
h5closeAll()
if (file.exists("test.h5")) { file.remove("test.h5") }
h5createFile("test.h5")
h5createGroup("test.h5", "test")
# create test data frames
df_ok <- data.frame(t(c(1:1092)*0.5))
df_fail <- data.frame(t(c(1:1093)*0.5))
# write
h5write(df_ok, file="test.h5", name="test/df_ok")
h5write(df_fail, file="test.h5", name="test/df_fail")
# read
st <- proc.time()
df_ok_read <- h5read(file="test.h5", name="test/df_ok")
proc.time() - st
The relevant outputs follow. Writing df_fail fails with:
Error in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...) :
HDF5. Dataset. Unable to initialize object.
Reading df_ok back takes about 86 seconds:
user system elapsed
85.963 0.190 86.213
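A possible workaround, sketched here under assumptions and not verified against the failing setup: as far as I understand it, rhdf5 stores a data frame as a compound dataset with one field per column, so converting the data to a plain numeric matrix before writing may avoid the problem. This only applies when all columns are numeric. The temporary file and the dataset names `m_wide` and `m_wide_colnames` are made up for this sketch.

```r
library(rhdf5)

# Hypothetical temporary file for this sketch (not the test.h5 from the script)
h5file <- tempfile(fileext = ".h5")
h5createFile(h5file)

df_wide <- data.frame(t(c(1:1093) * 0.5))

# Assumption: every column is numeric, so as.matrix() yields a numeric matrix
m <- as.matrix(df_wide)
h5write(m, file = h5file, name = "m_wide")

# Column names are not stored with the matrix, so keep them separately
h5write(colnames(df_wide), file = h5file, name = "m_wide_colnames")

# Read back, timing the read, and rebuild the data frame
timing <- system.time(m_read <- h5read(file = h5file, name = "m_wide"))
df_read <- as.data.frame(m_read)
colnames(df_read) <- as.vector(h5read(file = h5file, name = "m_wide_colnames"))

h5closeAll()
```

Whether this is actually faster or sidesteps the write error on a given rhdf5 version would need to be checked; `timing` holds the measured read time for comparison.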
Tested on a laptop running 64-bit Ubuntu 18.04 with 16 GB RAM, R 3.6.0, and rhdf5 2.28.0.
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 6.0
year 2019
month 04
day 26
svn rev 76424
language R
version.string R version 3.6.0 (2019-04-26)
nickname Planting of a Tree