Performance: reading an h5 file is slow compared to h5py
Hi, I have a use case that needs to load an h5 file into memory as a cache. The file contains many datasets, and each dataset is a 1D array with thousands of rows.
Sorry, I can't provide the actual h5 file, but I can simulate one with a structure similar to my use case.
Code to generate such h5 file
import h5py
import pandas as pd

f = h5py.File("tmp.h5", "w")
for i in range(1, 15001):
    dataset_name = str(i)
    data = pd.DataFrame(
        {key: [1] * 3000 for key in ["a1", "a2", "a3", "a4", "a5", "a6"]}
    )
    data = data.astype(
        {"a1": "<u4", "a2": "<f8", "a3": "<f8", "a4": "<f8", "a5": "<f8", "a6": "<u8"}
    )
    f.create_dataset(
        dataset_name, data=data.to_records(index=False), compression=9, shuffle=False
    )
Here is my generated h5 file: tmp.tar.gz
Reader code
Here is rust code:
use hdf5::{File, H5Type};
use ndarray::{s, Array1};
use std::collections::HashMap;
use std::error::Error;
use std::result::Result as StdResult;

#[derive(H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
pub struct TmpData {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

pub fn read_to_mem(path: &str) -> StdResult<HashMap<String, Array1<TmpData>>, Box<dyn Error>> {
    let file = File::open(path)?; // open for reading
    let mut result = HashMap::new();
    for dataset in file.datasets()? {
        let data = dataset.read_slice_1d::<TmpData, _>(s![..])?;
        result.insert(dataset.name(), data);
    }
    Ok(result)
}

fn main() {
    let _ = read_to_mem("tmp.h5").unwrap();
}
And here is python code:
import h5py
f = h5py.File("tmp.h5")
cache = {k: v[()] for k, v in f.items()}
By comparison, the hdf5-rust code takes 8m19s to read the whole file, while the h5py code takes about 30 seconds.
I've tried enabling the f16 feature, but with no luck.
Am I doing something wrong? How can I improve the performance?
I think this has come up before (I can't find the issue), and it had to do with hdf5 converting every compound value internally when it could have been a plain copy or a no-op. You could try creating a flamegraph to verify this.
h5py might be reading the file differently from the naive approach used in this crate. We should look at its approach and do the same.
NumPy structured arrays use packed layouts by default. You can check that .dtype.itemsize in your case is equal to 44, whereas your Rust struct is repr(C), so its size is 48. It's no surprise, then: h5py does a direct read with zero work afterwards, whereas in Rust the layouts don't match and every field has to be copied into its place. So you'd want to do either:
- Use align=True when creating recarrays, then you can use it with a repr(C) struct
- Use repr(packed) on the struct, then you can use it with packed arrays
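To make the 44 vs 48 mismatch concrete, here is a minimal sketch that uses only the standard library. The struct names TmpDataC and TmpDataPacked are made up for this illustration; the fields mirror TmpData from the reader code above. The sizes noted in the comments assume a typical 64-bit target such as x86_64, where f64 and u64 are 8-byte aligned. It does not touch hdf5 at all, it only shows how the two layouts differ in memory.

```rust
use std::mem::{align_of, size_of};

// Same fields as TmpData, laid out with C rules: on x86_64 the compiler
// inserts 4 bytes of padding after `a1` so that `a2` starts at offset 8.
#[allow(dead_code)]
#[repr(C)]
struct TmpDataC {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

// Same fields with padding removed (C field order, alignment 1). This is
// the layout that matches a packed NumPy structured dtype.
#[allow(dead_code)]
#[repr(C, packed)]
struct TmpDataPacked {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

fn main() {
    // 4 (u32) + 4 (padding) + 4 * 8 (f64) + 8 (u64) = 48 on x86_64
    println!(
        "repr(C):      size = {}, align = {}",
        size_of::<TmpDataC>(),
        align_of::<TmpDataC>()
    );
    // 4 (u32) + 4 * 8 (f64) + 8 (u64) = 44
    println!(
        "repr(packed): size = {}, align = {}",
        size_of::<TmpDataPacked>(),
        align_of::<TmpDataPacked>()
    );
}
```

One caveat with the packed variant: Rust does not allow taking references to fields that may be misaligned, so fields of a packed struct should be copied out by value before use.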