Performance: reading h5 file is slow compared to h5py

Open WindSoilder opened this issue 2 years ago • 2 comments

Hi, I have a use case where I need to load some h5 files into memory as a cache. Each h5 file contains a lot of datasets, and each dataset is a 1-D array with thousands of rows.

Sorry, I can't provide the actual h5 file, but I can simulate a file with a structure similar to the one in my use case.

Code to generate such an h5 file

import h5py
import pandas as pd

f = h5py.File("tmp.h5", "w")

for i in range(1, 15001):
    dataset_name = str(i)
    data = pd.DataFrame(
        {key: [1] * 3000 for key in ["a1", "a2", "a3", "a4", "a5", "a6"]}
    )
    data = data.astype(
        {"a1": "<u4", "a2": "<f8", "a3": "<f8", "a4": "<f8", "a5": "<f8", "a6": "<u8"}
    )
    f.create_dataset(
        dataset_name, data=data.to_records(index=False), compression=9, shuffle=False
    )

# Close the file so everything is flushed to disk before it is read back.
f.close()

Here is my generated h5 file: tmp.tar.gz

Reader code

Here is rust code:

use hdf5::{File, H5Type};
use ndarray::{s, Array1};
use std::collections::HashMap;
use std::error::Error;
use std::result::Result as StdResult;

#[derive(H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
pub struct TmpData {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

pub fn read_to_mem(path: &str) -> StdResult<HashMap<String, Array1<TmpData>>, Box<dyn Error>> {
    let file = File::open(path)?; // open for reading
    let mut result = HashMap::new();
    for dataset in file.datasets()? {
        let data = dataset.read_slice_1d::<TmpData, _>(s![..])?;
        result.insert(dataset.name(), data);
    }
    Ok(result)
}


fn main() {
    let _ = read_to_mem("tmp.h5").unwrap();
}

And here is python code:

import h5py
f = h5py.File("tmp.h5")
cache = {k: v[()] for k, v in f.items()}

By comparison, the hdf5-rust code takes 8m19s to read the whole file, while the h5py code takes about 30 seconds.

I've tried enabling the f16 feature, but with no luck.

Am I doing something wrong? Or how can I improve the performance?

WindSoilder, Jan 16 '24 02:01
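The timings quoted above don't say how they were measured. For anyone reproducing them on the Rust side, a minimal sketch of in-process timing follows. `timed` is a hypothetical helper, not part of hdf5-rust, and the summing closure is only a stand-in for a call such as `read_to_mem("tmp.h5")`. It is probably also worth measuring a release build, since debug builds can be much slower.

```rust
use std::time::{Duration, Instant};

/// Run `work` once and return its result together with the elapsed wall-clock time.
/// (Hypothetical helper for illustration only.)
fn timed<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}

fn main() {
    // Placeholder workload; in the reader program above this would be
    // `timed(|| read_to_mem("tmp.h5"))`.
    let (sum, elapsed) = timed(|| (0..1_000_000u64).sum::<u64>());
    println!("sum = {sum}, took {elapsed:?}");
}
```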

I think this has come up before (I can't find the issue), and it had to do with hdf5 converting every compound value internally when the read could have been a plain copy or a no-op. You could try creating a flamegraph to verify this.

h5py might be using a different way of reading the file compared to the naive way in this crate. We should look at this approach and copy their way of doing it.

mulimoen, Jan 16 '24 10:01

Numpy structured arrays use packed layouts by default. You can check that .dtype.itemsize in your case is 44, whereas your Rust struct is repr(C), so its size is 48. So it's no surprise: h5py does a direct read with no work afterwards, whereas in Rust the layouts don't match and every field has to be copied into its place. You'd want to do one of the following:

  • Use align=True when creating recarrays, then you can use it with a repr(C) struct
  • Use repr(packed) on the struct, then you can use it with packed arrays

aldanor, Jan 30 '24 20:01
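To make the 44 vs 48 byte difference concrete, here is a minimal std-only sketch that does not use hdf5-rust. `AlignedRow`, `PackedRow` and `unpack` are hypothetical names chosen for illustration. The 48-byte figure assumes a typical 64-bit target where `f64` is 8-byte aligned. `unpack` only illustrates the kind of per-field copy a layout mismatch forces; it is not what the HDF5 library does internally.

```rust
use std::mem::size_of;

// Same fields as `TmpData` in the issue, with the natural C layout:
// 4 bytes of padding follow `a1` so that `a2` is 8-byte aligned.
#[allow(dead_code)]
#[repr(C)]
struct AlignedRow {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

// Packed variant: no padding, so the size is the sum of the field sizes.
#[allow(dead_code)]
#[repr(packed)]
struct PackedRow {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

/// Read 8 little-endian bytes starting at offset `o`.
fn le8(rec: &[u8; 44], o: usize) -> [u8; 8] {
    let mut b = [0u8; 8];
    b.copy_from_slice(&rec[o..o + 8]);
    b
}

/// Copy one 44-byte packed little-endian record into the padded struct,
/// field by field (illustrative only).
fn unpack(rec: &[u8; 44]) -> AlignedRow {
    let mut a1 = [0u8; 4];
    a1.copy_from_slice(&rec[0..4]);
    AlignedRow {
        a1: u32::from_le_bytes(a1),
        a2: f64::from_le_bytes(le8(rec, 4)),
        a3: f64::from_le_bytes(le8(rec, 12)),
        a4: f64::from_le_bytes(le8(rec, 20)),
        a5: f64::from_le_bytes(le8(rec, 28)),
        a6: u64::from_le_bytes(le8(rec, 36)),
    }
}

fn main() {
    println!(
        "repr(C): {} bytes, repr(packed): {} bytes",
        size_of::<AlignedRow>(),
        size_of::<PackedRow>()
    );

    // Build one packed record by hand and convert it.
    let mut rec = [0u8; 44];
    rec[0..4].copy_from_slice(&1u32.to_le_bytes());
    rec[4..12].copy_from_slice(&1.0f64.to_le_bytes());
    let row = unpack(&rec);
    println!("a1 = {}, a2 = {}", row.a1, row.a2);
}
```

When the in-memory struct is packed like the file's records, there is nothing to unpack and the bytes can be used as they are, which is the direct read described above. One caveat with packed structs in Rust: you cannot take references to their fields, so read them by value.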