
H5Dwrite execution time in Windows increases in subsequent runs when a variable is written with chunking and compression

Open · abhibaruah opened this issue 1 year ago · 1 comment

Hello all,

I am writing a dataset of size 2000 x 512 x 512 with chunking and deflate compression (chunk size = 20 x 10 x 10, deflate level = 3), and timing the call to H5Dwrite. I run this code in a for loop of 10 iterations, deleting the created .h5 file at the end of every iteration.

I have noticed that the time taken by H5Dwrite increases with each successive iteration. The issue disappears if I remove the chunking and compression. I also tried the reproduction code with a smaller dataset, but could not reproduce the issue.

The issue occurs only on Windows, and I can reproduce it with both HDF5 1.10.10 and 1.10.11. The reproduction C++ code is below. Here is the output from the program:

In main
Index: 0
Before H5Dwrite
After H5Dwrite
Execution time: 11.0488 seconds
Index: 1
Before H5Dwrite
After H5Dwrite
Execution time: 11.9102 seconds
Index: 2
Before H5Dwrite
After H5Dwrite
Execution time: 13.3122 seconds
Index: 3
Before H5Dwrite
After H5Dwrite
Execution time: 15.4624 seconds
Index: 4
Before H5Dwrite
After H5Dwrite
Execution time: 21.7152 seconds
Index: 5
Before H5Dwrite
After H5Dwrite
Execution time: 25.9505 seconds
Index: 6
Before H5Dwrite
After H5Dwrite
Execution time: 27.9015 seconds
Index: 7
Before H5Dwrite
After H5Dwrite
Execution time: 32.6089 seconds
Index: 8
Before H5Dwrite
After H5Dwrite
Execution time: 35.6063 seconds
Index: 9
Before H5Dwrite
After H5Dwrite
Execution time: 37.548 seconds

P.S.: I saw similar behavior in netCDF and suspect it is related, since netCDF4 uses HDF5 under the hood. I have filed a netCDF ticket at https://github.com/Unidata/netcdf-c/issues/2750, but no one has responded there yet.

Kindly take a look at my reproduction script and let me know whether I am doing something wrong or this is a bug in HDF5.

Platform (please complete the following information)

  • HDF5 version (if building from a maintenance branch, please include the commit hash) : 1.10.10, 1.10.11
  • OS and version : Windows 10
  • Compiler and version : Visual Studio 2022 v17
  • Build system (e.g. CMake, Autotools) and version
  • Any configure options you specified
#include "hdf5.h"
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <chrono>
#include <array>
#include <random>

#define FILE            "test_file_CPP.h5"
#define DATASET         "DS1"
#define DIM0            2000
#define DIM1            512
#define DIM2            512
#define CHUNK0          20
#define CHUNK1          10
#define CHUNK2          10

void execution(double* arr)
{
    hid_t   file, space, dset, dcpl;                    /* Handles */
    herr_t  status;
    hsize_t dims[3]  = { DIM0, DIM1, DIM2 };
    hsize_t chunk[3] = { CHUNK0, CHUNK1, CHUNK2 };


    /*
     * Create a new file using the default properties.
     */
    file = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /*
     * Create the dataset creation property list, and set the chunk
     * size.
     */
    dcpl = H5Pcreate(H5P_DATASET_CREATE);

    status = H5Pset_chunk(dcpl, 3, chunk);
    status = H5Pset_deflate(dcpl, 3);


    /*
     * Create the dataspace, with the maximum size equal to the
     * current size.
     */
    space = H5Screate_simple(3, dims, dims);

    /*
     * Create the chunked dataset.
     */
    dset = H5Dcreate(file, DATASET, H5T_NATIVE_DOUBLE, space, H5P_DEFAULT, dcpl,
        H5P_DEFAULT);

    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;

    start = std::chrono::high_resolution_clock::now();
    /*
     * Write the data to the dataset.
     */
    std::cout << "Before H5Dwrite" << std::endl;
    status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
        arr);
    std::cout << "After H5Dwrite" << std::endl;

    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;
    double durationInSeconds = duration.count();
    std::cout << "Execution time: " << durationInSeconds << " seconds" << std::endl;

    /*
     * Close and release resources.
     */
    status = H5Dclose(dset);
    status = H5Sclose(space);
    status = H5Pclose(dcpl);
    status = H5Fclose(file);
}

int main() {

    // Dynamically allocate the 3D array (stored contiguously) and fill it
    double* arr = new double[DIM0 * DIM1 * DIM2];
    for (int i = 0; i < DIM0 * DIM1 * DIM2; i++) {
        arr[i] = 5.0;
    }

    std::cout << "In main" << std::endl;

    for (int i = 0; i < 10; i++)
    {
        std::cout << "Index: " << i << std::endl;
        execution(arr);
        remove(FILE);
    }

    delete[] arr;
}

abhibaruah avatar Nov 16 '23 19:11 abhibaruah

I also discovered by accident that this issue happens only for a chunk size of {20, 10, 10}. When I changed the chunk size to {50, 30, 10}, I could no longer reproduce the issue.

abhibaruah avatar Mar 20 '24 15:03 abhibaruah