matio
Compression failure caused by incorrect ChunkSize
In the Mat_H5GetChunkSize function in mat73.c:
static void
Mat_H5GetChunkSize(size_t rank, hsize_t *dims, hsize_t *chunk_dims)
{
    hsize_t i, j, chunk_size = 1;

    /* For each dimension, pick the largest power of two that fits the
     * dimension, while keeping the total number of elements per chunk
     * at or below 4096. */
    for ( i = 0; i < rank; i++ ) {
        chunk_dims[i] = 1;
        for ( j = 4096 / chunk_size; j > 1; j >>= 1 ) {
            if ( dims[i] >= j ) {
                chunk_dims[i] = j;
                break;
            }
        }
        chunk_size *= chunk_dims[i];
    }
}
It appears that the intention of the code is to find a chunk size that maximizes compression efficiency. In practice, however, the code above can roughly double the compressed file size compared to a better-fitting chunk size.
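For context: in HDF5, compression filters such as deflate are applied to each chunk independently, so the chunk shape picked here directly determines how the data is split into separately compressed blocks. Below is a minimal, self-contained sketch (plain HDF5 C API, not matio code) of writing such a compressed, chunked dataset; the 17 x 1000 shape and the (16, 512) chunk shape are simply the values discussed in this issue.

```c
#include <hdf5.h> /* plain HDF5 C API; link with -lhdf5 */

int main(void)
{
    hsize_t dims[2] = {17, 1000};
    hsize_t chunk_dims[2] = {16, 512}; /* one of the chunk shapes discussed below */
    static double data[17][1000];      /* dummy payload */

    hid_t file = H5Fcreate("demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* Compression in HDF5 requires a chunked layout; the deflate filter is
     * then applied to each chunk independently. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk_dims);
    H5Pset_deflate(dcpl, 4);

    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```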
Suppose you have a dataset with dimensions (17, 1000) and you are storing it using HDF5 with compression enabled. The chunk size obtained from the code above is (16, 512).
With a chunk size of (16, 512), the data is divided into chunks of 16 x 512 elements. Since the first dimension is 17, you need 2 chunks along that axis, and since the second dimension is 1000, you also need 2 chunks along that axis, so 4 chunks in total to store the data. However, because the chunk shape is not an exact fit for the data dimensions, the chunks along the edges contain a lot of unused space, which can increase the file size.
In the second case, with a chunk size of (9, 512), you still need 4 chunks in total to store the data, but each chunk occupies less space. This results in more efficient storage utilization and a smaller file size than in the first case.
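The arithmetic behind this comparison can be reproduced with a small standalone helper (a sketch; the dataset and chunk shapes are the ones from the example above):

```c
#include <stdio.h>

/* Sketch: for a given 2-D dataset shape and chunk shape, count how many
 * chunks are needed and how many elements those chunks allocate in total,
 * compared with the number of elements actually present. */
static void
chunk_overhead(const unsigned long long dims[2], const unsigned long long chunk[2])
{
    unsigned long long nchunks0 = (dims[0] + chunk[0] - 1) / chunk[0]; /* ceil */
    unsigned long long nchunks1 = (dims[1] + chunk[1] - 1) / chunk[1]; /* ceil */
    unsigned long long allocated = nchunks0 * nchunks1 * chunk[0] * chunk[1];
    unsigned long long used = dims[0] * dims[1];

    printf("chunk (%llu, %llu): %llu chunks, %llu elements allocated for %llu elements of data\n",
           chunk[0], chunk[1], nchunks0 * nchunks1, allocated, used);
}

int main(void)
{
    const unsigned long long dims[2] = {17, 1000};
    const unsigned long long a[2] = {16, 512}; /* first case  -> 4 chunks, 32768 elements allocated */
    const unsigned long long b[2] = {9, 512};  /* second case -> 4 chunks, 18432 elements allocated */

    chunk_overhead(dims, a);
    chunk_overhead(dims, b);
    return 0;
}
```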
In practical testing, when continuously appending data with dimensions (17, 1000), the sizes of the two files (one written with each chunk size) differ by approximately a factor of two, consistent with the reasoning above.
Thanks for bringing this topic up. I evaluated the code of Mat_H5GetChunkSize and also compared it with the auto-chunk feature of h5py. I see that Mat_H5GetChunkSize always sets the chunk dimensions to powers of 2, with a maximum of 4096 elements per chunk. This indeed might be inappropriate for certain, perhaps many, cases.
> Suppose you have a dataset with dimensions (17, 1000) and you are storing it using HDF5 with compression enabled. The chunk size obtained from the code above is (16, 512).

It is (16, 256), right? But it does not affect your follow-up reasoning.
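For reference, the loop can be reproduced outside of matio to check this (a standalone sketch; hsize_t is replaced by a plain 64-bit unsigned integer so it compiles without the HDF5 headers):

```c
#include <stdio.h>
#include <stdint.h>

typedef uint64_t hsize_t; /* stand-in for HDF5's hsize_t */

/* Copy of the chunk-size heuristic quoted above. */
static void
Mat_H5GetChunkSize(size_t rank, hsize_t *dims, hsize_t *chunk_dims)
{
    hsize_t i, j, chunk_size = 1;

    for ( i = 0; i < rank; i++ ) {
        chunk_dims[i] = 1;
        for ( j = 4096 / chunk_size; j > 1; j >>= 1 ) {
            if ( dims[i] >= j ) {
                chunk_dims[i] = j;
                break;
            }
        }
        chunk_size *= chunk_dims[i];
    }
}

int main(void)
{
    hsize_t dims[2] = {17, 1000};
    hsize_t chunk_dims[2];

    Mat_H5GetChunkSize(2, dims, chunk_dims);
    /* Prints 16 x 256: after the first dimension gets 16, the second
     * dimension is limited to 4096 / 16 = 256. */
    printf("%llu x %llu\n", (unsigned long long)chunk_dims[0],
                            (unsigned long long)chunk_dims[1]);
    return 0;
}
```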
What is your proposal?
- Improve Mat_H5GetChunkSize in the same way as guess_chunk of h5py (see the sketch after this list).
- Offer a public API to manually set the chunk size for datasets of HDF5 MAT variables.
- Keep it as is, but document it better.
- Increase the maximal chunk size from 4096 to some higher value.
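For the first option, here is a rough sketch of what a guess_chunk-style heuristic could look like in C. The constants (16 KiB base, 8 KiB minimum, 1 MiB maximum chunk size in bytes) and the round-robin halving are modeled loosely on h5py's guess_chunk; this is only an illustration of the approach, not a drop-in patch for mat73.c.

```c
#include <math.h>    /* pow, log10; link with -lm */
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BASE (16 * 1024)   /* starting target chunk size in bytes */
#define CHUNK_MIN  (8 * 1024)    /* smallest target chunk size in bytes */
#define CHUNK_MAX  (1024 * 1024) /* largest allowed chunk size in bytes */

/* Sketch of a guess_chunk-style heuristic: scale a target chunk size in
 * bytes with the total dataset size, then repeatedly halve the chunk
 * dimensions round-robin until the chunk is close to that target.
 * typesize is the size of one element in bytes. */
static void
guess_chunk_like(size_t rank, const uint64_t *dims, size_t typesize,
                 uint64_t *chunk_dims)
{
    double dset_bytes = (double)typesize;
    size_t i, idx = 0;

    for ( i = 0; i < rank; i++ ) {
        chunk_dims[i] = dims[i] > 0 ? dims[i] : 1;
        dset_bytes *= (double)chunk_dims[i];
    }

    /* Scale the target chunk size with the dataset size (about 1 MiB of
     * data -> CHUNK_BASE, 10 MiB -> 2*CHUNK_BASE, ...), clamped to
     * [CHUNK_MIN, CHUNK_MAX]. */
    double target = CHUNK_BASE * pow(2.0, log10(dset_bytes / (1024.0 * 1024.0)));
    if ( target > CHUNK_MAX )
        target = CHUNK_MAX;
    else if ( target < CHUNK_MIN )
        target = CHUNK_MIN;

    for ( ;; ) {
        double chunk_bytes = (double)typesize;
        uint64_t nelems = 1;
        for ( i = 0; i < rank; i++ ) {
            chunk_bytes *= (double)chunk_dims[i];
            nelems *= chunk_dims[i];
        }
        if ( nelems == 1 )
            break; /* cannot shrink any further */
        if ( chunk_bytes <= CHUNK_MAX &&
             (chunk_bytes < target || (chunk_bytes - target) / target < 0.5) )
            break; /* close enough to the target */
        /* Halve one dimension per iteration, cycling through the dimensions. */
        chunk_dims[idx % rank] = (chunk_dims[idx % rank] + 1) / 2;
        idx++;
    }
}
```

For double-precision data with dimensions (17, 1000), this sketch ends up with a (5, 250) chunk, i.e. a chunk shape that adapts to the element size and the overall dataset size instead of using a fixed 4096-element cap with power-of-2 dimensions.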
Thanks again for your feedback.
@allwaysFindFood Any feedback would be appreciated.