HDF.PInvoke
HDF dataset strides storage (potential enhancement?)
@gheber Should I post this to [HDF-forum]?
The original element datatype of a dataset is stored by HDF5 and is required to identify how to read the elements back into memory. The HDF5 API allows providing the dataset values as arbitrary types; internally, the conversion is done transparently for the user. The dimension lengths are also stored as an obligatory property of the dataset.
In order to be compatible with arbitrary applications, a third piece of information would be necessary, next to the dimension lengths and the element type, to store and manage all properties of the array (let's call the dataset this way here): the strides of the dimensions. This would enable HDF5 to handle row-major versus column-major order, and also any other custom stride scheme the dataset might be stored in.
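To make the idea concrete, here is a minimal sketch in plain Python (not HDF5 API) of how strides describe the two common storage conventions for the same dimension lengths:

```python
# Sketch: byte strides for a dataset with dimensions (2, 3) and
# 8-byte elements under the two common storage conventions.

def row_major_strides(dims, itemsize):
    """Strides in bytes; the last dimension changes fastest (C order)."""
    strides = []
    acc = itemsize
    for d in reversed(dims):
        strides.insert(0, acc)
        acc *= d
    return strides

def column_major_strides(dims, itemsize):
    """Strides in bytes; the first dimension changes fastest (Fortran order)."""
    strides = []
    acc = itemsize
    for d in dims:
        strides.append(acc)
        acc *= d
    return strides

print(row_major_strides((2, 3), 8))     # [24, 8]
print(column_major_strides((2, 3), 8))  # [8, 16]
```

With the strides stored next to the dimension lengths, a reader could reconstruct the element layout regardless of which convention the writer used.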
The HDF5 dataset documentation (I cannot find the exact place right now) points out the difference in storage scheme when using different languages: datasets stored from ILNumerics / FORTRAN will have their dimensions flipped when read in C, and vice versa.
Currently, I see the following options to handle this situation:
- The user of HDF5 needs to keep the actual dimensions in mind. Dimensions need to be flipped manually when reading data that was stored using the other convention.
- The hyperslab feature seems to be flexible enough to read from the dataset storage using arbitrary strides (?). Drawback: the flexibility comes at the price of a performance drop compared to sequential memory reads.
HDF5 stores the dataset in _row-major_ order, no matter what. Currently, a user who wants the data to be readable under any storage convention could / must use an attribute to attach this information to the dataset. The necessary dimension permutations must still be implemented in a layer outside of HDF5. There is no standard place / attribute name recommended for this (?).
I am interested in your experiences regarding this issue and in your recommendation on how to handle different storage schemes. Should we consider HDF5 as 'being row-major' only? Or is it rather meant to be a general storage API, capable of handling any storage scheme?
Maybe there are already plans to incorporate this extra bit of information into HDF? Having the storage scheme of the dataset available would at least help the user of HDF5 identify the actions necessary to clean up any dimension-order mess. The attribute option is a reasonable fallback, but an official 'standard' place to store / read the strides could help compatibility significantly. I am not aware of all potential side effects in HDF5, but defining a standard attribute might be possible without introducing overly large problems? [For an optimal solution, HDF would auto-permute the dimensions, but this would obviously be more demanding.]
What do you think?
We might want to split this into pieces... My first comment is that there is only one "storage scheme". If there is documentation suggesting otherwise, it's erroneous and we should correct it. All dataset elements are stored in the file in C order; in other words, the "last" dimension changes fastest. This is true for the contiguous layout, but also for the arrangement of dataset elements in a chunk.
In other words, there is no way to tell if a dataset was written by a C or Fortran program. It always ends up in the file in C-order.
How can one reduce the performance penalty for applications that expect FORTRAN order? There's no general answer or magic bullet. Appropriate chunking might be one strategy:
- If a file/dataset is only or mostly accessed by FORTRAN-order applications, you might consider chunking the dataset layout with "column-oriented" chunks.
- If you need it both ways, try to find a chunk size which doesn't favor C or FORTRAN, but doesn't overly penalize either.
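A back-of-the-envelope sketch (plain Python, not HDF5 API) of why the chunk shape matters: reading one full column of a row-major dataset touches one chunk per block of `chunk_rows` rows, so tall, "column-oriented" chunks reduce the number of chunks visited per column read. The dataset and chunk sizes below are made up for illustration.

```python
# For a row-major (n_rows x n_cols) dataset, a single-column read
# crosses one chunk per `chunk_rows` rows of the dataset.
def chunks_touched_by_column_read(n_rows, chunk_rows):
    return -(-n_rows // chunk_rows)  # ceiling division

# 1000x1000 dataset:
print(chunks_touched_by_column_read(1000, 10))    # 10x1000 "row-oriented" chunks: 100
print(chunks_touched_by_column_read(1000, 1000))  # 1000x10 "column-oriented" chunks: 1
```

The compromise case in the second bullet corresponds to square-ish chunks, which cap the number of chunks touched for both row and column reads without fully favoring either.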
> In other words, there is no way to tell if a dataset was written by a C or Fortran program. It always ends up in the file in C-order.
This is what makes it hard for, let's say, a FORTRAN developer to use HDF5 for distributing data and trust that it comes out in the same shape, no? The recipient would need to know that it was written in FORTRAN and handle the dimension flipping accordingly. (no rant - am just trying to figure it out)
My apologies for creating a lot of confusion around this issue. I will try to explain it once more, and maybe resolve the issue.
For simplicity, let's assume that we are writing datasets with a contiguous storage layout. The dataset elements are always written to the file in the order in which they are stored in memory. (Order is defined as increasing virtual address order.) That means, for example, that a C application writes the elements of a two-dimensional dataset into the file in row-major order, i.e., the elements of the first row, followed by the elements of the second row, etc. It also means that a FORTRAN application would write a two-dimensional array in column-major order, i.e., the elements of the first column, followed by the elements of the second column, etc.
In which order does a FORTRAN application see the data written by a C-application, and, vice versa, how does a C application see the data written by a FORTRAN application? The correct answer is: "As it was written by the other application." What does that mean?
Let me illustrate that with an example. Suppose a C-app writes a 2x3 matrix into a 2D dataset in an HDF5 file. On disk, the elements appear in this order: 00 01 02 10 11 12 (0-based matrix indices). When a FORTRAN application reads this dataset, it will read them in this very order. What is the shape returned by `SUBROUTINE h5dget_space_f(dataset_id, dataspace_id, hdferr)`? The answer is: 3x2.
You might say, "A 2x3 matrix is not a 3x2 matrix," and I would agree with you. The hidden and problematic assumption is that HDF5 knows what a matrix is. HDF5 doesn't. All it knows are sequences of dataset elements, and all it guarantees is that it maintains that order and doesn't alter your data. (Otherwise it's a bug, or you've applied a filter or a dataset transformation.)
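The 2x3 / 3x2 example can be reproduced in plain Python (a sketch of the index arithmetic only, not of HDF5 itself):

```python
# The C application's 2x3 matrix; values encode (row, col) as 10*r + c.
matrix = [[0, 1, 2],
          [10, 11, 12]]

# Row-major linearization: the element order as it lands in the file.
flat = [v for row in matrix for v in row]
print(flat)  # [0, 1, 2, 10, 11, 12]

# A Fortran reader sees the dataspace as 3x2 and fills its array
# column-by-column (first index fastest): element (i, j) sits at flat[i + 3*j].
fortran_view = [[flat[i + 3 * j] for j in range(2)] for i in range(3)]
print(fortran_view)  # [[0, 10], [1, 11], [2, 12]] -- the transpose of `matrix`
```

The byte sequence in the file is identical for both readers; only the interpretation of the shape differs, which is exactly the point made above.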
Does that make (more) sense?
The key is this statement:
> The hidden and problematic assumption is that HDF5 knows what a matrix is. HDF5 doesn't.
Indeed, this view is new to me and makes the current situation clearer. My question is: "Does HDF5 want to know?" Reasoning: it appears that HDF5 datasets are often used to store matrices and arrays of higher dimensionality. Also, there are several interfaces to HDF5:
- C, h5py, HDF5.PInvoke, Java, ... [row-major]
- FORTRAN, Matlab, ILNumerics, Julia, Octave, ... [column-major]
Currently, when exchanging a matrix between those interfaces / languages, we must accept that the resulting matrix potentially appears scrambled unless we flip around the dimensions we read from the dataset. AFAIK, all interfaces for column-major order do this flip automatically today. Even if we manage to flip the dimensions, we still end up with a transposed matrix and will have to correct this or live with the transposed version. Whether we need to do all this is a matter of some magic meta information which we need to acquire from the creator of the dataset. She must at least provide us with a hint of the API used to store the dataset. If the storage scheme of the writing language differs from the storage scheme we are on, then we must perform a permutation of the dimensions.
Usually, this metadata would today be stored in an attribute of the dataset or of the HDF file. However, the name of such an attribute is arbitrary, so we still need to exchange documentation about the attribute. This fact renders the attribute solution ... well, not completely useless, but a lot less useful ... Obviously, the choice of the interface used to access HDF5 currently matters a lot. My request or suggestion is about a way to change this and make HDF5 more compatible between individual languages.
One way could be to introduce an 'official' attribute on datasets which creators / API authors could use to store the strides of a dataset. Another, less flexible, less compatible, and more error-prone option would be to add an 'official' attribute used to store the type of the writing API. This way, authors of APIs could react accordingly, perform all necessary permutations, and relieve the end user from having to deal with this issue.
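A purely hypothetical sketch of the second option: neither the attribute name nor its values below are part of any HDF5 standard; they only illustrate how API authors could reconcile shapes if such a convention existed.

```python
# "writer_storage_order" is an invented attribute name, not an HDF5 standard.
WRITER_ORDER_ATTR = "writer_storage_order"

def reconcile_shape(stored_shape, writer_order, my_order):
    """Flip the dimension order if the writer used the other convention."""
    if writer_order == my_order:
        return tuple(stored_shape)
    return tuple(reversed(stored_shape))

# A column-major reader (e.g. ILNumerics) opening a dataset whose writer
# declared itself row-major would flip the stored dimensions:
print(reconcile_shape((100, 20), "row-major", "column-major"))  # (20, 100)
```

A strides attribute (the first option) would subsume this: the reader could derive the writer's convention, and any custom layout, from the strides alone.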
Maybe there is an official way established already and I have only missed it?
There's no official way as far as I know, but I'll check with the folks here.
There's very little appetite on this end for anything "official." A first step might be to create and circulate an RFC, and see how much or how little enthusiasm it creates.
Is there any news on this issue?
Is there a way to find out whether data has been stored by a C or Fortran program, and then choose whether or not to transpose the dataset after reading it?
As stated in the documentation:
> When a C application reads data stored from a Fortran program, the data will appear to be transposed due to the difference in the C and Fortran storage orders. For example, if Fortran writes a 4x6 two-dimensional dataset to the file, a C program will read it as a 6x4 two-dimensional dataset into memory. The HDF5 C utilities h5dump and h5ls will also display transposed data, if data is written from a Fortran program.
There is a good explanation when you scroll down to 7.3.2.5. C versus Fortran Dataspaces.
> When a Fortran application describes a dataspace to store an array as A(20,100), it specifies the value of the first dimension to be 20 and the second to be 100. Since Fortran stores data by columns, the first-listed dimension with the value 20 is the fastest-changing dimension and the last-listed dimension with the value 100 is the slowest-changing. In order to adhere to the HDF5 storage convention, the HDF5 Fortran wrapper transposes dimensions, so the first dimension becomes the last. The dataspace dimensions stored in the file will be 100,20 instead of 20,100 in order to correctly describe the Fortran data that is stored in 100 columns, each containing 20 elements.
>
> When a Fortran application reads the data back, the HDF5 Fortran wrapper transposes the dimensions once more, returning the first dimension to be 20 and the second to be 100, describing correctly the sizes of the array that should be used to read data in the Fortran array A(20,100).
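The double transposition described in the quoted documentation can be sketched in a few lines of plain Python (an illustration of the shape bookkeeping, not the Fortran wrapper itself):

```python
# The Fortran wrapper reverses the dimension order on write and again
# on read, so the Fortran application always sees its original shape.
fortran_shape = (20, 100)                    # A(20,100) as declared in Fortran
file_shape = tuple(reversed(fortran_shape))  # stored in the file as 100,20
read_back = tuple(reversed(file_shape))      # reversed once more on read

print(file_shape, read_back)  # (100, 20) (20, 100)
```

A C reader, which performs no such reversal, sees `file_shape` directly, which is why the data appears transposed across the language boundary.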
To me that means you don't have to be concerned about the order in which the data is saved to the file. Just treat the data as transposed when you switch to a language with a different memory layout.
Nevertheless, I still see an issue when you try to label the axes of a dataset by adding an attribute to it. The dimensions get reversed but the labels don't:
- seen by Fortran: Dataset: [10, 20], Labels-Attribute: ["X-axis", "Y-axis"]
- seen by C: Dataset: [20, 10], Labels-Attribute: ["X-axis", "Y-axis"]
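This pitfall is easy to demonstrate in plain Python. "Labels-Attribute" here is a user-defined attribute, not an HDF5 feature, and only the dataspace dimensions are reversed by the Fortran wrapper; the attribute is stored as-is.

```python
dims_as_seen_by_c = (20, 10)                # Fortran wrote the dataset as [10, 20]
labels_attribute = ["X-axis", "Y-axis"]     # stored unreversed by the Fortran writer

# A C reader pairing dims with labels naively gets the wrong association:
print(list(zip(dims_as_seen_by_c, labels_attribute)))
# [(20, 'X-axis'), (10, 'Y-axis')] -- wrong: the size-20 dimension is the Y-axis

# The correct pairing requires reversing the labels along with the dims:
print(list(zip(dims_as_seen_by_c, reversed(labels_attribute))))
# [(20, 'Y-axis'), (10, 'X-axis')]
```

So any per-dimension metadata attached via attributes must be permuted by the same convention-aware layer that handles the dimension flip, which again argues for an agreed-upon place to record the writer's convention.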