Reading dataset of unknown type

Open lwpiotr opened this issue 3 years ago • 9 comments

Hello,

I want to read an HDF5 file with unknown structure. AFAIK, HighFive does not provide any iteration mechanism, so I iterate myself. Still, when I encounter a dataset, I don't know what datatype it is. Can I read this dataset with HighFive without knowing the data type at compile time? I've tried passing std::vector<std::any>, but HighFive says that this type is unsupported. Basically, I am aiming for functionality similar to what I have in Python with h5py.

lwpiotr avatar Apr 29 '21 13:04 lwpiotr

Hello @lwpiotr

I'm not sure what you mean by iteration mechanism. In HighFive (as in HDF5) you cannot read a dataset without knowing its type. The API requires you to pass the correct datatype (see also H5D_READ). We don't support std::any and I'm not even sure how one would do this in the context of HDF5. The way I would go about this is to first call DataSet::getDataType() to query the dataset type and then read the dataset with this type (e.g. into a vector of that type).

Hope this makes sense?

ohm314 avatar Apr 29 '21 13:04 ohm314
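
The "query the type first, then read with that type" approach can be sketched without HighFive at all. The block below is only an illustration under stated assumptions: `ElementType`, `RawDataSet`, `read_as`, `read_any` and `make_raw` are hypothetical stand-ins, not HighFive or HDF5 API. The point is that the type query happens at run time, while each branch of the switch calls a read that is typed at compile time.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <variant>
#include <vector>

// Hypothetical runtime type tag, standing in for the result of a type query.
enum class ElementType { Int32, Float64 };

// Hypothetical stand-in for a dataset: a type tag plus untyped bytes.
struct RawDataSet {
    ElementType type;
    std::vector<unsigned char> bytes;
};

// Closed set of element types this sketch knows how to read.
using Column = std::variant<std::vector<std::int32_t>, std::vector<double>>;

// Typed read: T must be known at compile time, as with a typed read call.
template <typename T>
std::vector<T> read_as(const RawDataSet& ds) {
    assert(ds.bytes.size() % sizeof(T) == 0);
    std::vector<T> out(ds.bytes.size() / sizeof(T));
    if (!out.empty()) {
        std::memcpy(out.data(), ds.bytes.data(), ds.bytes.size());
    }
    return out;
}

// Runtime dispatch: inspect the tag, then call the matching typed read.
Column read_any(const RawDataSet& ds) {
    switch (ds.type) {
        case ElementType::Int32:   return read_as<std::int32_t>(ds);
        case ElementType::Float64: return read_as<double>(ds);
    }
    throw std::runtime_error("unsupported element type");
}

// Helper for building a stand-in dataset from typed values.
template <typename T>
RawDataSet make_raw(ElementType type, const std::vector<T>& values) {
    RawDataSet ds{type, std::vector<unsigned char>(values.size() * sizeof(T))};
    if (!values.empty()) {
        std::memcpy(ds.bytes.data(), values.data(), ds.bytes.size());
    }
    return ds;
}
```

Every supported type needs its own case, so the set of readable types is fixed when the code is compiled, even though the choice among them is made while the program runs.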

  1. Regarding iteration, I mean, in the best case, something like for(object : group), both recursive and non-recursive. It seems that https://github.com/ess-dmsc/h5cpp has it for the non-recursive case. The worst case would be for group to have some begin() and end() or something similar, to help the iteration. Another way would be visit() or visititems() like in h5py. Could it be a feature request?

  2. Regarding the data type. OK, I understand. I'll do it manually. Too much of Python :) Still, I wonder if it wouldn't be possible with some C++11 or 17 features. Like std::any, or perhaps something else. I know very little in this matter.

lwpiotr avatar Apr 29 '21 16:04 lwpiotr

For iteration, this seems like something sensible. But what would be returned? A std::variant over HighFive::Group and HighFive::Dataset? Separate iteration methods for both?

for (auto& g: file.recurse_groups()) {
     …;
}

?

matz-e avatar Apr 30 '21 07:04 matz-e
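
One possible shape for the std::variant idea, as a minimal sketch: the `Group` and `DataSet` structs below are hypothetical stand-ins that only carry a path (they are not the HighFive classes), and `Object`, `describe` and `dataset_paths` are names made up for illustration. It shows how a caller could handle whatever an iterator yields, either with std::visit or with std::get_if.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for HighFive::Group and HighFive::DataSet;
// only the path is modelled here.
struct Group   { std::string path; };
struct DataSet { std::string path; };

// What a hypothetical iterator could yield for each child of a group.
using Object = std::variant<Group, DataSet>;

// Describe one object by dispatching on the alternative it holds.
std::string describe(const Object& obj) {
    assert(!obj.valueless_by_exception());
    return std::visit([](const auto& o) -> std::string {
        using T = std::decay_t<decltype(o)>;
        if constexpr (std::is_same_v<T, Group>) {
            return "group " + o.path;
        } else {
            return "dataset " + o.path;
        }
    }, obj);
}

// Keep only the datasets from a flat listing, skipping groups.
std::vector<std::string> dataset_paths(const std::vector<Object>& objects) {
    std::vector<std::string> paths;
    for (const auto& obj : objects) {
        if (const auto* ds = std::get_if<DataSet>(&obj)) {
            paths.push_back(ds->path);
        }
    }
    return paths;
}
```

A single iteration method returning the variant would cover both cases; separate methods per object type would avoid the dispatch at the cost of a larger interface.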

@lwpiotr Maybe it would help if you wrote what your actual intent is. Reading between the lines, it seems that you want to loop over all (or a selection of) datasets, without specifying them, and do some operation on the data.

It is true that, for the first part, some way of doing this might be nice in HighFive. I guess it should be some equivalent of

for path in iterator(myfile[root], root):

with h5py. Personally I would welcome a PR in this direction.

Second, in order to say if things can be done to your data using HighFive you would have to specify what you want to do to what kind of data.

tdegeus avatar Apr 30 '21 07:04 tdegeus

I have an HDF5 file with "Events", where events have potentially unknown structure: some groups/datasets will be added to some events, but not to others. My intention was to be able to read the data (meaning datasets) of a whole event (or several events) into memory, without knowing what is inside. With h5py it is relatively simple: loop over the file, check if the current node is a dataset, and if it is, store it in, for example, a list or a dictionary.

I am writing an I/O interface, which should be transparent to the user, regardless of the data format. For the moment the back-end is HDF5, but in the future, some other format may be added. I would like the user to be able to get something like map(string, data) from the whole "event" (simply, a dictionary of paths/variable names and data), regardless of what is inside the event, and without thinking about what the data format is.

lwpiotr avatar Apr 30 '21 07:04 lwpiotr

Not worrying too much about the type of data is good on the one hand, but on the other hand worrying about the type of data is exactly what makes compiled code (much) faster than interpreted code (like Python). So, you will have to decide about the data, and based on that decision you might end up with more or less casting and copying, which could be costly. So again, the type of data matters here. If you are treating matrices or arrays, then I suggest that you choose a nice, common library that will take most of the trouble away for you and for the end user. If the data are quite hybrid, things might end up trickier.

But ok, I know still very little about your problem, so I might still be ill-advising you ;)

tdegeus avatar Apr 30 '21 08:04 tdegeus

@lwpiotr, interesting use case. As others pointed out, this currently isn't possible, but one can work on it. If you feel confident enough to contribute some code, we would be happy to work with you on this! I could imagine some iterator that returns H5Objects that you would still have to first query for type and then cast down, or maybe try @matz-e's idea with std::variant. It won't be as dynamic as in h5py, I'm afraid.

ohm314 avatar Apr 30 '21 08:04 ohm314

I always promise to contribute some code, then end up with no time to fulfil the promise :/ Anyway, I understand that the result can't be as dynamic as with h5py, as C++ is still a statically typed language. On the other hand, I guess h5py was coded in C or C++, so they must have figured out a way to provide all the datasets and groups in the HDF5 file to a Python user, without knowing anything a priori about the structure and data types. I am considering looking at their code.

Currently, for a fast result, I am using HighFive just to manually loop over objects and handle each dataset datatype case separately. I haven't yet figured out if I can construct compound types dynamically. Also, I am depth-limited to the number of loops that I write; the algorithm is not really recursive and I should figure out something to make it so. I am storing results in map<string, any>, which is far from perfect because to even print the "any", I need to know its datatype, so I end up with yet more ifs.

One feature request that perhaps is very easy to code would be: Currently there is a group method getObjectName() which accepts the object number. However, all other methods like getDataSet(), getObjType(), etc., want object name. Would it be possible for those methods to accept the number? I don't know HighFive internals, but perhaps not using the string with the name until it is really needed could improve the performance. For sure it would make such looping slightly clearer.

lwpiotr avatar Apr 30 '21 08:04 lwpiotr
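
Both problems mentioned above (the fixed loop depth and the extra ifs needed to print an "any") can be illustrated with a small sketch, assuming the set of possible dataset types is known in advance. Everything in it is hypothetical: `Node` is an in-memory stand-in for an HDF5 group or dataset, not a HighFive type, and `Value`, `collect`, `collect_all`, `to_text`, `dataset` and `group` are illustrative names. A recursive function removes the depth limit, and a std::variant in place of std::any lets std::visit pick the right printing code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

// Closed set of payload types, assumed to be known up front.
using Value = std::variant<std::vector<int>, std::vector<double>, std::string>;

// Hypothetical in-memory stand-in for an HDF5 node.
struct Node {
    std::string name;
    std::vector<Node> children;  // used when the node is a group
    bool is_dataset = false;
    Value data;                  // used when the node is a dataset
};

Node dataset(std::string name, Value data) {
    Node n;
    n.name = std::move(name);
    n.is_dataset = true;
    n.data = std::move(data);
    return n;
}

Node group(std::string name, std::vector<Node> children) {
    Node n;
    n.name = std::move(name);
    n.children = std::move(children);
    return n;
}

// Depth-first walk: recursion handles any nesting depth.
void collect(const Node& node, const std::string& prefix,
             std::map<std::string, Value>& out) {
    const std::string path = prefix + "/" + node.name;
    if (node.is_dataset) {
        out.emplace(path, node.data);
        return;
    }
    for (const Node& child : node.children) {
        collect(child, path, out);
    }
}

// Map every dataset below the root to its full path.
std::map<std::string, Value> collect_all(const Node& root) {
    std::map<std::string, Value> out;
    for (const Node& child : root.children) {
        collect(child, "", out);
    }
    return out;
}

// Format a value without the caller knowing which alternative it holds.
std::string to_text(const Value& v) {
    std::ostringstream os;
    std::visit([&os](const auto& x) {
        using T = std::decay_t<decltype(x)>;
        if constexpr (std::is_same_v<T, std::string>) {
            os << x;
        } else {
            for (std::size_t i = 0; i < x.size(); ++i) {
                if (i != 0) os << ' ';
                os << x[i];
            }
        }
    }, v);
    return os.str();
}
```

The trade-off compared with std::any is that every supported type has to be listed in `Value`; in return, the compiler checks that each alternative is handled wherever std::visit is used.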

Since I was getting close to closing this as being too ambitious, I'd like to highlight this part:

One feature request that perhaps is very easy to code would be: Currently there is a group method getObjectName() which accepts the object number. However, all other methods like getDataSet(), getObjType(), etc., want object name. Would it be possible for those methods to accept the number? I don't know HighFive internals, but perhaps not using the string with the name until it is really needed could improve the performance. For sure it would make such looping slightly clearer.

This is worth looking into.

1uc avatar May 05 '23 15:05 1uc