h5pyd
Speed improvements to loading HDF5 trees
In the nexusformat API, we load the entire HDF5 file tree by recursively walking through the groups in h5py, without reading in data values except for scalars and small arrays. On a local file, we can load files containing hundreds of objects without a significant time delay. For example, a file with 80 objects (groups, datasets, and attributes) takes 0.05s to load on my laptop. However, on h5pyd, the same load takes over 20s.
Loading all the items in an HDF5 group currently requires two, and sometimes three, GET requests per object, so there could be an improvement if all the metadata (shape, dtype, etc.) for an object were returned in a single call, and an even more significant one if all the items in a group could be returned with one GET request. In my tests, loading one group of 10 objects took 29 requests.
Binary data reads are fast, though.
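For reference, here is a minimal sketch of the kind of walk I'm describing, using plain h5py on a local file (the function name and size threshold are illustrative, not the actual nexusformat code): it recurses through the groups, records shape, dtype, and attributes for every object, and only reads values for scalars and small arrays.
import h5py

SMALL = 1000  # element-count threshold for "small" arrays; illustrative only

def walk(node):
    # Recursively collect metadata (and small values) for every item in a group
    tree = {}
    for name, item in node.items():
        entry = {"attrs": dict(item.attrs)}
        if isinstance(item, h5py.Group):
            entry["children"] = walk(item)
        else:
            entry["shape"], entry["dtype"] = item.shape, item.dtype
            if item.size <= SMALL:
                entry["value"] = item[()]  # defer reads of large datasets
        tree[name] = entry
    return tree

with h5py.File("sample.nxs", "r") as f:
    tree = walk(f)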
I've added some caching logic to the group class. Try out this latest checkin: https://github.com/HDFGroup/h5pyd/commit/19994179a7bcbc23304057647e2fa953f9ccf57c.
This is not a single-operation recursive load, but I saw a speedup of about 4x when walking the tree for the sample NeXus file. This was measured using the hsls.py script in the app directory.
@rayosborn - did you get a chance to try this out?
I have tested it, but I wasn't sure of the previous speeds because I forgot to do a proper timing before upgrading, so I need to revert to the old version to check. However, I don't think I saw a factor of four; it might have been a factor of two.
There will be some variability based on the latency between client and server. My testing was with a server running on the same LAN. Also, the test driver is different.
Did the NeXpy GUI need a lot of mods to work with h5serv? I could set it up in my environment.
I haven't made any changes to the NeXpy GUI yet. In the latest development version on my own clone of the nexusformat API, the nxremote branch has an added file, which subclasses the NXFile class for remote access. I was thinking of pushing this version to PyPI, since it is a test feature that only users with h5pyd would even be able to access. I'll let you know when I've done that.
If you push the branch to GitHub, I can just grab it from there.
How would I use it to list the contents of a NeXus file?
The nxremote branch has been published on my GitHub. You can load a file by typing:
>>> a=nxloadremote(filepath, domain='exfac.org', server='some.server:5000')
>>> print(a.tree)
The file path is the path relative to the data directory. The module converts that to a domain name. The top domain is currently 'exfac.org' to match the test repository.
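Roughly speaking, the conversion looks something like this (purely illustrative; the exact rule lives in the nxremote module, and the path below is made up):
# Illustrative only: reverse the path components under the data directory
# and append the top domain; the real nxremote code may differ in detail.
filepath = "scans/run1.nxs"          # hypothetical path relative to the data directory
stem = filepath.rsplit(".", 1)[0]    # drop the file extension
domain = ".".join(reversed(stem.split("/"))) + ".exfac.org"
print(domain)                        # -> run1.scans.exfac.org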
@rayosborn - some updates on this old issue...
By default, h5pyd.File(filepath) will return all the metadata for the domain in the request response. H5pyd caches this, so attribute reads and link accesses won't need to talk to the server. There is a limit of 500 on the number of objects fetched from the server, so that the GET request doesn't take an inordinate amount of time for domains with lots of attributes and/or links.
To compare the performance without the prefetch, you can use h5pyd.File(filepath, use_cache=False). This will return just the information on the root group.
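A quick way to see the difference is to time both modes against the same domain (the domain and endpoint below are placeholders; substitute your own):
import time
import h5pyd

DOMAIN = "sample.exfac.org"            # placeholder domain
ENDPOINT = "http://some.server:5000"   # placeholder server endpoint

def time_walk(use_cache):
    t0 = time.time()
    with h5pyd.File(DOMAIN, "r", endpoint=ENDPOINT, use_cache=use_cache) as f:
        # Touch every top-level object's attributes; with the prefetch these
        # reads come out of the cache instead of requiring separate GET requests.
        for name in f:
            attrs = f[name].attrs
            for key in attrs:
                _ = attrs[key]
    return time.time() - t0

print("with prefetch:    %.2fs" % time_walk(True))
print("without prefetch: %.2fs" % time_walk(False))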
Thanks, @jreadey. I can't look into this for a couple of weeks, but I plan to soon.