Speed improvements to loading HDF5 trees

Open rayosborn opened this issue 7 years ago • 9 comments

In the nexusformat API, we load the entire HDF5 file tree by recursively walking through the groups in h5py, without reading in data values except for scalars and small arrays. On a local file, we can load files containing hundreds of objects without a significant time delay. For example, a file with 80 objects (groups, datasets, and attributes) takes 0.05s to load on my laptop. However, on h5pyd, the same load takes over 20s.
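A rough sketch of the loading strategy described above, not the actual nexusformat code: the `walk` function, the `max_size` threshold, and the `FakeDataset` stand-in are all illustrative assumptions. It relies only on the mapping-style `items()` of groups and on the `shape`, `dtype`, and `size` attributes of datasets, which h5py objects also provide. Attributes are left out for brevity.

```python
def walk(group, path="", max_size=10):
    """Recursively collect metadata, reading values only for small datasets."""
    tree = {}
    for name, obj in group.items():
        full = f"{path}/{name}"
        if hasattr(obj, "items"):
            # Group-like object: record it and recurse into its members.
            tree[full] = {"kind": "group"}
            tree.update(walk(obj, full, max_size))
        else:
            # Dataset-like object: always keep shape and dtype,
            # but read the values only if the dataset is small.
            entry = {"kind": "dataset", "shape": obj.shape, "dtype": str(obj.dtype)}
            if obj.size <= max_size:
                entry["value"] = obj[()]
            tree[full] = entry
    return tree


class FakeDataset:
    """Hypothetical stand-in for an h5py dataset, used only for this demo."""

    def __init__(self, values, dtype="float64"):
        self._values = values
        self.shape = (len(values),)
        self.dtype = dtype
        self.size = len(values)

    def __getitem__(self, key):
        return self._values


# Plain dicts play the role of groups here because they also have items().
demo = {"entry": {"title": FakeDataset([1.0]), "data": FakeDataset(list(range(1000)))}}
tree = walk(demo)
print(sorted(tree))
```

With a real file, the same function could presumably be called on an open `h5py.File` object, since the file object behaves like its root group.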

A call to load all the items in an HDF5 group requires two GET requests, and sometimes three, for each object, so there could be an improvement if all the metadata (shape, dtype, etc.) for each object were returned in a single call, and an even more significant one if all the items in a group could be returned with one GET request. Loading one group of 10 objects took 29 requests in my tests.
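As a back-of-envelope check on these numbers, the snippet below estimates the cost per request. It assumes the load time is dominated by request round trips and that the 29-requests-per-10-objects ratio holds for the whole 80-object file; neither assumption was measured directly.

```python
objects = 80                  # objects in the test file
total_seconds = 20.0          # observed load time over h5pyd (a lower bound)
requests_per_object = 29 / 10 # ratio seen for one group of 10 objects

estimated_requests = objects * requests_per_object
seconds_per_request = total_seconds / estimated_requests

# Hypothetical scenario: all metadata for an object returned in one call.
one_call_per_object = objects * seconds_per_request

print(f"{estimated_requests:.0f} requests, {seconds_per_request * 1000:.0f} ms each")
print(f"one request per object: about {one_call_per_object:.1f} s")
```

Under those assumptions, each request costs on the order of 86 ms, so cutting the request count is where most of the gain would come from.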

Binary data reads are fast, though.

rayosborn avatar Mar 07 '17 18:03 rayosborn

I've added some caching logic to the group class. Try out this latest checkin: https://github.com/HDFGroup/h5pyd/commit/19994179a7bcbc23304057647e2fa953f9ccf57c.

This is not a single-operation recursive load, but I saw a speedup of about 4x when walking the tree for the sample Nexus file. This was measured using the hsls.py script in the app directory.
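To illustrate the general idea of caching at the group level (this is a hypothetical sketch, not the code in the commit above), a group can fetch its link listing once and answer later lookups from memory. The `CachedGroup` class and the `fake_fetch` callable, which stands in for a GET request, are invented for this example.

```python
class CachedGroup:
    """Hypothetical group wrapper that fetches its links at most once."""

    def __init__(self, fetch_links):
        self._fetch_links = fetch_links  # callable standing in for a GET request
        self._links = None               # filled in on first access

    def _load(self):
        if self._links is None:
            self._links = dict(self._fetch_links())
        return self._links

    def __getitem__(self, name):
        return self._load()[name]

    def __iter__(self):
        return iter(self._load())

    def __len__(self):
        return len(self._load())


calls = []  # records each simulated request

def fake_fetch():
    calls.append("GET links")
    return {"data": "dataset-id-1", "title": "dataset-id-2"}


group = CachedGroup(fake_fetch)
names = list(group)                     # triggers the single fetch
ids = [group[name] for name in names]   # served from the cache
print(len(calls))
```

Iterating and then indexing every member costs one simulated request here, rather than one per member.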

jreadey avatar Mar 08 '17 21:03 jreadey

@rayosborn - did you get a chance to try this out?

jreadey avatar Mar 09 '17 20:03 jreadey

I have tested it, but I'm not sure of the previous speeds because I forgot to do a proper timing before upgrading. I need to revert to the old version to check. However, I don't think I saw a factor of four. It might have been a factor of two.

rayosborn avatar Mar 09 '17 21:03 rayosborn

There will be some variability based on the latency between client and server. My testing was with a server running on the same LAN. Also, the test driver is different.

Did the NeXpy GUI need a lot of mods to work with h5serv? I could set it up in my environment.

jreadey avatar Mar 09 '17 21:03 jreadey

I haven't made any changes to the NeXpy GUI yet. In the latest development version on my own clone of the nexusformat API, the nxremote branch has an added file, which subclasses the NXFile class for remote access. I was thinking of pushing this version to PyPI, since it is a test feature that only users with h5pyd would even be able to access. I'll let you know when I've done that.

rayosborn avatar Mar 09 '17 21:03 rayosborn

If you push the branch to GitHub, I can just grab it from there.

How would I use it to list the contents of a Nexus file?

jreadey avatar Mar 09 '17 21:03 jreadey

The nxremote branch has been published on my GitHub. You can load a file by typing:

>>> a=nxloadremote(filepath, domain='exfac.org', server='some.server:5000')
>>> print(a.tree)

The file path is the path relative to the data directory. The module converts that to a domain name. The top domain is currently 'exfac.org' to match the test repository.

rayosborn avatar Mar 10 '17 15:03 rayosborn

@rayosborn - some updates on this old issue... By default, h5pyd.File(filepath) returns all the metadata for the domain in the response to the initial request. h5pyd caches this, so subsequent attribute reads and link accesses don't need to talk to the server. The server limits the number of objects fetched this way to 500, so that the GET request doesn't take an inordinate amount of time for domains with lots of attributes and/or links.

To compare performance without the prefetch, you can use h5pyd.File(filepath, use_cache=False). This will return just the information on the root group.
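One way to run that comparison is a small timing helper like the sketch below. The `best_time` function is an illustrative assumption and uses only the standard library; the h5pyd calls appear in comments because they need a running server, and `filepath` and `walk_tree` there are placeholders for your own domain path and tree-walking code.

```python
import time


def best_time(func, repeats=3):
    """Return the shortest wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical usage against a running server:
#
#   import h5pyd
#
#   def load(**kwargs):
#       with h5pyd.File(filepath, **kwargs) as f:
#           walk_tree(f)  # placeholder for whatever walks the tree
#
#   print("prefetch:   ", best_time(lambda: load()))
#   print("no prefetch:", best_time(lambda: load(use_cache=False)))

# Stand-in workload so the helper can be run without a server.
elapsed = best_time(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```

Taking the best of several runs reduces the effect of network jitter, though the first run may also warm server-side caches, so it is worth looking at the individual timings as well.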

jreadey avatar Dec 01 '22 19:12 jreadey

Thanks, @jreadey. I can't look into this for a couple of weeks, but I plan to soon.

rayosborn avatar Dec 06 '22 20:12 rayosborn