
Performance on large data

Open kgoebbels opened this issue 9 years ago • 11 comments

We have to send many small files (~8 MB each) to an HDF5 server in a short time. Although there is a 1 Gbit/s connection on both sides, only around 6 MB/s is achieved.

Is there any way we can improve the speed, or is this a general problem? Most of the time seems to be spent writing to the HDF5 file.

kgoebbels avatar Nov 03 '15 14:11 kgoebbels

Hi - your question is interesting. I haven't done much in the way of performance testing yet, but it's not surprising to see the throughput rates you are getting. Since in the REST API binary data is converted to JSON, transmitted over the wire, and then reconstituted into binary data on the server, there's a fair amount of overhead involved.
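Roughly, each value write goes through a round trip like the following (a sketch to illustrate the overhead, not the actual h5serv code):

```python
import json
import numpy as np

data = np.random.rand(1024 * 1024)               # ~8 MB of float64 on the client

payload = json.dumps({"value": data.tolist()})   # binary -> JSON text (considerably larger than 8 MB)
# ... the JSON payload is what travels over the wire ...
received = json.loads(payload)["value"]          # JSON text -> Python lists on the server
arr = np.asarray(received, dtype=np.float64)     # lists -> binary again before the HDF5 write
```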

One thing I have on my todo list is to support a content type of octet-stream (or something similar) for GETs and PUTs on dataset values. That way the server won't need to parse the data and can pump it directly into the dataset. My assumption is that dataset value GET and PUT are the most performance-critical operations, so supporting binary data for these would improve performance the most.

If you are looking for a quick improvement and have the option of copying files directly to the server data directory, that will of course be faster than transmitting data via the REST API.

Is read performance less critical for you?

jreadey avatar Nov 04 '15 06:11 jreadey

Hi, thanks for your answer.

Currently we only need to transfer the data to the server. So read performance is not critical for us at the moment.

For now we're using rsync, but we'll have a look at your project for future usage.

kgoebbels avatar Nov 10 '15 08:11 kgoebbels

I'm planning to work on supporting octet-streams later this year. That should be a big help.

BTW, one use case where the server would have an advantage over rsync is when you have multiple writers that want to update the same file (say you had multiple sensors that were feeding in update streams).

Another, obviously, is when the client doesn't have a shared filesystem and you need to rely on HTTP transport.

jreadey avatar Nov 10 '15 16:11 jreadey

@kgoebbels - If you've collected any performance timing profiles, I'd love to see them. I'm surprised you said that most of the time is spent writing to the hdf5 file; I'd have thought it was network delay or JSON conversion. Anyway, it just goes to show that it's not very useful to start on performance optimization without knowing where the bottlenecks are!

jreadey avatar Nov 10 '15 20:11 jreadey

You can find some performance measurements with my Python client here.

They are not that detailed, but I hope they are useful for you.

kgoebbels avatar Nov 11 '15 13:11 kgoebbels

Thanks! Is this going across the network or is it writing to a local instance of h5serv?

I see what you mean now by "most of the time is spent writing to the hdf5 file". Not surprising, since that's where most of the data transfer is happening. As a next step it would be useful to break out what's actually happening on the server side: time for the request to be transmitted, time for h5serv to convert the JSON request into a numpy array, time for the actual HDF5 write, etc.
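A rough first cut could even be done from the client side by timing the JSON encode separately from the request round trip, something like this sketch (the server URL, domain, and dataset UUID are placeholders):

```python
import json
import time

import numpy as np
import requests

# Placeholder endpoint and domain -- substitute your own h5serv host and dataset UUID.
ENDPOINT = "http://h5serv.example.com/datasets/<dataset-uuid>/value"
HEADERS = {"host": "mydata.hdfgroup.org", "Content-Type": "application/json"}

data = np.random.rand(1024 * 1024)               # ~8 MB of float64

t0 = time.perf_counter()
body = json.dumps({"value": data.tolist()})      # client-side JSON encoding
t1 = time.perf_counter()
rsp = requests.put(ENDPOINT, data=body, headers=HEADERS)
t2 = time.perf_counter()

print(f"json encode: {t1 - t0:.2f}s  network + server: {t2 - t1:.2f}s  status: {rsp.status_code}")
```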

I likely won't get to this immediately, but will set aside some time soon to look into it.

jreadey avatar Nov 11 '15 16:11 jreadey

The data was transferred across the network to an instance of h5serv.

kgoebbels avatar Nov 13 '15 12:11 kgoebbels

One thing that might help is to fire off multiple async requests to the server, rather than sending a request, waiting for the response, sending another request, and so on. This makes the coding a little trickier for the client, though.
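For example, with a thread pool on the client (the server URL, domain, and dataset UUIDs below are placeholders):

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

# Placeholder server URL and target domain.
BASE = "http://h5serv.example.com"
HEADERS = {"host": "mydata.hdfgroup.org", "Content-Type": "application/json"}

# One value-write per dataset (the dataset UUIDs here are placeholders).
jobs = [
    (f"{BASE}/datasets/dataset-uuid-{i}/value", json.dumps({"value": list(range(100))}))
    for i in range(10)
]

def put_value(url, body):
    # Each worker sends its request independently instead of waiting for the previous reply.
    return requests.put(url, data=body, headers=HEADERS)

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(put_value, url, body) for url, body in jobs]
    for fut in as_completed(futures):
        print(fut.result().status_code)
```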

I'm labeling this issue as an "enhancement" since it's requesting a performance improvement.

jreadey avatar Nov 13 '15 23:11 jreadey

FYI, I created issue #76 to track work on the binary transfer support. Also, I'll be doing some performance characterization of h5serv as part of the datacontainer project: https://github.com/HDFGroup/datacontainer.

@kgoebbels - in your use case are you:

  1. creating many small files on the server
  2. creating many datasets in one file
  3. writing to one dataset?

For 1, you might be interested to know that there is now support for getting a list of files on the server through the REST API. See test/integ/dirtest.py for an example of how this works.

If 2, do you have a convention to avoid link name clashes (two clients trying to add the same link)?

If 3, there's a bit of a synchronization issue if two clients are trying to extend the same dataset. I've been thinking about adding an append operation to get around this.

jreadey avatar Feb 01 '16 16:02 jreadey

My use case is your number 2 ("creating many datasets in one file").

On the client side I have a lot of small files, but on the server side I want to collect the data from all of these files in one single HDF5 file. So my client reads each file and sends its data to the HDF5 server; for every file on the client side, a new dataset is created in the same HDF5 file.

To avoid link name clashes I am using a "ProcessPoolExecutor" in my Python code, so every worker process reads its own file and uses its own link.
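The pattern looks roughly like the following sketch; the server URL, domain, dataset type, and file-reading details are placeholders rather than my exact code, and the request bodies are my approximation of the REST calls (see the h5serv docs/tests for the exact format):

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import numpy as np
import requests

# Placeholder server URL and target domain.
SERVER = "http://h5serv.example.com"
HEADERS = {"host": "collected.mydata.hdfgroup.org", "Content-Type": "application/json"}

def upload_file(path):
    """Read one small local file and turn it into one dataset on the server."""
    data = np.fromfile(path, dtype=np.float64)    # placeholder reader for the ~8 MB files
    link_name = Path(path).stem                   # unique per worker, so no link clashes

    # Create the dataset and link it under the root group
    # ("<root-uuid>" stands in for the root group's UUID).
    rsp = requests.post(
        f"{SERVER}/datasets",
        data=json.dumps({
            "type": "H5T_IEEE_F64LE",
            "shape": [data.size],
            "link": {"id": "<root-uuid>", "name": link_name},
        }),
        headers=HEADERS,
    )
    dset_id = rsp.json()["id"]

    # Write the values into the new dataset.
    requests.put(
        f"{SERVER}/datasets/{dset_id}/value",
        data=json.dumps({"value": data.tolist()}),
        headers=HEADERS,
    )
    return link_name

if __name__ == "__main__":
    files = sorted(Path("incoming").glob("*.dat"))   # placeholder directory and pattern
    with ProcessPoolExecutor(max_workers=4) as pool:
        for name in pool.map(upload_file, files):
            print("created dataset", name)
```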

kgoebbels avatar Feb 02 '16 08:02 kgoebbels

@kgoebbels - Issue #76 (binary transfer for read and write) is complete. I'd be interested to hear whether this helps performance in your case. You can take a look at test/integ/valuetest.py: testPutBinary() to see how it works. In my testing, performance was 10x better using binary transfers (though YMMV).
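For reference, a minimal client-side sketch of a binary value write looks something like this (placeholder server, domain, and dataset UUID; the dataset is assumed to already exist, and valuetest.py is the authoritative example of the request format):

```python
import numpy as np
import requests

# Placeholder server URL, domain, and dataset UUID.
DSET_URI = "http://h5serv.example.com/datasets/<dataset-uuid>/value"
HEADERS = {"host": "mydata.hdfgroup.org", "Content-Type": "application/octet-stream"}

data = np.random.rand(1024 * 1024)                  # ~8 MB of float64

# Send the raw bytes directly: no JSON encode on the client, no JSON parse on the server.
rsp = requests.put(DSET_URI, data=data.tobytes(), headers=HEADERS)
print(rsp.status_code)
```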

jreadey avatar Apr 22 '16 19:04 jreadey