
Z5 performance

Open weilewei opened this issue 5 years ago • 15 comments

Hi,

I am not sure if this is a correct question to ask. How do you see the performance of parallel I/O for Z5 in the context of distributed computing, compared with other I/O libraries such as HDF5? This is just a general question: I am thinking of integrating Z5 into HPX (https://github.com/STEllAR-GROUP/hpx), a C++ Standard Library for Concurrency and Parallelism, in the near future, and I would like to see if there is any performance benchmark I can refer to and any performance comparison I can make. This could be my graduate study project.

Over this summer, I am working on a C API for Z5 (https://github.com/kmpaul/cz5test). The idea of this project is to test Z5 performance against another parallel I/O library. The project is still in progress; I would love to share the results with you once they are in place.

Any suggestion will be helpful. Thanks.

weilewei avatar Jun 26 '19 16:06 weilewei

I think @aschampion designed the benchmarks for rust-n5 to match some benchmarks being done on the java reference implementation of N5, ~~although I don't know where the java benchmarks are,~~ which are here. Not sure about the zarr side of things, although I personally see more of a future for zarr than N5, as a format.

clbarnes avatar Jun 26 '19 17:06 clbarnes

I am not sure if this is a correct question to ask.

No worries, this is a perfectly legit question to ask here.

How do you see the performance of parallel I/O for Z5 in the context of distributed computing, compared with other I/O libraries such as HDF5?

In general, the big advantage of z5 (or, to be more precise, of n5 / zarr) compared to hdf5 is that it supports parallel write operations. It's important to note that this only works as long as chunks are not accessed concurrently (in z5, concurrent access to the same chunk leads to undefined behaviour). @clbarnes and I actually started to implement file locking for chunks quite a while back, but this has gone stale, mainly because there are several issues with file locking in general. If you want to know more about this, have a look at #65, #66 and #63.
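To make the chunk constraint concrete, here is a minimal sketch of two threads writing to disjoint, chunk-aligned blocks of the same dataset. It follows the C++ examples in the repository, but treat the exact header paths, namespaces (e.g. z5::filesystem::handle vs. z5::handle) and signatures as assumptions, since they differ between versions:

#include <thread>
#include <vector>

#include "xtensor/xarray.hpp"

#include "z5/filesystem/handle.hxx"
#include "z5/factory.hxx"
#include "z5/multiarray/xtensor_access.hxx"

int main() {
    // create a zarr container with a 200 x 100 float dataset, chunked as 100 x 100,
    // i.e. two chunks along the first axis
    z5::filesystem::handle::File f("data.zr");
    z5::createFile(f, true);  // true -> zarr format
    std::vector<std::size_t> shape = {200, 100};
    std::vector<std::size_t> chunks = {100, 100};
    auto ds = z5::createDataset(f, "data", "float32", shape, chunks);

    // each thread writes one chunk-aligned block, so no chunk is touched by more than one thread
    auto writeBlock = [&ds](std::size_t chunkRow, float value) {
        xt::xarray<float>::shape_type blockShape = {100, 100};
        xt::xarray<float> block(blockShape, value);
        z5::types::ShapeType offset = {chunkRow * 100, 0};
        z5::multiarray::writeSubarray<float>(ds, block, offset.begin());
    };

    std::thread t0(writeBlock, 0, 1.0f);  // writes chunk (0, 0)
    std::thread t1(writeBlock, 1, 2.0f);  // writes chunk (1, 0)
    t0.join();
    t1.join();
    return 0;
}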

In terms of single-threaded performance, z5 and hdf5 are roughly equal (at some point I compared the Python bindings against h5py).

I am thinking of integrating Z5 into HPX (https://github.com/STEllAR-GROUP/hpx), a C++ Standard Library for Concurrency and Parallelism, in the near future.

That would be awesome!

I would like to see if there is any performance benchmark I can refer to and any performance comparison I can make.

Besides the java / n5 benchmarks that @clbarnes mentioned, here are the benchmarks I used while developing the library, which also compare against hdf5: https://github.com/constantinpape/z5/tree/master/src/bench

Since then, I have started to set up an asv benchmark repository (https://github.com/constantinpape/z5py-benchmarks), but it is still unfinished (any contributions would be very welcome!).

Over this summer, I am working on a C API for Z5 (https://github.com/kmpaul/cz5test).

I will give you some feedback on this in the related issue #68.

constantinpape avatar Jun 26 '19 19:06 constantinpape

Thanks for the explanations @constantinpape @clbarnes! I have been reading your comments and looking around these days.

In general, the big advantage of z5 (or, to be more precise, of n5 / zarr) compared to hdf5 is that it supports parallel write operations. It's important to note that this only works as long as chunks are not accessed concurrently (in z5, concurrent access to the same chunk leads to undefined behaviour). @clbarnes and I actually started to implement file locking for chunks quite a while back, but this has gone stale, mainly because there are several issues with file locking in general. If you want to know more about this, have a look at #65, #66 and #63.

It's great to see that z5 supports parallel writes; it might be a good fit for HPX, which can launch many threads and processes. It's also a pity that hdf5 currently does not support parallel writes; I just noticed that they describe their parallel design here: https://portal.hdfgroup.org/display/HDF5/Introduction+to+Parallel+HDF5. This also means that I probably cannot get a performance comparison between z5 and hdf5 in the context of parallel computing in the near future.

Also, I am not sure how to solve the file locking issue at this point; I will need to look into it in detail later.

In terms of single-threaded performance, z5 and hdf5 are roughly equal (at some point I compared the Python bindings against h5py).

Thanks, good to know.

Besides the java / n5 benchmarks that @clbarnes mentioned, here are the benchmarks I used while developing the library, which also compare against hdf5: https://github.com/constantinpape/z5/tree/master/src/bench Since then, I have started to set up an asv benchmark repository (https://github.com/constantinpape/z5py-benchmarks), but it is still unfinished (any contributions would be very welcome!).

Thanks for providing these benchmarks! I will try to see if I can measure performance on the C/C++ side, as my projects are mainly in these two languages.

weilewei avatar Jun 28 '19 22:06 weilewei

It's also a pity that hdf5 currently does not support parallel writes; I just noticed that they describe their parallel design here: https://portal.hdfgroup.org/display/HDF5/Introduction+to+Parallel+HDF5.

My first answer was not quite precise with regard to parallel writing in HDF5. There is support for parallel writing, but it is not enabled by default. As far as I am aware, there are two options to do this in HDF5, both with some downsides:

  • Parallel HDF5 (the link you posted): it only works with MPI and does not allow thread-based access; to the best of my knowledge it is not implemented in h5py, so it is not available in Python, and it is not available in Java either (a minimal MPI sketch follows below this list).
  • The other option is to use region references (https://support.hdfgroup.org/HDF5/Tutor/reftoreg.html) and parallelize over regions that are stored in separate files. This approach becomes very similar to the chunk-based storage of N5 / zarr, but it's probably less efficient if regions become too small (I have never checked this, so it's just my gut feeling).
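To illustrate the first option, here is a bare-bones sketch of MPI-based parallel HDF5 with the C API, where every rank writes its own hyperslab of a shared dataset through the mpio driver. It assumes an HDF5 build configured with parallel (MPI) support; the file and dataset names are made up:

#include <vector>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // open the file collectively through the MPI-IO driver
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // a 2d dataset with one row per rank
    hsize_t dims[2] = {static_cast<hsize_t>(size), 1024};
    hid_t filespace = H5Screate_simple(2, dims, nullptr);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // each rank selects its own hyperslab and writes collectively
    hsize_t offset[2] = {static_cast<hsize_t>(rank), 0};
    hsize_t count[2] = {1, 1024};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
    hid_t memspace = H5Screate_simple(2, count, nullptr);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    std::vector<float> data(1024, static_cast<float>(rank));
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, data.data());

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}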

constantinpape avatar Jun 29 '19 07:06 constantinpape

I see. Thanks for sharing and answering. @constantinpape

weilewei avatar Jul 01 '19 19:07 weilewei

Sorry for bringing up another performance issue. Could you please take a look at the issue I opened here: https://github.com/QuantStack/xtensor/issues/1695? Let me know if you have any suggestions. Thanks.

weilewei avatar Jul 26 '19 19:07 weilewei

Could you please take a look at the issue I opened here: QuantStack/xtensor#1695?

Thanks for bringing this up. I just came back from vacation; I had a quick look and I think I have some ideas. I will try to take a closer look and write something up tomorrow.

constantinpape avatar Aug 02 '19 22:08 constantinpape

First, let me provide some context on the performance issue you brought up:

This header contains the functions to read / write a region of interest (ROI) from / into a dataset into / from an xtensor multiarray. I will focus on reading here; writing is mostly analogous.

Reading the ROI works roughly as follows: we iterate over all chunks that overlap the ROI and read each chunk into a flat buffer. For a given chunk there are two cases: either the chunk is completely contained in the ROI (complete overlap) and the whole buffer is copied into the corresponding view of the output array, or the chunk only partially overlaps the ROI and only the overlapping part is copied.

For the first case (complete overlap), I noticed that the naive way of copying via xtensor functionality

const auto bufView = xt::adapt(buffer, chunksShape);  // adapt the flat chunk buffer to the chunk's shape
view = bufView;  // assign into the (strided) view of the output array

was a major bottleneck, so I implemented a function to copy from buffer to view myself. This function does not work for 1d tensors though, and I did not bother to fix this or implement a separate copy function for 1d, so I just fall back to the naive xtensor copy here. This seems to be the performance issue that you encountered.

I see two options to deal with this:

  1. Fix / extend copyViewToBuffer and copyBufferToView so that they also work for 1d arrays.
  2. Investigate how to improve performance within the xtensor functionality.

Option 1 should be straightforward: I think the functions would only need to be changed a bit, or a special function for the 1d case could be implemented.
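For illustration, a hypothetical 1d special case could be as simple as an element-wise copy, since the iterators of an xtensor view already take care of the stride (the function name and signature below are made up for this sketch; they are not the actual copyBufferToView):

#include <algorithm>

// hypothetical 1d fallback: copy a flat chunk buffer into a (possibly strided) 1d view;
// the view's iterators handle the stride, so a plain element-wise copy is enough
template<class BUFFER, class VIEW>
inline void copyBufferToView1d(const BUFFER & buffer, VIEW & view) {
    std::copy(buffer.begin(), buffer.end(), view.begin());
}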

Option 2 would be more interesting, though: I have only tried the naive approach with xtensor, i.e. I did not specify the layout types for the views into the array and the buffer. Doing so might improve performance enough, maybe even enough to get rid of my custom copy functions completely. I am not quite sure how well this would work, though, because the view into the multiarray is strided. Maybe @SylvainCorlay, @wolfv or @JohanMabille could provide some insight here.

If we were to get rid of the custom copy functions completely, this would need to be benchmarked carefully, because I don't want to regress from the current performance.

constantinpape avatar Aug 03 '19 13:08 constantinpape

Thanks for the update. @halehawk and I will look into this soon.

weilewei avatar Aug 08 '19 16:08 weilewei

Just FYI, my presentation of the summer intern project is online now (https://www2.cisl.ucar.edu/siparcs-2019-wei); there I report how we integrated Z5 into an earth model and compared the performance of Z5, netCDF4, and PnetCDF. I will keep you posted if there is any future publication.

weilewei avatar Aug 21 '19 04:08 weilewei

Just FYI, my presentation of the summer intern project is online now (https://www2.cisl.ucar.edu/siparcs-2019-wei); there I report how we integrated Z5 into an earth model and compared the performance of Z5, netCDF4, and PnetCDF.

Thanks for sharing this and great work! I have one question: Which compression library did you use in z5 for the performance analysis (slide 10/11) and did you compare the compression ratios between z5 and netCDF? Also, did you compare the performance of z5 and PnetCDF when you don't use compression in z5 (compressor=raw)?

I will keep you posted if we have any future publication.

Looking forward to it!

constantinpape avatar Aug 21 '19 07:08 constantinpape

I have one question: Which compression library did you use in z5 for the performance analysis (slide 10/11) and did you compare the compression ratios between z5 and netCDF? Also, did you compare the performance of z5 and PnetCDF when you don't use compression in z5 (compressor=raw)?

We use zlib for compression. The compression ratio between z5 and netCDF is similar. No, I haven't tried the no-compression setting for the z5 vs. PnetCDF comparison (maybe we will try it later).

weilewei avatar Aug 21 '19 12:08 weilewei

I tried the no-compression setting on z5 once and didn't see a difference in the timing. Maybe the output size is fairly small (3192288 float values on each processor).


halehawk avatar Aug 21 '19 16:08 halehawk

Ok, thanks for the follow-up.

Maybe the output size is fairly small (3192288 float values on each processor).

That's indeed fairly small. In my experience, using the raw compressor (i.e. no compression) can bring quite a speed-up.
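If you want to check, switching between the two in z5 only means passing a different compressor argument at dataset creation; roughly like this (a sketch assuming the factory-style createDataset from the C++ examples, with the compressor passed as a string; the exact signature, and how the compression level is set, may differ between versions):

// create two otherwise identical datasets, one with zlib and one without
// compression, to compare write timings and on-disk sizes; 'f' is an
// already created z5 file handle (see the earlier sketch)
std::vector<std::size_t> shape = {1024, 1024};
std::vector<std::size_t> chunks = {256, 256};
auto dsZlib = z5::createDataset(f, "data_zlib", "float32", shape, chunks, "zlib");
auto dsRaw  = z5::createDataset(f, "data_raw", "float32", shape, chunks, "raw");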

constantinpape avatar Aug 21 '19 17:08 constantinpape

We used zlib with level=1 compression. But the compressed size is larger than that of netCDF4 using the same compression, with the same chunk size on both. I have not figured out the reason for the difference yet.


halehawk avatar Aug 21 '19 17:08 halehawk