introduce chunk._getitem_r(...), a thread-safe version of chunk._getitem(...)
chunk._getitem(...) depends on and modifies instance attributes of chunk.
The consequence is that, in multi-threaded applications, access to the chunk
instance has to be controlled by locking, which prevents multi-threaded code
based on bcolz from scaling appropriately with the number of processors.
The introduced function chunk._getitem_r(...) provides the same functionality
without requiring the GIL or any locking to protect instance attributes. This
allows writing efficient multi-threaded cython algorithms based on bcolz.
Use case: visualfabriq/bquery#22
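A minimal sketch of the underlying idea, with hypothetical names and signature (this is not the actual PR diff; only the extern declaration follows the public blosc.h):

```cython
cdef extern from "blosc.h":
    # public c-blosc call for decompressing a slice of items from a chunk
    int blosc_getitem(const void *src, int start, int nitems, void *dest) nogil

cdef int getitem_reentrant(const void *compressed, int start, int nitems,
                           void *dest) nogil:
    # All inputs are explicit arguments, so nothing on the chunk instance
    # is read or written during the call. Concurrent calls therefore need
    # no lock (beyond whatever c-blosc itself imposes in versions < 1.5).
    return blosc_getitem(compressed, start, nitems, dest)
```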
This looks good, thanks. Unfortunately c-blosc only supports being called safely from multiple threads in versions >= 1.5; with earlier versions, c-blosc makes sure a lock is taken during multi-threaded operation.
Also, for reasons that are not completely clear, the c-blosc 1.5 series still has some stability problems with bcolz (e.g. https://github.com/Blosc/bcolz/issues/120, https://github.com/Blosc/bcolz/issues/121), which is why bcolz still ships with the c-blosc 1.4 series. The result is that you probably won't see any speedup with this PR (yet; not until c-blosc 1.5 can go into bcolz).
At any rate, if you can compile bcolz with c-blosc 1.5 and tell us the speed-ups that you are seeing, that would be great. I am a big fan of seeing real improvements before adding new code.
Benchmarking with c-blosc 1.5 is on hold as it depends on resolving issue Blosc/c-blosc#92. (The c-blosc 1.5 series either does not compile or produces runtime failures on my machine.)
What is the status of this? https://github.com/Blosc/c-blosc/issues/92 has been resolved, right? Does 1.6 work better on your machine now?
@esc This PR arose from the effort to parallelise factorisation in bquery. I have not worked on it recently but plan to get back to it soonish.
c-blosc 1.5.4 does compile on my machine with MSVC; however, the test suite sporadically crashes. 1.6 (or more precisely, the tip of master) I could not compile: it complained that the new compression algorithm was not available on my machine.
This would probably speed up bdot on multicore machines as well. It might even enable greater-than-numpy speeds under more conditions.
Would love to see this become a part of bcolz.
I'll try to construct a test of bdot with code from this PR, as soon as I can.
@waylonflinn I am not completely sure that the bottleneck this was supposed to circumvent was real. There were issues with my test database as well as with c-blosc at the time, and I have not investigated the issue with the current c-blosc yet. I have been meaning to look into it, but it is currently a fairly low priority for me. Just wanted to give you a heads-up.
I am not familiar with bdot. Would you mind posting a link to the relevant code so I can get a better idea of the context?
I would be very interested in any benchmark results indeed.
@ARF1 thanks for the heads up!
bdot is pretty new (I released the first version this weekend). It's an extension like bquery. It allows you to calculate matrix products on matrices stored in carrays.
Here's a link to one of the relevant pieces of code: https://github.com/pinleague/bdot/blob/master/bdot/carray_ext.pyx#L68
@waylonflinn Thanks. That actually looks very useful to me too! Is there a list of bcolz-based projects anywhere? More and more seem to be popping up.
More to the point, I can understand your interest in parallel processing of chunks. I am sure you are already aware of cython prange. If not, there is a multiprocessing WIP PR for bquery that can serve as inspiration. (Hint: with the classic `_getitem()` you will need nested `with nogil:` and `with gil:` blocks.)
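A rough illustration of that hint; the names and the per-item work are invented stand-ins, only the nested nogil/gil pattern is the point:

```cython
from cython.parallel import prange

cdef double crunch(double x) nogil:
    # stand-in for heavy per-chunk numeric work needing no Python objects
    return x * x

def process(list values, double[::1] out):
    cdef Py_ssize_t i, n = len(values)
    cdef double v
    for i in prange(n, nogil=True):
        with gil:
            # anything touching Python objects, e.g. the classic
            # chunk._getitem(), must briefly re-acquire the GIL here
            v = <double>values[i]
        out[i] = crunch(v)  # back outside the GIL for the heavy work
```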
I am not sure, however, that the no-GIL thread-safe version of `_getitem()` will make a huge difference in addition to prange: it is my understanding that the current c-blosc version enforces a global lock. In other words, you are likely to avoid the Python GIL only to run into the c-blosc equivalent. That said, I believe c-blosc already uses all available cores for decompression of chunks anyway, unless the chunks are very small.
Still, benchmarks would be really good to have, and your application is probably ideally suited for such an experiment. We would finally know whether this PR is worth keeping active.
@ARF1 you may want to join the bcolz google group:
https://groups.google.com/forum/#!forum/bcolz
We have been discussing bdot over there for a few days already.
Yes, @ARF1's suggestion of using prange is a good one. And he is right, Blosc used to impose a lock itself (similar to the GIL), but with c-blosc >= 1.5 there are a couple of calls (those ending in '_ctx') that are meant to run c-blosc without a lock. I am experimenting with that in the branch https://github.com/Blosc/bcolz/tree/c-blosc-1.7, and the results are promising. Hopefully compression without locks will happen in the next release of bcolz (keeping my fingers crossed).
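For reference, a minimal sketch of what the context API looks like from Cython (the extern declaration follows the public blosc.h of c-blosc >= 1.5; the wrapper around it is illustrative only):

```cython
cdef extern from "blosc.h":
    # context variant: all state lives in a per-call context instead of
    # the global one, so no global lock is needed
    int blosc_decompress_ctx(const void *src, void *dest, size_t destsize,
                             int numinternalthreads) nogil

cdef int decompress_chunk(const void *src, void *dest,
                          size_t destsize) nogil:
    # illustrative wrapper: one internal thread per call, letting the
    # caller parallelise across chunks rather than within a single chunk
    return blosc_decompress_ctx(src, dest, destsize, 1)
```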
And yes, moving this into the mailing list may be a good thing.
One newbie but serious question (it's something I'm struggling with in bquery): how do you ensure the order of chunk processing when running them in parallel? I can imagine a wait mechanism, but wouldn't that potentially cause buffer issues (one very slow chunk followed by very quick chunks)?
I don't think that prange allows you to ensure an order, but if you want that, then use regular range, right?
@FrancescAlted With regular range you of course lose the ability to process your chunks in parallel.
I think @CarstVaartjes was asking for a silver bullet: parallel processing of chunks, in-order results, no waiting... Intuitively this sounds like a classic application for the producer-consumer pattern, but unfortunately with OpenMP (cython threads) that is not a pattern that lends itself to easy implementation.
@CarstVaartjes You might remember I also had this question when I started my multiprocessing branch of bquery.
When you use `i in prange(0, 100)` with two threads, thread 1 gets the range 0:50 while thread 2 gets the range 50:100. Within each thread the results are in order, but since the threads run in parallel you end up with your results in a mangled mess.
There are several ways of dealing with this:
- Since you have the variable `i`, you know which chunk your result belongs to, so you can save the result out-of-order (i.e. using the save equivalent of `_getitem()`). This is what I did initially with multiprocessing bquery. Not too elegant, since the natural chunksize of your result array might be quite different from that of the source array.
- You could save the results into a results array in the order they arrive but remember the order in which they were saved, then reorder in a second step. This deals with the chunksize issue and is easy to implement, but it always sounded like "cheating" to me...
- You could buffer the results from each thread in a thread-local buffer with a buffer size equal to the optimal chunksize of the results array and write these buffers as properly numbered out-of-order result-array chunks to disk. This also gets around the mis-adjusted chunksize issue and the wait issue, but it is a bit of a programming/debugging challenge.
- You could go the easy road and play with the `schedule='dynamic'` and `chunksize` parameters of `prange` (see the sketch after this list). Depending on the structure of the data you are processing, this can help alleviate the issue of the wait mechanism with slow chunks by assigning "super-chunks" (consisting of multiple chunks) to the threads. If you only wait for each super-chunk, the difference in processing time of an individual chunk may (or may not) average out over the super-chunk. Not really a good all-purpose solution, but maybe useful in some particular cases.
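A minimal sketch of that last option, assuming a plain numeric reduction as the per-chunk work (function name and parameter values are illustrative):

```cython
from cython.parallel import prange

def sum_squares(double[::1] data):
    cdef Py_ssize_t i
    cdef double total = 0.0
    # schedule='dynamic' hands out blocks of `chunksize` iterations to
    # whichever thread is idle, so one slow block does not stall the rest;
    # larger chunksize values correspond to the "super-chunks" above
    for i in prange(data.shape[0], nogil=True,
                    schedule='dynamic', chunksize=16):
        total += data[i] * data[i]  # Cython turns this into a reduction
    return total
```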
I have a semi-working implementation of the thread-local buffer approach in an unpublished multiprocessing bquery branch. Let me know if you are interested in using it as a quarry for code segments.
Of course, in all cases of bcolz chunk processing, even a sub-optimal multiprocessing solution is likely to be faster than a single-core solution. It's just the scalability with the number of cores that suffers, right?