
Single threaded 1D integration : performance glitch?


I am seeing something like 10 fps per core for our Eiger 4M images, which seems to be in line with the pyFAI-benchmark plots in the docs.

For some reason I timed a fast_histogram.histogram1d and it comes in around 10 ms, which would be 100 fps, but without pixel splitting etc. That number motivated some messing about: taking your CSR matrix, trying to tune it to reduce the matrix data size, and then running the dot product via numba. So far it reached 18 ms, or about 50 fps, for the dot product. These seem worth turning into patches if they are getting the numbers right - what do you think?
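As a minimal sketch of the fast_histogram baseline (assuming arrays tth, the radial position of every pixel, and img already exist; this is the plain-histogram case, not a full pyFAI integration):

import numpy as np
from fast_histogram import histogram1d

npt = 1000
lo, hi = float(tth.min()), float(tth.max())

count = histogram1d(tth.ravel(), bins=npt, range=(lo, hi))                         # pixels per bin
signal = histogram1d(tth.ravel(), bins=npt, range=(lo, hi), weights=img.ravel())   # summed intensity
profile = signal / np.maximum(count, 1)                                             # mean intensity per radial bin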

The CSR python code in pyFAI is giving: "CPU times: user 119 ms, sys: 32.4 ms, total: 151 ms", which is consistent with doing two dot products with scipy.sparse.csr.dot at 59 ms each. The second product looks like it could be cached when processing a series of frames, but maybe I have missed something that changes between frames. Do you know what the system time is used for?

Some code should be attached: SingleCore1D.md

jonwright avatar Jul 08 '22 17:07 jonwright

pyFAI does not perform 1 or 2 histograms per integration, but 4, and soon 5, to be able to handle uncertainties (needed for sigma-clipping). On my Epyc processor (i.e. @ home) histograms are faster than CSR products due to its huge cache size. Which algorithm/implementation are you referring to: cython-csr or cython-histogram? To get decent performance, one needs to tune the placement of those jobs, for example with taskset: https://gist.github.com/kif/d3c4c30bea79cd5c86b0074091128e40 To get code that scales, Cython is not the right tool (and neither is numba...). The best scaling was observed with OpenCL in this Gist.
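For reference, a hedged sketch of what "4 histograms per integration" looks like with a CSR lookup table (csr, img, variance and norm are assumed to already exist; the real engines are in Cython/OpenCL):

import numpy as np

sum_signal = csr.dot(img.ravel())         # weighted signal per radial bin
sum_variance = csr.dot(variance.ravel())  # variance per bin (strict error propagation uses squared coefficients)
sum_norm = csr.dot(norm.ravel())          # solid angle / flat-field normalization
sum_count = csr.dot(np.ones(img.size, dtype=np.float32))  # (fractional) pixels per bin

intensity = sum_signal / np.maximum(sum_norm, 1e-10)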

About your benchmarks: you are using %time; please use the timeit module, as is done for the pyFAI benchmarks.
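For example, something along these lines (the integrate1d_ng arguments are only illustrative; ai and img are assumed to exist):

import timeit

# Repeat the full integration instead of relying on a single %time run
timings = timeit.repeat("ai.integrate1d_ng(img, 1000)",
                        globals=globals(), repeat=7, number=10)
print(min(timings) / 10, "s per integration")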

Honestly, I have not had time to profile integrate1d_ng intensively (and even less the 2D version), so it is possible that object resets occur silently.

The dozens of rebinning engines are now fairly well aligned, and once all of them are aligned it will be possible to simplify integrate1d_ng (the Python method in AzimuthalIntegrator) a lot. Since all methods have the same interface, you can directly use the engines, which are all stored in the ai.engines dict.

About fast histograms, did you know that silx also has several fast histograms (much faster than those of numpy)?

kif avatar Jul 08 '22 21:07 kif

Thanks for taking a look. For the example in your gist, the best time I could see is about 23 ms when running in parallel, which should correspond to something like an 8X speedup, or >160 ms on a single core? I copied the code above into a gist to make it easier to view here. Looking at that example, it seems like there is a way to get towards the 23 ms timing on a single core.

jonwright avatar Jul 09 '22 10:07 jonwright

So you are using the CSR matrix, transposed, to perform histograms? Then it is no longer parallelizable within a single integration. The CSR format was chosen precisely because a single integration can be parallelized.

Did you try it with taskset -pc 0-38:2 to use a single processor? You should get at least a factor 10 speed-up on 18 cores. On an AMD Epyc computer, the job should even be restricted to a set of cores sharing the same L3; lstopo is your friend. When performance really matters, use a GPU. You have numba-cuda there if you prefer CUDA to OpenCL... The guidelines from my management are clear: focus on GPU, that is where the added value lies. The Cython code is just there to make demonstrators run everywhere.

kif avatar Jul 09 '22 10:07 kif

For the scaling - I was looking for something single-threaded, to scale via one frame per CPU thread. An example with concurrent.futures seems to be not too bad: it only drops to 30 ms per frame when all 40 cores are working, which is above 1 kHz overall.

jonwright avatar Jul 09 '22 13:07 jonwright

I just checked, see profiling.pdf, and indeed:

  • scipy.sparse is faster than the Cython code from pyFAI when working single-threaded (157 vs 176 ms)
  • The CSC matrix format is even faster than CSR by a couple of milliseconds per call, 26 vs 28 ms (and there is a minimum of 3 calls)
  • Building a CSC integrator from the CSR integrator takes 3 lines of Python; this is low-hanging fruit for integration into pyFAI (see the sketch after this list)
  • The overhead of AzimuthalIntegrator is 3 ms
  • More than 60 ms (63 and 68 were measured) are spent in the preprocess part, which is in charge of the dynamic masking (among other things)
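A minimal sketch of that CSR-to-CSC conversion (attribute names are illustrative; the actual pyFAI code may differ, and scipy does the heavy lifting):

import scipy.sparse

# engine.data / engine.indices / engine.indptr hold the CSR lookup table
csr = scipy.sparse.csr_matrix((engine.data, engine.indices, engine.indptr))
csc = csr.tocsc()  # same coefficients, compressed by column
# csc.data, csc.indices, csc.indptr can then feed a CSC-based integrator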

kif avatar Jul 09 '22 20:07 kif

I was confused as to why my csr.dot was 3X slower. It seems that converting to float32 in preproc changes the output type:

pye = list(ai.engines.values())[1].engine
%timeit result = pye._csr.dot(img.ravel())
print(pye._csr.dot(img.ravel()).dtype)

59.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
float64

signal = img.ravel().astype(np.float32)
%timeit result = pye._csr.dot(signal)
print(pye._csr.dot(signal).dtype)

21.6 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
float32

For now I didn't see a way around that without compiling something?

jonwright avatar Jul 10 '22 08:07 jonwright

So it seems the dynamic mask and 3 integrations might be done in ~45 ms for this case (singleThread6M-mask_3array.pdf). Although it is not really clear how to set this up as a pull request!

jonwright avatar Jul 10 '22 12:07 jonwright

Your code raises a comment: the accumulators should not be float32. While the matrix coefficients, the diffraction intensities and the normalization are all fine with 32 bits, you need at least 64 bits for the accumulators, otherwise you will lose several % of precision in the result. One can cut corners on the count and the normalization, but not on the signal accumulator (nor on the variance, obviously!).
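A quick, self-contained illustration of the point (not pyFAI code; just a naive sequential accumulation like the one inside a sparse dot product):

import numpy as np

rng = np.random.default_rng(0)
values = rng.random(1_000_000).astype(np.float32)

acc32 = np.float32(0.0)   # 32-bit accumulator: rounding error grows with the number of terms
acc64 = np.float64(0.0)   # 64-bit accumulator: negligible error for float32 data
for v in values:
    acc32 += v
    acc64 += v

print(acc32, acc64, abs(float(acc32) - float(acc64)) / float(acc64))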

There is no way to work around dynamic masking without compiled code (numba is also compiled). There is a Python implementation of preproc in pyFAI/engines/preproc.py, but it is usually much slower.

I just reviewed the preproc implementation in Cython and it has not yet been fixed for a performance bug occurring on large NUMA systems: a parallel for on such memory-bound code does not scale beyond 4 to 8 threads. #1553

kif avatar Jul 10 '22 14:07 kif

The performance of preprocessing is terrible on my computer (see Figure 1). The best fix would probably be to disable OpenMP for this function!

A lot of time (~50%) is lost in the output-array malloc.

To be constructive, one could imagine having a preproc_single_point cdef function which could be called from several histogram functions, but I fear this would be complicated.

kif avatar Jul 10 '22 16:07 kif

I seem to remember that bitshuffle only scales well over a few cores, so for our data that is another reason to run several smaller jobs...

jonwright avatar Jul 11 '22 06:07 jonwright

Vincent highlighted that the issue of "preproc" being slow does not show up on GPU; there, the processing takes 0.15 s. Thus optimizing pyFAI's single-threaded CPU code, with the idea of parallelizing it over several CPU threads, is not ESRF's priority. To state it simply: you have GPUs, please use them.

That said, thanks a lot for the huge work on profiling and digging into the code. I still believe it makes sense to avoid the malloc of the "nbPix*4" array and recycle the buffer instead. This could potentially save 20 ms, to be validated. I opened a separate issue: #1718
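A minimal sketch of that buffer-recycling idea (preprocess_into is a hypothetical stand-in for pyFAI's preproc; the point is only that the (npix, 4) array is allocated once and reused):

import numpy as np

npix = 4 * 1024 * 1024                          # e.g. an Eiger 4M detector
buffer = np.empty((npix, 4), dtype=np.float32)  # signal, variance, norm, count

def preprocess_into(frame, out):
    # Toy stand-in for preproc: fill the reused buffer in place
    out[:, 0] = frame            # signal
    out[:, 1] = frame            # variance (Poisson assumption)
    out[:, 2] = 1.0              # normalization
    out[:, 3] = frame >= 0       # valid-pixel count

frames = (np.random.poisson(10, npix).astype(np.float32) for _ in range(10))
for frame in frames:
    preprocess_into(frame, out=buffer)   # no per-frame malloc of npix*4 floats
    # ... hand `buffer` to the integration engine here ...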

kif avatar Jul 11 '22 09:07 kif

A good compromise (for me) could be to add a public API for getting the CSR matrix out. Would that be possible? It would need some additions to the docs explaining how to use it (and this would help to clarify things for me too, so I would be happy to help with the writing).

I would like to use the CSR matrix for the bslz4decoders project, which is GPU based. Our code driving pyFAI for power9 is probably not optimal, but we did need to use more than one IO thread there too.

jonwright avatar Jul 11 '22 10:07 jonwright

What about this? https://github.com/silx-kit/pyFAI/blob/master/pyFAI/azimuthalIntegrator.py?plain=1#L447-L500

There are typos in the docstring, but any of those returned objects has a "csr" attribute, which allows you to define your own integrator.

kif avatar Jul 11 '22 15:07 kif

Looks good to me: ai.setup_CSR() returns an object that has (indptr, indices, data). Assuming this will look similar after #1586, it seems ideal. Thanks!
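For the record, a hedged sketch of driving scipy with that lookup table (the setup_CSR arguments and attribute names here are assumptions based on the discussion above and may differ between pyFAI versions):

import numpy as np
import scipy.sparse

engine = ai.setup_CSR(img.shape, 1000, unit="q_nm^-1")   # arguments assumed
csr = scipy.sparse.csr_matrix((engine.data, engine.indices, engine.indptr),
                              shape=(1000, img.size))

signal = csr.dot(img.ravel().astype(np.float32))          # one histogram pass
count = csr.dot(np.ones(img.size, dtype=np.float32))
profile = signal / np.maximum(count, 1e-10)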

Feel free to close this if there is nothing else specific to fix.

jonwright avatar Jul 11 '22 20:07 jonwright

I'll keep it open because several interesting benchmarks are referenced here. You convinced me that an optimized single-threaded version could make sense in specific use-cases, even though it is not ESRF's priority for the moment.

kif avatar Jul 12 '22 07:07 kif

Hi Jon, I spent a bit of time on this topic: we know OpenMP does not scale well, and I wonder whether the CSC approach plus multithreading could provide better performance than the parallel CSR (all Cython, plain C or C++; figures from my 8-core Epyc).

  • CSR integration: 17.1 ms
  • CSC integration: 22.8 ms
  • scipy sparse matrix multiplication: 5.46 ms, but once one accounts for the 5 histograms to perform (signal, variance, normalization, normalization squared and count) this is really 27.3 ms, not counting the tests related to dynamic masking

kif avatar Sep 08 '22 08:09 kif

I tried to parallelize with multithreading (ThreadPoolExecutor.map) and integrated a set of 1000 random images:

  • 1 thread: 23ms (consistent with timeit)
  • 2 threads: 12ms
  • 4 threads: 6ms
  • 8 threads: 4ms
  • 16 threads: 3ms (this is using hyperthreading)

It confirms your initial assumption that CSC is easier to parallelize over multiple frames and that it scales decently. I'll open a PR with what I did on this...
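A hedged sketch of that test (a scipy CSC matrix stands in for the Cython CSC engine; the real engine releases the GIL, which plain scipy may not, so the scaling figures above come from the Cython version):

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import scipy.sparse

nbins, npix = 1000, 1_000_000
csc = scipy.sparse.random(nbins, npix, density=1e-3, format="csc", dtype=np.float32)
frames = [np.random.random(npix).astype(np.float32) for _ in range(32)]

def integrate_one(frame):
    return csc.dot(frame)          # one single-threaded sparse product per frame

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(integrate_one, frames))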

kif avatar Sep 08 '22 08:09 kif

This looks great, did you already try it on a 40 or 64-core machine on the cluster? Thanks!

jonwright avatar Sep 08 '22 09:09 jonwright

Nope, I am working remotely, thus playing with what is available locally :)

kif avatar Sep 08 '22 12:09 kif

It does scale nicely for now, but it will hit the I/O bound of the CPU memory like all other algorithms.

kif avatar Sep 08 '22 12:09 kif

This looks great, did you already try it on a 40 or 64-core machine on the cluster? Thanks!

have a look at #1733

kif avatar Sep 08 '22 15:09 kif