Large potential speedup of `pycbc_multi_inspiral`
I am reporting some observations about the run times of pycbc_multi_inspiral jobs in recent PyGRB workflows submitted to the OSG.
First, the run time can vary by orders of magnitude depending on the injections: in one workflow, for example, it ranged from ~20 min to ~2 days. Jobs without injections have much more consistent run times, although their distribution also has a tail of long-running jobs.
Second, when the number of templates and/or sky grid points becomes nontrivial, the run time scales linearly with them; e.g. with the 2.8.2 code version I get this for the user time:
(the different colored curves are for 1, 3, 10, 30 and 100 sky grid points respectively). The run time is also somewhat linearly related to the number of triggers produced in the end:
Consistent with this plot, the longest-running jobs in the workflows appear to be associated with the loudest injections. So I took one of the longest-running jobs (which has several hundred sky grid points) and did some profiling:
3/4 of the time is being spent computing the power chi^2, and 14% is spent appending triggers to a NumPy array. The cost of the chi^2 can be understood: it is recalculated at every slide/sky combination, even though it does not depend on either. If I instead precalculate the chi^2 once before the loop over slides/sky, the same job completes in about 1/4 of the time, as expected. The profiling now reveals a more complex picture:
The next thing to attack is the accumulation of the triggers, which is now taking half of the time (!). I would try preallocating a large array of triggers instead of growing one continuously.
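A minimal, runnable sketch of the two ideas above, using made-up variable names and thresholds rather than the actual `pycbc_multi_inspiral` ones:

```python
import numpy as np

# Made-up names and thresholds, just to illustrate the pattern (not the actual
# pycbc_multi_inspiral code):
#   (1) the chi^2 is computed once, outside the loop over slides and sky
#       positions, because it does not change with them;
#   (2) triggers are written into preallocated arrays instead of being
#       appended to a continuously growing array.
n_slides, n_sky, n_samples = 10, 100, 4096

# (1) hoisted out of the loop: computed once per template
chisq = np.random.rand(n_samples) * 3.0

# (2) preallocate generously, track the fill level, trim at the end
max_triggers = n_slides * n_sky * n_samples
trig_sample = np.empty(max_triggers, dtype=np.int64)
trig_snr = np.empty(max_triggers, dtype=np.float64)
n_found = 0

for slide in range(n_slides):
    for sky in range(n_sky):
        # stands in for the per-slide / per-sky-position SNR time series
        snr = np.abs(np.random.randn(n_samples)) + 3.0
        keep = (snr > 5.0) & (chisq < 2.0)       # reuse the precomputed chi^2
        idx = np.flatnonzero(keep)
        k = len(idx)
        trig_sample[n_found:n_found + k] = idx   # no np.append / np.concatenate
        trig_snr[n_found:n_found + k] = snr[idx]
        n_found += k

trig_sample = trig_sample[:n_found]
trig_snr = trig_snr[:n_found]
print(f"collected {n_found} triggers")
```

Preallocating with a generous bound and trimming at the end trades some memory for avoiding the repeated reallocations and copies that come with growing an array.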
I'd enjoy digging into this, so if you have a reproducible example and don't have others already looking at this, I'd be interested to take a look!
Posting this here to keep things in one place. Here's what happens when I do as @titodalcanton suggests and preallocate a large array of triggers. The next target would be the division operation, which seems to be in the main executable (I have a hunch I know what this is, and removing it entirely will be easy if the hunch is right).
A request though, @titodalcanton: could you provide a short but fully representative example (i.e. one template is not enough, because we need things like clustering to run a few times) with its output, so that I can check that nothing done to optimise breaks anything?
I was right about the divides: it's the bank-chisq and auto-chisq computation. Even though these are both disabled, there is a divide in the inputs to the functions. @titodalcanton These should be moved similarly to what you already did for the chi-squared (but for now I'll just avoid the divide so these can "no-op" quickly).
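To spell out the kind of "no-op quickly" guard meant here (the names are made up; this is just the pattern, not the actual code):

```python
import numpy as np

# Only perform the division that feeds the bank/auto chi^2 inputs when at
# least one of those vetoes is actually enabled, so that disabled vetoes
# cost essentially nothing.
do_bank_chisq = False
do_auto_chisq = False

snr = np.random.randn(2**20) + 1j * np.random.randn(2**20)
norm = 3.7

if do_bank_chisq or do_auto_chisq:
    # this divide is only ever needed as an input to the chi^2 functions
    normalized_snr = snr / norm
else:
    normalized_snr = None  # never computed when both vetoes are off
```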
Here's a profile with #5153, not using `chisq-snr-threshold`:
One more option for optimization: the PyCBC chi-squared calculation is optimized for a small number of points and is not sensitive to the number of bins. The old coh_PTF computed the chi-squared using one FFT per chisq bin; that approach is not sensitive to the number of points, is sensitive to the number of bins, and is much slower in the small-number-of-points limit. If we choose to "do chisq like coh_PTF", meaning a fixed number of chisq bins (16) and PyCBC's function for computing chisq the old way (which needed to be ported to the class-based interface to be fast), we get this profile:
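To make the tradeoff concrete, here is a back-of-the-envelope cost model; the formulas are my own illustrative assumptions (point-based cost ~ number of points × segment length, FFT-based cost ~ one FFT of the segment per bin), not measurements of PyCBC's actual implementations:

```python
import numpy as np

def pointwise_chisq_cost(n_points, n_bins, seg_len):
    # each point needs partial filter sums over all frequency samples,
    # so cost ~ n_points * seg_len, largely insensitive to n_bins
    return n_points * seg_len

def fft_chisq_cost(n_points, n_bins, seg_len):
    # one FFT per bin, ~ seg_len * log2(seg_len) each, insensitive to n_points
    return n_bins * seg_len * np.log2(seg_len)

seg_len, n_bins = 2**21, 16
for n_points in (10, 100, 1_000, 10_000, 100_000):
    pw = pointwise_chisq_cost(n_points, n_bins, seg_len)
    ft = fft_chisq_cost(n_points, n_bins, seg_len)
    winner = "point-based" if pw < ft else "FFT-based"
    print(f"{n_points:>7} points: {winner} cheaper (point/FFT ratio {pw/ft:.2f})")
```

Under these assumptions the crossover sits around n_points ≈ n_bins × log2(seg_len), i.e. a few hundred points per segment, which would be consistent with the FFT-based approach winning on the loud-injection jobs discussed above.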
With these cumulative changes the job should complete in less than 90 minutes ... but there still seems to be a lot of scope for improvement: I think coh_PTF was FFT-limited, whereas this still spends only ~7% of its time doing FFTs.
After #5157 and #5158, this is what I get for the same job:
Note that the chi^2 is still not negligible, so @spxiwh's remark about point-based vs FFT-based chi^2 still holds. `numpy.setdiff1d()` also shows up in a non-negligible way, so maybe there is a way to do things faster than what I did in #5158.
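For what it's worth, if the index arrays involved are nonnegative integers with a known upper bound (an assumption on my part, not something checked against the code), a boolean lookup table avoids the sorting that `numpy.setdiff1d()` does internally:

```python
import numpy as np

def fast_setdiff(a, b, n_max):
    """Elements of `a` not present in `b`; assumes values in [0, n_max)."""
    keep = np.ones(n_max, dtype=bool)
    keep[b] = False
    return a[keep[a]]  # preserves the order (and duplicates) of `a`

rng = np.random.default_rng(0)
a = rng.integers(0, 1_000_000, size=200_000)
b = rng.integers(0, 1_000_000, size=50_000)
assert set(fast_setdiff(a, b, 1_000_000)) == set(np.setdiff1d(a, b))
```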
As evident from above, though, the next target would be #5154.
@spxiwh here are some notes on a few test cases that should help you test #5154. Look here at CIT:
~tito.canton/projects/pycbc/pygrb/multi_inspiral_test_cases
Each numbered subdirectory is a test case based on jobs I cherry-picked from a couple of recent workflows (big sky grid, single sky point, with injections, without injections, HL and H-only). The `run.sh` script within each directory runs the test in a self-contained way and saves the following to the current working directory:
- The output HDF file.
- A file `time.txt` with the results of `/usr/bin/time --verbose` (user time, system time, max RSS, etc).
- A profiling output (one way to inspect such a profile is sketched after this list).
- A copy of the stderr of the job.
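For completeness, one generic way to produce and inspect such a profile with the standard library (the file names here are placeholders, and the `run.sh` scripts may do this differently):

```python
# Produce the profile when launching the job, e.g.:
#   python -m cProfile -o profile.out $(which pycbc_multi_inspiral) <arguments...>
# Then inspect it:
import pstats

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time
```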
You may be aware of this already, but note that the output of `pycbc_multi_inspiral` is not deterministic: if you run the same test twice, identically and on the same machine, you often get slightly different results, including slightly different numbers of triggers. I found this is partly due to non-determinism in the PSD estimation, which I suppose ultimately comes from the FFTs, though I have not investigated it more deeply. One way to improve this is to dump the PSD estimates once and then read them back with `--psd-file`, instead of redoing the PSD estimation each time. I found that this at least makes the list of triggers stable, and often leads to bit-identical HDF files. I have not implemented this for the test cases above, however.
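A minimal sketch of the PSD-caching idea, assuming `pycbc.psd.welch`, `FrequencySeries.save` and `load_frequencyseries` work as I remember (how the cached file then gets passed back to `pycbc_multi_inspiral` would follow that executable's own `--psd-file`-style options):

```python
import numpy as np
from pycbc.types import TimeSeries, load_frequencyseries
from pycbc.psd import welch

# Fake strain, just so the sketch runs standalone.
sample_rate = 4096
strain = TimeSeries(np.random.normal(size=256 * sample_rate),
                    delta_t=1.0 / sample_rate)

# Estimate the PSD once and save it to disk ...
psd = welch(strain, seg_len=8 * sample_rate, seg_stride=4 * sample_rate)
psd.save('psd_cache.txt')

# ... then later runs read it back instead of re-estimating it, which should
# keep the trigger list stable across reruns.
psd_cached = load_frequencyseries('psd_cache.txt')
```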
After running a couple of test PyGRB workflows, I am fairly confident we can make a new 2.8 release with the optimizations currently on master and finish the analyses with that. However, I am noting here what profiling shows as the dominant costs with the present code* for each test case, as I think there is still scope for further significant optimization.
- `finalize_template_events`, `get_coinc_indexes`, chisq values (flowchart)
- `get_coinc_indexes` (flowchart)
- N/A
- N/A
- N/A
- N/A
- `finalize_template_events`, chisq values, `get_coinc_indexes` (flowchart)
- `finalize_template_events`, chisq values (flowchart)
\* Not exactly current master, but close enough… I should probably redo better tests once 2.8.4 is out.
@titodalcanton I'm slowly catching up after the summer, but to quickly respond to one of the things above:
PyCBC does support deterministic FFTs, but one has to use FFTW and do the lowest-level planning. I'm not sure whether the PSD estimation is done before or after setting up the ContextManager in `pycbc_multi_inspiral`, though ...
Looking at `get_coinc_indexes` (indices??), I think I can get a significant speedup by porting it to Cython, but it would need some specialised functions (e.g. I have two detectors and need to find 2-detector coincidences, or I have three detectors and need 2-detector coincidences ...). The old code could remain as a fallback.
Some questions, though:
- Is the input list of indices always sorted, for each ifo? I expect it would be, and since I think the Cython algorithm would be O(N), an O(N log N) sort would be a problem!
- Would the list of indices be unique, for each detector? I don't think the logic makes sense unless it is ... but then I'm confused why the 1-detector case calls `np.unique` rather than just treating this function as the no-op I think it is.
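For reference, here is the kind of O(N) logic I have in mind for the two-detector case, written in plain Python/NumPy rather than Cython, and assuming the per-detector index arrays are indeed sorted and unique (which is exactly what the questions above are about):

```python
import numpy as np

def coinc_indices_2det(idx_a, idx_b):
    """Indices present in both detectors' trigger lists (sorted, unique inputs)."""
    out = []
    i = j = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            out.append(idx_a[i])
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return np.asarray(out, dtype=np.int64)

a = np.array([2, 5, 9, 12, 40])
b = np.array([1, 5, 12, 13, 40, 41])
print(coinc_indices_2det(a, b))   # [ 5 12 40]
assert np.array_equal(coinc_indices_2det(a, b), np.intersect1d(a, b))
```

The single merge pass is linear in the number of triggers, but it only pays off if the inputs are already sorted and unique, hence the questions above.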