
A new backend for Lomb Scargle Fourier Transform based on Low Rank Approximation

Open pupperemeritus opened this issue 6 months ago • 13 comments

The paper introducing the method

The method provides accuracy much closer to the slow method while maintaining the same time complexity (scaled up by a constant factor). I suggest we optionally parallelize the FFTs in the LRA using numba. This method is currently in development at astropy. It would be much less hassle to implement in Stingray.jl, since FastTransforms.jl already has an implementation, but that is not really my forte; someone else can take it up.

TLDR

I propose implementing a newer, high-performance algorithm for the Lomb-Scargle periodogram based on the Non-Uniform Fast Fourier Transform (NUFFT) via Low-Rank Approximation (LRA). This method offers significant advantages over the traditional fast method (Press & Rybicki) in terms of parallelism, accuracy control, and memory efficiency.

This method is slowly being adopted already in the FOSS community, with a reference implementation in development for astropy (astropy/astropy#17842).

The Method

The core idea is to reframe the non-uniform discrete Fourier transform problem. Instead of "spreading" non-uniform data onto a uniform grid (like Press & Rybicki), this method decomposes the non-uniform transform matrix itself into a sum of a small, fixed number (K) of diagonally-scaled, uniform DFTs.

Paper: A Nonuniform Fast Fourier Transform based on Low Rank Approximation

Key Advantages

  • Parallelism: The algorithm consists of K (typically ~16 for double precision) completely independent FFTs. This is a perfect fit for modern multi-core CPUs and can be parallelized trivially using tools like numba, potentially leading to a massive speedup over the mostly-sequential Press & Rybicki method.
  • Tunable and Guaranteed Accuracy: The accuracy is explicitly controlled by the parameter K, which is chosen based on a desired working precision ε. Unlike the spreading method, where accuracy is indirectly tied to the oversampling factor, this method provides a direct and a priori error bound. This means we can offer users a much more predictable trade-off between speed and precision. It is less costly than the oversampling in the Press & Rybicki method.
  • Memory Efficiency: This method avoids creating a large, oversampled grid for the FFT. It operates on vectors of the original data size N, making it more memory-friendly, especially for very large datasets where the oversampling grid of other methods might become a bottleneck.
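To make the structure concrete, here is a toy rank-K sketch in Python: each nonuniform point is split into its nearest uniform grid point plus a small perturbation, and the perturbation factor is expanded into K terms, each of which costs one uniform FFT. This sketch uses a Taylor expansion for simplicity; the paper's Chebyshev-based construction achieves a smaller K for the same accuracy, and all names here are illustrative, not Astropy or Stingray API.

```python
import math

import numpy as np


def nudft2_direct(c, x):
    """Direct type-2 NUDFT, f_j = sum_k c_k exp(-2i*pi*x_j*k); O(N*M)."""
    k = np.arange(len(c))
    return np.exp(-2j * np.pi * x[:, None] * k) @ c


def nudft2_lra(c, x, K=16):
    """Rank-K approximation of the type-2 NUDFT.

    Write x_j = s_j/N + e_j with s_j the nearest uniform grid index and
    |e_j| <= 1/(2N), then Taylor-expand exp(-2i*pi*e_j*k) to K terms.
    Each term needs exactly one uniform FFT, and the K FFTs are
    completely independent, hence trivially parallelizable.
    """
    N = len(c)
    s = np.round(x * N).astype(int) % N   # nearest uniform grid index
    e = x - np.round(x * N) / N           # perturbation from the grid
    k = np.arange(N, dtype=np.float64)    # float to avoid k**r overflow
    f = np.zeros(len(x), dtype=complex)
    for r in range(K):
        Fr = np.fft.fft(c * k**r)         # one uniform FFT per rank term
        f += ((-2j * np.pi * e) ** r / math.factorial(r)) * Fr[s]
    return f
```

With K around 16-20 the Taylor variant already reaches single-precision-level agreement with the direct transform; the Chebyshev expansion in the paper is what makes ~16 terms enough for double precision.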

Implementation Pointers

  • The K independent FFTs are an ideal use case for numba.njit(parallel=True). A prange loop could compute all K FFTs concurrently.
  • The Astropy PR has an implementation we can refer to in order to modify our lsft_fast in fourier.py.
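One caveat with the numba route: np.fft is not callable inside numba's nopython mode, so the K transforms cannot simply be wrapped in an njit prange loop. A portable alternative is a thread pool over the K FFTs, since NumPy's FFT releases the GIL for sizeable inputs. A minimal sketch (the helper name is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def parallel_ffts(blocks, max_workers=None):
    """Run the K independent FFTs of an LRA decomposition concurrently.

    NumPy's pocketfft releases the GIL during the transform, so plain
    threads give real parallelism here without needing numba.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(np.fft.fft, blocks))
```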

pupperemeritus avatar Jun 23 '25 14:06 pupperemeritus

Hi @pupperemeritus, it's very interesting. Isn't it a duplicate of the astropy implementation though? What would we add to it?

matteobachetti avatar Jun 24 '25 15:06 matteobachetti

Astropy hasn't completed their implementation yet, and I think it would help Stingray maintain feature parity. Since they are implementing it as a part of trig_sum, they might face greater friction in changing the API of trig_sum, and it may take longer to make it to production.

We could follow two routes here. The lazy one is to wait for them to push it and simply update our API to select the new backend in the trig_sum that we already use; this would just require adding a third option alongside the fast and slow versions.

The more involved but potentially more performant route would be to implement just the necessary parts ourselves. This would give Stingray the added advantage of control, using numba and fftw for acceleration as we already do for some parts of Stingray.

Either way, I think it is useful to add support for low-rank approximation. This method is supposed to produce output much closer to the slow algorithm than the Press & Rybicki method, at a small premium in runtime.

pupperemeritus avatar Jun 24 '25 15:06 pupperemeritus

@pupperemeritus I certainly like experimenting with new algorithms that promise to be a lot more accurate. For a matter of fairness, I will accept a new implementation only if it's significantly different from the one proposed in Astropy (not just a copy-paste of someone else's work, even if the license allows it). I am also open to modifying some of the functions in the Astropy implementation by numba-compiling them or using fftw, as long as we credit the original authors properly.

matteobachetti avatar Jun 26 '25 08:06 matteobachetti

@pupperemeritus another route might be asking the original author to contribute their implementation to Stingray, and improve on their code through the various optimizations you propose.

matteobachetti avatar Jun 26 '25 08:06 matteobachetti

Yes, we would definitely credit the original author of the code, @kkrutkowski, and the authors of the paper, Diego Ruiz-Antolín and Alex Townsend. The lazy route would be to update the way we use trig_sum right now. It would not be a new implementation, just an adaptation on our side exposing the options that will in future be added to astropy. For that we would have to wait for the original author to finish his implementation. It would be a much smaller change that way: we would just change the options, add an option for the same kwarg, and refactor accordingly to accommodate the new method.

Alternatively, @kkrutkowski could implement it separately here, in case it happens to be faster for him to slot it into Stingray's existing infrastructure by simply creating an lsft_fast_lra method, and if the use of fftw and numba njit provides him any advantages.

On a useful tangent: I am exploring whether njit and fftw make a tangible difference to the existing lsft_fast and lsft_slow implementations. The results might also give some idea of whether the same optimizations would benefit the LRA method as well. I will open another issue with the results once I'm done.

A quick TLDR for kkrutkowski

We would like to hear your thoughts on how to proceed with this, and to explore which way you would prefer to add the implementation of the LRA NUDFT method to Stingray.

For a bird's-eye view of the current extent of the Lomb-Scargle implementation in Stingray: the only files relevant to Lomb-Scargle are stingray/lombscargle.py, stingray/fourier.py, tests/test_lombscargle.py, docs/notebooks/LombScargleCrossspectrum.ipynb, and docs/notebooks/LombScarglePowerspectrum.ipynb. The Lomb-Scargle code contains the lsft_slow and lsft_fast methods. We chose to use trig_sum from astropy.timeseries.periodograms.lombscargle.implementations.utils.

We have two options for adding the method to Stingray:

  1. As you are currently implementing the LRA method inside trig_sum itself, we could wait for you to finish your PR in astropy and update the option as is. This would be the least amount of work required for both you and us.
  2. Stingray's infrastructure supports numba and pyfftw as optional dependencies and has methods in stingray/utils.py that already handle the optionality. If you find any possible performance advantage from them, we could explore this route.

pupperemeritus avatar Jun 26 '25 15:06 pupperemeritus

A quick TLDR for kkrutkowski

  1. As you are currently implementing the LRA method inside trig_sum itself, we could wait for you to finish your PR in astropy and update the option as is. This would be the least amount of work required for both you and us.
  2. Stingray's infrastructure supports numba and pyfftw as optional dependencies and has methods in stingray/utils.py that already handle the optionality. If you find any possible performance advantage from them, we could explore this route.

Thanks for the tag. Currently I'm a little bit busy, but I'll gladly try to implement it once I find some time (probably within a few days to weeks in the worst case).

Both numba and pyfftw should make the code much faster (and by much I mean probably at least an order of magnitude if implemented correctly), as in my AstroPy PR spreading is the main bottleneck. Also, pyFFTW supports multithreaded FFTs, which would give a significant performance advantage in that case. I also have the same algorithm implemented in Julia as a point of reference, although it's equivalent to AstroPy's fastchi2 implementation and not just the NuFFT.

I also think waiting for the AstroPy implementation makes little sense, as (if I recall correctly) the astropy/timeseries module is likely not actively maintained. My AstroPy PR will likely go stale and won't get merged within the next few months at least (probably until timeseries gets a new maintainer, which may take up to a few years), so waiting isn't really viable here.

Also, I'm not too sure how non-sinusoidal X-ray emission typically is, but generally for variable stars (except the semiregular ones) the FT-based techniques are not too accurate (with FastChi2 being the only reasonable FT-based choice), and as such I would consider some alternative choices - I've recently been favoring an approach based on Gaussian blur, but that may largely be a result of my personal preferences.

As for the accuracy, it simply doesn't matter much when considering only a single frequency, so I think that as long as there is no method equivalent to AstroPy's 'FastChi2' (or another unstable variant) available, it may not be worth actually implementing.

I'll likely follow with a more thorough review (and possibly some recommended changes) in a few days, but for now the project's code structure is totally unfamiliar to me and as such I'll need some time to find which way could work the best.

EDIT: Also, I think that if we implement the LRA-based NuFFT, adding a Cholesky-decomposition-based implementation of AoV/FastChi2 (akin to the Julia code I've linked) would likely be the most reasonable idea - imho the instability alone isn't enough of a reason to justify adding a more accurate algorithm that would likely end up indistinguishable from the current 'fast' one in ~99% of cases.

kkrutkowski avatar Jun 26 '25 20:06 kkrutkowski

My observations on the performance of the existing Lomb-Scargle Fourier transform functions and the way they play with numba njit.

I tried to see whether jitting the existing lsft_fast and lsft_slow functions even works. lsft_fast doesn't jit because astropy's trig_sum is untyped.

Errors encountered when jitting lsft_fast

    numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
    Untyped global name 'trig_sum': Cannot determine Numba type of

    File "../miniconda3/lib/python3.11/site-packages/stingray/fourier.py", line 2547:
    def lsft_fast(
        # Sum (y_i * cos(wt - wtau))
        Sh, Ch = trig_sum(t, y_, df, N, f0, oversampling=oversampling)
        ^

Evaluation of lsft_slow

lsft_slow works with both parallel=True and fastmath=True after a simple change that doesn't break anything as far as the tests go: replacing the line

    freqs = np.asanyarray(freqs[np.asanyarray(freqs) >= 0])

with

    freqs = freqs[freqs >= 0]
| Compilation Mode | Time (s) | Peak Memory |
| --- | --- | --- |
| No JIT | 117.23 | 4.88 MB |
| JIT only | 88.23 | 10.92 MB |
| JIT + fastmath=True | 31.39 | 22.03 MB |
| JIT + parallel=True | 30.20 | 23.59 MB |
| JIT + fastmath + parallel | 30.03 | 25.19 MB |

Script used to get the above results (with minor function name edits) is here
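For reference, a minimal sketch of the jit-friendly shape of such a direct loop (a hypothetical lsft_slow_core, not Stingray's actual function; it degrades gracefully to plain NumPy when numba is absent):

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # graceful fallback when numba is not installed
    prange = range

    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func


@njit(fastmath=True, parallel=True)
def lsft_slow_core(y, t, freqs):
    # direct O(N_t * N_f) Lomb-Scargle Fourier transform; each frequency
    # is independent, so the outer loop parallelizes with prange
    ft = np.zeros(len(freqs), dtype=np.complex128)
    for i in prange(len(freqs)):
        w = 2.0 * np.pi * freqs[i]
        # the classic time shift tau makes the sine/cosine sums orthogonal
        tau = np.arctan2(np.sum(np.sin(2.0 * w * t)),
                         np.sum(np.cos(2.0 * w * t))) / (2.0 * w)
        csum = np.sum(y * np.cos(w * (t - tau)))
        ssum = np.sum(y * np.sin(w * (t - tau)))
        ft[i] = complex(csum, ssum)
    return ft
```

The key point is that the body uses only scalar math and NumPy reductions that numba supports, so the same function runs jitted or plain.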

Evaluation of lsft_fast

  • As far as lsft_fast goes, it's currently non-jittable, owing to our inability to make changes to astropy's astropy.timeseries.periodograms.lombscargle.implementations.utils. I haven't yet fully untangled the sequence of changes required for lsft_fast to be jittable.
  • A middle-ground fix is to get around this by monkey patching trig_sum and removing the numba-offending pythonic operations in lsft_fast, but I reckon we'd be monkey patching so much that it's almost like rewriting the whole thing. I've partially tried, by manually assigning a signature to trig_sum and making the necessary changes before the first trig_sum call (rewriting the deepcopy statement). It falls flat on its face at trig_sum due to the bad traceability of types from lsft_fast into the fft call inside trig_sum. I cannot think of a way to solve this without changing trig_sum itself.
  • As far as my intuition goes, the fast method's major computations mostly revolve around the FFTs, so we might get nearly the same performance by only replacing the plain np.fft with Stingray's shimmed fft. I'm not sure that even if we reimplement it we'll be able to jit it, and there's no big advantage to doing so either, as we barely have any big loops outside the FFTs to jit-compile. I suggest the fft shimming is the maximum reasonable optimization, beyond which any gains are most likely negligible.
  • If the requirement and urgency are there, it might be a better idea to make a fresh implementation of a trig_sum function, independent from astropy, that utilizes the pyfftw shim and numba jit shim already provided in stingray.utils.
  • So our own implementation of trig_sum might be inevitable to prevent monkeypatching hell and to remove anything keeping the Lomb-Scargle section of Stingray from further extension and optimization. If astropy.timeseries isn't actively maintained, this makes even more sense.

pupperemeritus avatar Jul 03 '25 20:07 pupperemeritus

  • As far as my intuition goes, the fast method's major computations mostly revolve around the FFTs, so we might get nearly the same performance by only replacing the plain np.fft with Stingray's shimmed fft. I'm not sure that even if we reimplement it we'll be able to jit it, and there's no big advantage to doing so either, as we barely have any big loops outside the FFTs to jit-compile. I suggest the fft shimming is the maximum reasonable optimization, beyond which any gains are most likely negligible.
  • If the requirement and urgency are there, it might be a better idea to make a fresh implementation of a trig_sum function, independent from astropy, that utilizes the pyfftw shim and numba jit shim already provided in stingray.utils.
  • So our own implementation of trig_sum might be inevitable to prevent monkeypatching hell and to remove anything keeping the Lomb-Scargle section of Stingray from further extension and optimization. If astropy.timeseries isn't actively maintained, this makes even more sense.

I think reimplementing trig_sum would be the best idea here, imho. I'm also not sure which variant of the periodogram (Lomb/Scargle/Generalized Lomb/Generalized Scargle) you would like to set as the default - in my opinion the Generalized Lomb would likely be the best choice, however it still depends on many factors.

Also, in the 'fast' implementation the FFT/spreading can be divided into many smaller blocks (which could use concurrent.futures from the Python standard library) - that way the jit would give significantly more speed, but it would also require slightly more significant logic changes.

As for the performance, you can pretty easily replace the trig functions (or at least most of them) with complex multiplication, which gives a performance increase significantly better than either parallel or fastmath jitting, with no logic changes, if the frequency grid is uniform (and I see no reason for it not to be).
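The trick can be sketched like this for a uniform grid f_i = f0 + i*df: compute the phasor at f0 once, then advance every phase by a single complex multiplication per frequency step instead of fresh sin/cos calls (illustrative sketch, not Astropy's or Stingray's actual code):

```python
import numpy as np


def trig_sums_recursive(t, y, f0, df, nf):
    """Accumulate C_i = sum_j y_j cos(2*pi*f_i*t_j) and the matching
    sine sums on the uniform grid f_i = f0 + i*df, using one complex
    rotation per frequency step instead of new trig evaluations."""
    z = np.exp(2j * np.pi * f0 * t)      # phasor at the first frequency
    step = np.exp(2j * np.pi * df * t)   # per-step rotation, computed once
    C = np.empty(nf)
    S = np.empty(nf)
    for i in range(nf):
        C[i] = np.sum(y * z.real)        # z.real == cos(2*pi*f_i*t)
        S[i] = np.sum(y * z.imag)        # z.imag == sin(2*pi*f_i*t)
        z *= step                        # advance all phases by df
    return C, S
```

The rounding error of the repeated multiplication grows slowly with the number of frequencies, so for very long grids one would restart the recursion from a fresh exp every so often.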

Below are the results of my roughly modified suggestion of a modified Cython implementation of the LS for AstroPy - sadly, in your script both the Parallel and FastMath variants keep segfaulting, so I only have the basic JIT (with fastmath=False and parallel=False) to compare against. Here are the results:

| Version | Time | Peak memory |
| --- | --- | --- |
| No JIT | 239.53 s | 4.88 MB |
| JIT only | 131.10 s | 11.96 MB |
| JIT + recursion | 22.90 s | 12.10 MB |

ls_bench.py.txt

I guess the JIT + fastmath + parallel + recursion would get significantly faster, than that (should get sub-10 seconds on my machine), but implementing it would take significantly longer, so I'll leave it like that for now.

kkrutkowski avatar Jul 05 '25 02:07 kkrutkowski

As for the performance, you can pretty easily replace the trig functions (or at least most of them) with complex multiplication, which gives a performance increase significantly better than either parallel or fastmath jitting, with no logic changes, if the frequency grid is uniform (and I see no reason for it not to be).

_autofrequency returns regular frequency grids, so I would say that's a fair assumption.

Also, in the 'fast' implementation the FFT/spreading can be divided into many smaller blocks (which could use concurrent.futures from the Python standard library) - that way the jit would give significantly more speed, but it would also require slightly more significant logic changes.

I am kind of leaning towards just using the shims stingray already has for pyfftw that falls back onto numpy's fft. I'm not sure if dividing the FFT into smaller blocks and using concurrent.futures with jit, fastmath and parallel options of numba will beat pyfftw. On the other hand, the spreading part might provide a decent jump in performance. It might be possible but it will take a gargantuan amount of refactoring and experimentation.

Below are the results of my roughly modified suggestion of a modified Cython implementation of the LS for AstroPy - sadly, in your script both the Parallel and FastMath variants keep segfaulting, so I only have the basic JIT (with fastmath=False and parallel=False) to compare against. Here are the results:

Any idea why it is segfaulting? It's kind of funky: jit+fastmath+parallel on the lsft_rec is working without a hitch on my machine, but it looks like it's slower than the vanilla, non-jitted slow implementation. Fastmath alone is giving the best result.

For N = 10000

| Mode | Time (s) | Current Memory (MB) | Peak Memory (MB) |
| --- | --- | --- | --- |
| rec_jit | 8.44 | 9.885 | 11.886 |
| rec_jit_f | 7.86 | 7.746 | 9.747 |
| rec_jit_fp | 125.06 | 24.994 | 26.994 |
| rec_jit_p | 119.26 | 26.130 | 28.131 |
| slow_jit_f | 35.21 | 26.189 | 27.070 |
| slow_jit_p | 30.26 | 21.057 | 21.938 |
| slow_jit | 87.09 | 6.056 | 7.737 |
| slow_jit_fp | 31.03 | 21.252 | 22.133 |
| slow | 118.10 | 1.600 | 4.881 |
| fast | 0.0474 | 1.603 | 27.502 |

I've tried to see if it's possible to monkey patch astropy into using pyfftw instead of np.fft. It is possible but quite dirty, as it affects every call of numpy.fft, which may or may not cause hard-to-catch bugs in the future. Perhaps we can move the import statements into lsft_* to isolate the replacement to that part only. As of now I've tested these changes with both the quick and slow tests of Stingray, and everything passes.

import numpy as np
import numpy.typing as npt

from stingray.utils import HAS_PYFFTW

if HAS_PYFFTW:
    import pyfftw
    from pyfftw.interfaces import numpy_fft as _fftw
    import numpy.fft as _npfft

    # cache FFTW plans so repeated transforms of the same shape are cheap
    pyfftw.interfaces.cache.enable()
    # monkey-patch numpy.fft module-wide: every np.fft.* call from here on
    # (including the ones inside astropy's trig_sum) goes through pyfftw
    _npfft.fft = _fftw.fft
    _npfft.ifft = _fftw.ifft
    _npfft.fftn = _fftw.fftn
    _npfft.ifftn = _fftw.ifftn
    _npfft.rfft = _fftw.rfft
    _npfft.irfft = _fftw.irfft

from astropy.table import Table
from astropy.timeseries.periodograms.lombscargle.implementations.utils import trig_sum

from .gti import (
    generate_indices_of_segment_boundaries_binned,
    generate_indices_of_segment_boundaries_unbinned,
)

| N (Data Points) | PyFFTW | Time Taken (s) | Current Memory (MB) | Peak Memory (MB) |
| --- | --- | --- | --- | --- |
| 10,000 | Monkey-patched | 0.0530 | 35.165 | 57.065 |
| 10,000 | Stock | 0.0447 | 1.603 | 27.502 |
| 100,000 | Monkey-patched | 1.0320 | 16.007 | 738.301 |
| 100,000 | Stock | 0.8333 | 16.003 | 375.641 |
| 1,000,000 | Monkey-patched | 15.9147 | 160.007 | 6040.721 |
| 1,000,000 | Stock | 27.2938 | 160.003 | 3219.490 |

As we can see, pyfftw eats a lot of memory, but it makes up for it with performance on larger datasets. On smaller datasets it's not as fast, due to the overheads of planning and preallocation. If we move the imports into the function itself, we can heuristically switch between numpy and pyfftw based on the performance crossover point.
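Such a heuristic dispatch could look like the sketch below; the crossover size is a placeholder that would need calibrating against benchmarks like the table above, and fft_auto is a hypothetical name:

```python
import numpy as np

try:
    import pyfftw
    from pyfftw.interfaces import numpy_fft as _fftw

    pyfftw.interfaces.cache.enable()  # reuse FFTW plans between calls
    HAS_PYFFTW = True
except ImportError:
    HAS_PYFFTW = False

# placeholder crossover: below this size numpy's planning-free FFT tends
# to win, per the benchmark table above
_PYFFTW_MIN_N = 100_000


def fft_auto(a):
    """Dispatch to pyfftw only when the transform is large enough for
    the planning overhead to pay off; otherwise fall back to numpy."""
    if HAS_PYFFTW and a.shape[-1] >= _PYFFTW_MIN_N:
        return _fftw.fft(a)
    return np.fft.fft(a)
```

Keeping the switch inside one wrapper avoids the global monkey patch entirely: only the lsft_* call sites would use fft_auto.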

pupperemeritus avatar Jul 05 '25 12:07 pupperemeritus

Thanks for the discussion, I'm learning some new concepts here. One nitpick and one request for clarification:

As for the performance, you can pretty easily replace the trig functions (or at least most of them) with complex multiplication, which gives a performance increase significantly better than either parallel or fastmath jitting, with no logic changes, if the frequency grid is uniform (and I see no reason for it not to be).

_autofrequency returns regular frequency grids, so I would say that's a fair assumption.

This is the default behavior, but if the user asks, we can also calculate the quantities on non-regular grids. We should think about a warning system that reverts to the old method if this is not the case.

Also, a question for @kkrutkowski: Stingray's implementation has not just the LS periodogram, which many other codes have. We have the Fourier transform and the cross spectrum of unevenly sampled data based on Jeff Scargle's original code, which we have sped up using the Press & Rybicki method. Is there anything in your code that assumes at some point that a periodogram is being calculated? Or is it also valid for this more general case?

Also: our code currently has a problem with measuring time delays with the cross spectrum, there is something wrong in the phases. I would be curious if this effect goes away with the increase of precision that the LRA method provides.

matteobachetti avatar Jul 05 '25 16:07 matteobachetti

Also, a question for @kkrutkowski: Stingray's implementation has not just the LS periodogram, which many other codes have. We have the Fourier transform and the cross spectrum of unevenly sampled data based on Jeff Scargle's original code, which we have sped up using the Press & Rybicki method. Is there anything in your code that assumes at some point that a periodogram is being calculated? Or is it also valid for this more general case?

Generally it reuses weights array for multiple 'y' arrays (three in the case of the basic Generalized Lomb), making their computation significantly faster.

It definitely can be done for the standard NDFT - and I think it should be done. The most reasonable way would be to properly define a function used to compute/approximate the NDFT of type 1 and call it in all cases where the grid is uniform (whether the slow or fast implementation is selected), mostly selecting/fixing AstroPy's trig_sum utility (which has a lot of issues imho).

There are many things that can be improved - the most obvious 'fix' is changing the default type of 'y' from a 1D vector into a vector of vectors (NumPy's ndarray structure would likely be the most reasonable one, although there may be an alternative with a potentially better memory layout I'm not aware of) - this could significantly speed up the 'recursive' vectorized implementation and the Low Rank Approximation in many cases.

AstroPy's 'fast' LS uses a C2C FFT, while it can easily be modified to use the R2C transform, saving nearly half of the allocated memory and computation time.
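The saving comes from the Hermitian symmetry of a real signal's spectrum: the upper half of the C2C output duplicates the conjugate of the lower half, so the R2C transform returns only N//2 + 1 bins. A quick demonstration:

```python
import numpy as np

# For real input, X[N - m] == conj(X[m]); rfft exploits this and keeps
# only the first N//2 + 1 bins at roughly half the memory and time.
x = np.random.default_rng(6).standard_normal(1024)
full = np.fft.fft(x)    # 1024 complex bins (C2C)
half = np.fft.rfft(x)   # 513 complex bins (R2C)
assert np.allclose(half, full[:513])
assert np.allclose(full[513:], np.conj(full[1:512][::-1]))
```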

Also, I think the sinc² convolution in the time domain I've linked in my first answer here could be added as well - it's likely an algorithm not yet published in the literature, but generally it's both significantly faster and more stable (although less accurate) than the Press-Rybicki approach.

As for the execution time/speed, the speed-ups from recursion/parallelization are still nearly an order of magnitude lower than what I've gotten while writing my C/Julia codes in the past - I guess Numba's autovectorizer gets lost because of the branching, which would require rewriting the more Cython-friendly 'if' structure into a more deeply nested loop using Numba's prange feature explicitly - I'll have to try implementing it later.

I may try implementing a trig_sum/ndft1/nudft1 utility for Stingray some time in the future (however, the codebase is still mostly black magic to me - I'll have to spend significantly more time familiarizing myself with both the codebase I'll work with and the way the Numba compiler/optimizer works for it to make any sense).

kkrutkowski avatar Jul 05 '25 17:07 kkrutkowski

@kkrutkowski thanks! To reduce the black magic feeling, I think that if you open a PR with your current best implementation as a bunch of functions in fourier.py (where we have most Fourier calculations), @pupperemeritus and I can help you adjust the inputs and output to something that Stingray can use, and then we can build from there in subsequent PRs to modify or create the relevant Stingray classes. What do you think?

matteobachetti avatar Jul 05 '25 19:07 matteobachetti

Also: our code currently has a problem with measuring time delays with the cross spectrum, there is something wrong in the phases. I would be curious if this effect goes away with the increase of precision that the LRA method provides.

Is the cross-spectrum problem there in both the fast and the slow versions? From the wording, I understand it's a Press & Rybicki-specific problem. Perhaps it deserves its own issue? I would love to take a look at the details to see whether it's a mistake in the implementation or a quirk of the method(s) itself.

pupperemeritus avatar Jul 05 '25 19:07 pupperemeritus